Architecture Overview
The Indexer pipeline turns every finalized Parquet DataFile produced by Storage into a compact set of index artifacts, publishes those artifacts to object storage through OpenDAL, and registers them in Postgres so Search can discover and use them. It activates when Storage finishes a Parquet file and writes a datafile row; from there, the Indexer schedules a per-file job, builds the configured artifacts, uploads them, and records the mapping from file ULID → artifact set. This keeps indexing off the hot ingest path while making new data queryable shortly after upload.
Position in the system
- Upstream: Storage receives Arrow batches, finalizes a stable Parquet file, uploads it via OpenDAL, and writes authoritative file metadata to Postgres (cloud URI, time range, size, schema hash). That metadata row is the unit of work for the Indexer and acts as the trigger.
- Indexer trigger: After upload and registration, Storage notifies the Indexer (control-plane or internal message) and provides the Parquet location and time bounds. The Indexer then enqueues a per-file job so only one worker processes a file at a time.
- Downstream: Once artifacts are uploaded and recorded, Search consults Postgres to find the Parquet file and its artifacts for the query time window, downloads only what it needs, and prunes before scanning.
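To make the downstream step concrete, here is a minimal sketch of how Search might discover artifacts for a query window through Postgres, assuming a Rust client using sqlx with the postgres and chrono features. The table and column names (data_file, index_artifacts, min_ts, max_ts) are the ones assumed throughout this document, not a confirmed schema.

```rust
use chrono::{DateTime, Utc};
use sqlx::PgPool;

/// Return (artifact_type, artifact_uri) pairs for every file whose time range
/// overlaps the query window [from, to].
async fn artifacts_for_window(
    db: &PgPool,
    from: DateTime<Utc>,
    to: DateTime<Utc>,
) -> sqlx::Result<Vec<(String, String)>> {
    sqlx::query_as(
        "SELECT a.artifact_type, a.artifact_uri \
         FROM data_file f \
         JOIN index_artifacts a ON a.file_ulid = f.file_ulid \
         WHERE f.max_ts >= $1 AND f.min_ts <= $2",
    )
    .bind(from)
    .bind(to)
    .fetch_all(db)
    .await
}
```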
Primary inputs and outputs
- Input: a DataFile row that includes file_ulid, Parquet URI, row count, and min/max timestamps. This identifies the file to index and the time range it covers.
- Output: one or more IndexArtifact rows, each with artifact_id, file_ulid, artifact_type, artifact_uri, size_bytes, created_at, and a metadata JSON field for versioning or parameters. These records allow deterministic discovery by file and type.
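As a rough sketch, the two record shapes could look like the structs below; the field names and Rust types are assumptions drawn from the fields listed above, with the Postgres schema remaining the source of truth.

```rust
use chrono::{DateTime, Utc};

/// The DataFile row written by Storage: the Indexer's unit of work.
pub struct DataFileRow {
    pub file_ulid: String,           // ULID of the finalized Parquet file
    pub parquet_uri: String,         // object-store URI written via OpenDAL
    pub row_count: i64,
    pub min_ts: DateTime<Utc>,       // time range covered by the file
    pub max_ts: DateTime<Utc>,
}

/// One IndexArtifact row registered per built artifact.
pub struct IndexArtifactRow {
    pub artifact_id: String,
    pub file_ulid: String,           // back-reference to the source DataFile
    pub artifact_type: String,       // e.g. "zone", "bloom", "bitmap", "tantivy"
    pub artifact_uri: String,
    pub size_bytes: i64,
    pub created_at: DateTime<Utc>,
    pub metadata: serde_json::Value, // versioning / builder parameters
}
```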
Components and responsibilities
- IndexerController: Learns about new files (notification or poll), enqueues per-file jobs, and ensures no duplicates (single worker per file). It applies concurrency limits so work can scale safely.
- IndexerWorkers: Download Parquet via OpenDAL or read Arrow if provided, inspect row-group statistics, and invoke builders. CPU-bound steps run on a blocking pool; I/O remains async.
- ArtifactBuilder: Builds the configured artifact types (zone map, bloom/loom, bitmap, Tantivy) in memory before serialization.
- StorageClient: Handles upload with retry/backoff and bounded concurrency to respect provider throttling. Returns durable URIs.
- DbClient: Inserts IndexArtifact rows and, when snapshots are enabled, publishes a snapshot manifest for atomic visibility.
- Compaction/Merging: Optional background process that merges many small artifacts into larger hour/day-level artifacts under size-based or time-based policies, with validation and atomic replacement.
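One way to picture the worker/builder boundary is a small trait that every artifact type implements. This is a hedged sketch assuming Arrow RecordBatch as the in-memory input, not the project's actual interface.

```rust
use arrow::record_batch::RecordBatch;

/// Produced in memory by a builder, then serialized, compressed, and uploaded.
pub struct BuiltArtifact {
    pub artifact_type: &'static str, // "zone" | "bloom" | "bitmap" | "tantivy"
    pub bytes: Vec<u8>,              // payload before zstd compression
}

/// Implemented once per artifact type (zone map, bloom/loom, bitmap, Tantivy).
pub trait ArtifactBuilder: Send {
    /// Feed one batch of rows from the file being indexed.
    fn update(&mut self, batch: &RecordBatch) -> anyhow::Result<()>;
    /// Finish and return the in-memory artifact, ready to serialize.
    fn finish(self: Box<Self>) -> anyhow::Result<BuiltArtifact>;
}
```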
End-to-end sequence
- Storage completes and registers a file: Upload stable Parquet via OpenDAL → write datafile metadata in Postgres. This is the authoritative signal that a file is ready for indexing.
- Controller discovers work: Receives a notification or polls for pending files, then enqueues a job keyed by file. A single worker takes the job, preventing duplicate processing.
- Worker prepares inputs: Download Parquet via OpenDAL (or read Arrow inline), parse row-group stats and schema, and set up per-artifact builders.
- Build artifacts: Construct zone maps, bloom/loom filters, bitmap indexes, and Tantivy indexes, as configured. Builders return in-memory artifacts ready to serialize (a condensed worker sketch follows this list).
- Validate: Check schema compatibility and size sanity; for bloom and bitmap artifacts, verify hashing and dictionary consistency so the artifacts are correct.
- Serialize, compress, upload: Artifacts carry a small header with file_ulid, artifact version, type, and a content hash; they are zstd-compressed and uploaded via OpenDAL using retry/backoff and upload_concurrency limits.
- Register and publish: Insert IndexArtifact rows in Postgres; optionally create a JSON Snapshot manifest listing all artifacts for the file and mark it published so visibility is all-or-none.
- Notify: Signal Query nodes or the control plane (gRPC or pub/sub) that artifacts are available. Search subsequently uses Postgres to discover and download them.
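Putting the sequence together, here is a condensed sketch of one per-file job, assuming a Rust worker built on tokio, opendal, zstd, and sqlx. The helper build_artifacts, the object keys, and the table and column names are illustrative assumptions; only the library calls themselves are real APIs.

```rust
use opendal::Operator;
use sqlx::PgPool;

// Placeholder for the real builders (zone map, bloom/loom, bitmap, Tantivy):
// returns (artifact_type, serialized bytes including the small header).
fn build_artifacts(_parquet: &[u8]) -> anyhow::Result<Vec<(&'static str, Vec<u8>)>> {
    Ok(Vec::new())
}

async fn index_one_file(
    op: Operator,        // OpenDAL operator for the object store
    db: PgPool,          // Postgres pool for registration
    file_ulid: String,
    parquet_key: String, // object key of the finalized Parquet file
) -> anyhow::Result<()> {
    // 1. Download the Parquet bytes (or consume Arrow batches handed over inline).
    let parquet_bytes = op.read(&parquet_key).await?.to_vec();

    // 2. Build artifacts on the blocking pool: CPU-bound work stays off the async runtime.
    let built = tokio::task::spawn_blocking(move || build_artifacts(&parquet_bytes)).await??;

    for (artifact_type, bytes) in built {
        // 3. Compress with zstd before upload.
        let compressed = zstd::encode_all(&bytes[..], 3)?;
        let size = compressed.len() as i64;

        // 4. Upload under a deterministic, file-ULID-derived key.
        let key = format!("{artifact_type}.{file_ulid}.zst");
        op.write(&key, compressed).await?;

        // 5. Register idempotently so a retried job cannot create duplicates.
        sqlx::query(
            "INSERT INTO index_artifacts (file_ulid, artifact_type, artifact_uri, size_bytes) \
             VALUES ($1, $2, $3, $4) \
             ON CONFLICT (file_ulid, artifact_type) \
             DO UPDATE SET artifact_uri = EXCLUDED.artifact_uri, size_bytes = EXCLUDED.size_bytes",
        )
        .bind(&file_ulid)
        .bind(artifact_type)
        .bind(&key)
        .bind(size)
        .execute(&db)
        .await?;
    }
    Ok(())
}
```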
Index artifact types (algorithms and formats)
| Type | Purpose | Construction & Format | Naming |
|---|---|---|---|
| Zone maps | Cheap min/max pruning at row-group granularity. Planner can drop row groups outside a requested range (for example, time). | Read Parquet row-group statistics or compute from Arrow; serialize mapping {row_group → {col → (min,max)}}; zstd-compress; validate min ≤ max and types. | zone.<file_ulid>.zst or zone.json.zst |
| Bloom/Loom | Probabilistic membership to skip files or groups for equality/IN filters. False positives are possible; false negatives are not. | Choose FPR; estimate distinct values; hash normalized bytes; write compact binary with header (k, m, seeds); compress. | bloom::<file_ulid>::colname.bloom.zst |
| Bitmap (Roaring) | Exact membership with fast set union/intersection for categorical fields. | Build dictionary value→id and one roaring bitmap per value; serialize roaring format and dictionary metadata. | bitmap::<file_ulid>::colname.bitmap |
| Tantivy | Full-text phrase and ranked search without scanning all Parquet pages. | Define Tantivy schema (text, keyword, numeric), index rows or grouped docs, merge segments for read speed, export as tar.zst. | tantivy::<file_ulid>.tar.zst |
Deterministic names encode the source file ULID, which simplifies reconciliation and idempotent registration.
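As one concrete example from the table, a bitmap builder for a single categorical column could look like the sketch below, using the roaring crate. The payload framing and the naming helper are assumptions; a real artifact would also carry the dictionary metadata and the header (file_ulid, version, type, content hash).

```rust
use roaring::RoaringBitmap;
use std::collections::HashMap;

fn build_bitmap_artifact(file_ulid: &str, col: &str, values: &[&str]) -> (String, Vec<u8>) {
    // Dictionary: distinct value -> id, plus one bitmap of row positions per value.
    let mut dict: HashMap<String, u32> = HashMap::new();
    let mut bitmaps: Vec<RoaringBitmap> = Vec::new();

    for (row, v) in values.iter().enumerate() {
        let id = *dict.entry((*v).to_string()).or_insert_with(|| {
            bitmaps.push(RoaringBitmap::new());
            (bitmaps.len() - 1) as u32
        });
        bitmaps[id as usize].insert(row as u32);
    }

    // Serialize each bitmap in the crate's portable format
    // (dictionary metadata and the artifact header omitted here for brevity).
    let mut payload = Vec::new();
    for bm in &bitmaps {
        bm.serialize_into(&mut payload).expect("in-memory write cannot fail");
    }

    // Deterministic name derived from the source file ULID and column.
    let name = format!("bitmap::{file_ulid}::{col}.bitmap");
    (name, payload)
}
```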
Storage and metadata model
Artifacts are uploaded with OpenDAL (S3, Azure, GCS) using retry with jitter and bounded concurrency to avoid throttling. Postgres stores the authoritative mapping among data_file, index_artifacts (unique on (file_ulid, artifact_type)), and snapshots for atomic publish. The docs recommend small, focused metadata (uri, size, content_hash, timestamps, status) and idempotent upserts so retries do not create duplicates.
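A sketch of the metadata tables, kept as a raw SQL string that a migration step could apply; column names, types, and defaults are assumptions consistent with this document rather than the project's actual DDL.

```rust
const METADATA_SCHEMA: &str = r#"
CREATE TABLE IF NOT EXISTS data_file (
    file_ulid    TEXT PRIMARY KEY,
    parquet_uri  TEXT NOT NULL,
    row_count    BIGINT NOT NULL,
    min_ts       TIMESTAMPTZ NOT NULL,
    max_ts       TIMESTAMPTZ NOT NULL,
    size_bytes   BIGINT NOT NULL,
    schema_hash  TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS index_artifacts (
    artifact_id   TEXT PRIMARY KEY,
    file_ulid     TEXT NOT NULL REFERENCES data_file (file_ulid),
    artifact_type TEXT NOT NULL,
    artifact_uri  TEXT NOT NULL,
    size_bytes    BIGINT NOT NULL,
    content_hash  TEXT NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    status        TEXT NOT NULL DEFAULT 'active',
    metadata      JSONB NOT NULL DEFAULT '{}',
    UNIQUE (file_ulid, artifact_type)     -- makes registration idempotent
);

CREATE TABLE IF NOT EXISTS snapshots (
    file_ulid    TEXT PRIMARY KEY REFERENCES data_file (file_ulid),
    manifest_uri TEXT NOT NULL,
    published    BOOLEAN NOT NULL DEFAULT FALSE,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
"#;
```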
Concurrency and scaling
The controller polls for pending files, enqueues work, and enforces per-tenant concurrency limits to avoid noisy neighbor effects. Workers separate CPU-intensive building (for example, Tantivy) from async I/O on a blocking pool, so the runtime stays responsive. This model scales horizontally by adding workers or Indexer pods as the backlog grows.
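A minimal sketch of the controller side, assuming tokio: a set of in-flight file ULIDs gives single-worker-per-file semantics, and a semaphore caps concurrency (shown for one tenant; a real controller would keep one semaphore per tenant). The type and method names are illustrative.

```rust
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::{Mutex, Semaphore};

struct Controller {
    in_flight: Arc<Mutex<HashSet<String>>>, // file_ulids currently being indexed
    tenant_limit: Arc<Semaphore>,           // bound on concurrent jobs (simplified: one tenant)
}

impl Controller {
    async fn maybe_enqueue(&self, file_ulid: String) {
        // Single worker per file: drop the job if it is already in flight.
        if !self.in_flight.lock().await.insert(file_ulid.clone()) {
            return;
        }
        // Waiting for a permit here is what bounds per-tenant concurrency.
        let permit = self.tenant_limit.clone().acquire_owned().await.expect("semaphore open");
        let in_flight = self.in_flight.clone();
        tokio::spawn(async move {
            let _permit = permit; // held for the lifetime of the job
            // ... download, build, upload, register (see the worker sketch above) ...
            in_flight.lock().await.remove(&file_ulid);
        });
    }
}
```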
Atomic publishing and snapshots
To guarantee all-or-nothing visibility, the Indexer can write a JSON Snapshot manifest listing every artifact for a file, upload it to a predictable path, and set a published flag in Postgres only after all artifacts are persisted. Query nodes discover and use only published snapshots, so half-indexed files never appear.
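A sketch of the per-file snapshot manifest and its publish step; the struct layout and the snapshots table shape are assumptions consistent with the model above, while the serde_json, opendal, and sqlx calls are real APIs.

```rust
use serde::Serialize;

#[derive(Serialize)]
struct SnapshotManifest {
    file_ulid: String,
    artifacts: Vec<SnapshotEntry>, // every artifact built for this file
}

#[derive(Serialize)]
struct SnapshotEntry {
    artifact_type: String,
    artifact_uri: String,
    content_hash: String,
}

async fn publish_snapshot(
    op: opendal::Operator,
    db: sqlx::PgPool,
    manifest: SnapshotManifest,
) -> anyhow::Result<()> {
    // Upload the manifest to a predictable, ULID-derived path.
    let key = format!("snapshot.{}.json", manifest.file_ulid);
    op.write(&key, serde_json::to_vec(&manifest)?).await?;

    // Flip the published flag only after every artifact and the manifest are durable;
    // Query nodes ignore unpublished snapshots, so half-indexed files never appear.
    sqlx::query("UPDATE snapshots SET published = TRUE WHERE file_ulid = $1")
        .bind(&manifest.file_ulid)
        .execute(&db)
        .await?;
    Ok(())
}
```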
Compaction and merging (optional)
When many small artifacts accumulate, policies can merge them into larger artifacts. The docs describe SizeBased and TimeBased policies: select inputs by tenant/signal/level, download, merge in memory, upload the merged artifact, register it, and deprecate inputs. Validation can be enabled before replacement, and garbage collection removes deprecated artifacts after verification.
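As an illustration of a SizeBased merge, the sketch below folds small zone-map artifacts into one wider (for example, hour-level) zone map by widening the per-column (min, max) ranges. The types and threshold are assumptions, and the other artifact types need their own merge logic.

```rust
use std::collections::HashMap;

/// Per-column (min, max) for one source artifact, keyed by column name.
type ZoneMap = HashMap<String, (i64, i64)>;

fn merge_zone_maps(inputs: &[(u64 /* size_bytes */, ZoneMap)], max_input_size: u64) -> ZoneMap {
    let mut merged: ZoneMap = HashMap::new();
    for (size, zm) in inputs {
        // SizeBased policy: only fold in artifacts below the threshold;
        // larger inputs are already compacted enough and are left alone.
        if *size > max_input_size {
            continue;
        }
        for (col, (lo, hi)) in zm {
            merged
                .entry(col.clone())
                .and_modify(|(mlo, mhi)| {
                    *mlo = (*mlo).min(*lo); // widen the merged range
                    *mhi = (*mhi).max(*hi);
                })
                .or_insert((*lo, *hi));
        }
    }
    merged
    // The caller validates the result, uploads the merged artifact, registers it,
    // and deprecates the inputs; GC removes them after verification.
}
```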
Reliability and recovery
If a worker crashes mid-build, partial artifacts remain in temp locations, and the job stays pending for retry. If uploads fail, the client retries with exponential backoff; Postgres writes use upsert, so duplicates are not introduced. If metadata insert fails after upload, a janitor process can reconcile object store contents back into the DB using deterministic names. These guardrails let the pipeline recover after restarts without double-indexing.
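For the upload path, the retry-with-backoff behavior can come from OpenDAL's RetryLayer, as in this sketch; the Memory backend only keeps it self-contained, and a real deployment would configure S3/Azure/GCS the same way. Registration stays an upsert (see the worker sketch above), so a retried job cannot double-register.

```rust
use opendal::layers::RetryLayer;
use opendal::{services, Operator};

fn object_store() -> opendal::Result<Operator> {
    // Exponential backoff with jitter on transient errors, applied to every call.
    let op = Operator::new(services::Memory::default())?
        .layer(RetryLayer::new().with_jitter().with_max_times(4))
        .finish();
    Ok(op)
}
```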
Observability
The docs surface metrics, traces, and logs across the stages: scheduling, downloading Parquet, building each artifact type, serialization, upload, and DB registration. Search uses the recorded artifact metadata to prune during query planning and download only necessary objects.
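A small sketch of per-stage instrumentation with the tracing crate; the span and field names are assumptions, not the project's actual metric or log schema.

```rust
/// Example: time one build stage and emit a structured event for it.
#[tracing::instrument(skip(parquet_bytes))]
fn build_zone_map(file_ulid: &str, parquet_bytes: &[u8]) -> Vec<u8> {
    let started = std::time::Instant::now();
    let _ = parquet_bytes; // placeholder: a real builder reads row-group stats here
    let artifact: Vec<u8> = Vec::new();
    tracing::info!(
        elapsed_ms = started.elapsed().as_millis() as u64,
        size_bytes = artifact.len() as u64,
        "zone map built"
    );
    artifact
}
```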
Configuration touch-points
Indexer behavior is controlled by focused knobs in the config and pipeline sections: worker thread count, upload concurrency, and per-artifact parameters (for example, Bloom FPR; Tantivy export as tar.zst). These settings tune the size and effectiveness of artifacts and the pressure on storage backends, without changes to ingest or storage nodes.
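These knobs could deserialize into a struct like the following; the field names and grouping are assumptions for illustration, not the project's actual configuration schema.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct IndexerConfig {
    pub worker_threads: usize,     // CPU-bound builder threads (blocking pool size)
    pub upload_concurrency: usize, // concurrent OpenDAL uploads
    pub bloom_fpr: f64,            // target false-positive rate, e.g. 0.01
    pub tantivy_export: String,    // artifact packaging, e.g. "tar.zst"
}
```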