Executive Summary

The Indexer Pipeline converts each uploaded Parquet data file into a compact set of pruning and search artifacts, then publishes those artifacts so the Query tier can skip irrelevant data and answer fast. The pipeline begins when Storage finishes a Parquet file and registers it in Postgres. It ends when every configured artifact for that file is built, uploaded to object storage, and recorded in metadata so Search can discover and use them. The design is modular, trait-driven, and runs as workers that scale with backlog.

Role in the system

When Storage uploads a Parquet file, it writes a “datafile” row in Postgres that includes the file URI, time range, and row counts. That row is the unit of work for the Indexer. The Indexer learns about this work either by a direct notification or by polling for pending files. It leases a file so only one worker processes it at a time, then builds the artifacts that were configured for that dataset or tenant. This separation keeps indexing out of the hot ingest path while making newly arrived data queryable soon after upload.

After building and uploading artifacts, the Indexer writes dedicated metadata rows so downstream components can find them by file ULID and artifact type. Search later joins file metadata with artifact metadata to decide which objects to touch, which row groups to open, and which text segments to scan. This is the contract that gives predictable query speed even when object counts are large.

What the Indexer produces

The Indexer supports multiple artifact types. Each addresses a different pruning need, and each is stored in cloud storage with a pointer in Postgres that links back to the source file.

  • Zone maps record per-row-group minimum and maximum values for selected columns. If a query asks for a time range or a bounded numeric range, the planner can drop row groups whose min/max bounds cannot satisfy the predicate (a small sketch follows this list). Artifacts are serialized as compact JSON or binary and compressed to keep them cheap to fetch. Filenames encode the source file ULID so they are easy to reconcile.
  • Bloom or Loom filters give probabilistic set membership per column. They help with equality and small IN lists. The builder picks a false positive rate, sizes the bitset, hashes normalized values, then writes a tiny binary that the planner can check before opening the Parquet page. Results are compressed and named per source file and column.
  • Bitmap (Roaring) indexes provide exact membership for high-reuse categorical predicates. The builder maintains a dictionary and one roaring bitmap per value to allow fast union and intersection. This makes common dashboards and multi-select filters cheap.
  • Tantivy full-text indexes serve phrase and ranked text queries. The builder maps text, keyword, and numeric fields to a Tantivy schema, builds segments, merges for read speed, and exports a small tarball. Search can then route text predicates to the index rather than scanning every Parquet page.
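To make the zone-map idea concrete, here is a minimal, self-contained Rust sketch that records per-row-group min/max values for one integer column and tests whether a bounded predicate lets a row group be skipped. The types and function names are illustrative, not the actual builder code.

```rust
/// Hypothetical zone-map entry for one column in one row group.
/// The real builder derives these bounds from Parquet row-group statistics.
#[derive(Debug, Clone)]
struct ZoneMapEntry {
    row_group: usize,
    min: i64,
    max: i64,
}

/// Build zone-map entries from per-row-group column values.
fn build_zone_map(row_groups: &[Vec<i64>]) -> Vec<ZoneMapEntry> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, values)| !values.is_empty())
        .map(|(i, values)| ZoneMapEntry {
            row_group: i,
            min: *values.iter().min().unwrap(),
            max: *values.iter().max().unwrap(),
        })
        .collect()
}

/// A row group can be skipped when its [min, max] range cannot
/// intersect the query's [lo, hi] range.
fn can_skip(entry: &ZoneMapEntry, lo: i64, hi: i64) -> bool {
    entry.max < lo || entry.min > hi
}

fn main() {
    let row_groups = vec![vec![1, 5, 9], vec![100, 250], vec![300, 301]];
    let zone_map = build_zone_map(&row_groups);
    // Query asks for values between 200 and 280: only row group 1 survives.
    for entry in &zone_map {
        println!("row group {}: skip = {}", entry.row_group, can_skip(entry, 200, 280));
    }
}
```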

All four artifact types can be enabled together or selectively per dataset. The choice is configuration-driven and can evolve without touching ingest or storage.

End-to-end flow

  1. Storage completes a Parquet file: It uploads to object storage, then inserts a data file row in Postgres with URI, row count, and time range. This makes the file discoverable for indexing.
  2. IndexerController picks work: It polls for “pending” files, takes a per-file lease, and places a job on the worker queue so duplicate workers cannot race on the same file. This is a lightweight scheduling loop that scales horizontally.
  3. Worker downloads and inspects: The worker fetches the Parquet object via OpenDAL, reads row group statistics, and prepares per-artifact builders with the schema and time bounds. I/O is async, while CPU-heavy steps run on blocking threads to keep the runtime responsive.
  4. Build artifacts: Builders create zone maps, bloom or loom filters, bitmap indexes, and Tantivy segments as requested by the config. Each builder returns an in-memory artifact plus a small header that carries the file ULID, type, version, and content hash to support validation and dedupe.
  5. Serialize, compress, and upload: Artifacts are serialized to compact binary or tar.zst, then uploaded through OpenDAL with retries and bounded concurrency. Errors use backoff with jitter to avoid store throttling. Success returns a durable URI (a sketch of this step follows the list).
  6. Register metadata and publish: The Indexer inserts IndexArtifact rows in Postgres, optionally writes a per-file Snapshot manifest that lists all artifacts, then marks the data file as “indexed.” Publishing can be atomic, so Search only sees a complete set.
  7. Notify downstream: The controller signals availability, or Search discovers the new artifacts by querying metadata. From here on, every query can leverage them for pruning.
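As a hedged sketch of step 5, the snippet below shows one way the serialized artifact bytes could be compressed, content-hashed, and given a deterministic object key before upload. It assumes the `zstd` and `sha2` crates; the key layout and all names are assumptions for illustration.

```rust
use sha2::{Digest, Sha256};

/// Hypothetical artifact kinds; the real set is configuration-driven.
enum ArtifactType {
    ZoneMap,
    Bloom,
    Bitmap,
    Tantivy,
}

impl ArtifactType {
    fn as_str(&self) -> &'static str {
        match self {
            ArtifactType::ZoneMap => "zonemap",
            ArtifactType::Bloom => "bloom",
            ArtifactType::Bitmap => "bitmap",
            ArtifactType::Tantivy => "tantivy",
        }
    }
}

/// Compress the serialized artifact, hash the compressed bytes for
/// validation and dedupe, and derive a deterministic object key from the
/// source file ULID and artifact type (the key layout is illustrative).
fn prepare_upload(
    file_ulid: &str,
    artifact_type: ArtifactType,
    serialized: &[u8],
) -> std::io::Result<(String, String, Vec<u8>)> {
    let compressed = zstd::encode_all(serialized, 3)?; // level 3: cheap but effective
    let digest = Sha256::digest(&compressed);
    let content_hash: String = digest.iter().map(|b| format!("{:02x}", b)).collect();
    let key = format!("indexes/{}/{}.{}.zst", file_ulid, file_ulid, artifact_type.as_str());
    Ok((key, content_hash, compressed))
}
```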

Interfaces and data shapes

The Indexer hides I/O and database details behind traits so you can test builders without a real cloud or Postgres, and swap backends without touching orchestration. A sketch of one possible trait shape follows the table.

| Interface | Purpose | Notes |
| --- | --- | --- |
| ArtifactBuilder | Build artifacts from Parquet or Arrow | Exposes per-type build functions so you can enable only what a dataset needs. |
| StorageClient | Move bytes to and from the object store | Upload returns a URI. Download can stream to a temp file to bound memory. |
| DbClient | Read work, write results | Methods include “fetch pending,” “insert artifact,” and “mark indexed” with simple transactions. |
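The exact trait definitions live in the codebase; purely as an illustration, one possible async shape of the three interfaces could look like the sketch below. Method names and signatures are assumptions, using the `async_trait` crate for object-safe async methods.

```rust
use async_trait::async_trait;

/// Illustrative result alias; the real code may use its own error types.
type Result<T> = anyhow::Result<T>;

// Placeholder record types; field-level sketches follow with the data records below.
struct DataFile;
struct IndexArtifact;

/// Small header carried alongside each built artifact (illustrative).
struct ArtifactHeader {
    file_ulid: String,
    artifact_type: String,
    version: u32,
    content_hash: String,
}

/// Builds one artifact type from a downloaded Parquet file.
#[async_trait]
trait ArtifactBuilder: Send + Sync {
    /// Returns serialized artifact bytes plus a small descriptive header.
    async fn build(&self, local_parquet_path: &std::path::Path) -> Result<(Vec<u8>, ArtifactHeader)>;
}

/// Moves bytes to and from the object store (e.g. via OpenDAL).
#[async_trait]
trait StorageClient: Send + Sync {
    async fn download(&self, uri: &str, dest: &std::path::Path) -> Result<()>;
    async fn upload(&self, key: &str, bytes: Vec<u8>) -> Result<String>; // returns a durable URI
}

/// Reads pending work and records results in Postgres.
#[async_trait]
trait DbClient: Send + Sync {
    async fn fetch_pending(&self, limit: usize) -> Result<Vec<DataFile>>;
    async fn insert_artifact(&self, artifact: &IndexArtifact) -> Result<()>;
    async fn mark_indexed(&self, file_ulid: &str) -> Result<()>;
}
```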

Two data records tie everything together (sketched as Rust structs after the list):

  • DataFile: The unit of work with file_ulid, tenant, parquet_uri, row_count, min_ts, max_ts, and optional partition keys. It is created by Storage and consumed by the Indexer.
  • IndexArtifact: One row per artifact with artifact_id, file_ulid, artifact_type, artifact_uri, size_bytes, created_at, plus a metadata field for versioning or parameters. The pair (file_ulid, artifact_type) is unique by design.
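A hedged struct-level sketch of these two records, with field names taken from the description above; exact types and column names may differ in the real schema.

```rust
use chrono::{DateTime, Utc};

/// Unit of work, written by Storage and consumed by the Indexer.
struct DataFile {
    file_ulid: String,
    tenant: String,
    parquet_uri: String,
    row_count: u64,
    min_ts: DateTime<Utc>,
    max_ts: DateTime<Utc>,
    partition_keys: Option<Vec<String>>,
}

/// One row per built artifact; (file_ulid, artifact_type) is unique by design.
struct IndexArtifact {
    artifact_id: String,
    file_ulid: String,
    artifact_type: String,
    artifact_uri: String,
    size_bytes: u64,
    created_at: DateTime<Utc>,
    metadata: serde_json::Value, // versioning or build parameters
}
```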

Persistence and metadata model

Artifacts live in the same object store as the Parquet data. Names are deterministic and include the source file ULID and the artifact type, so reconciliation is straightforward. Postgres stores the authoritative mapping from file to artifact URIs, which lets Search look up the artifacts it needs quickly. A Snapshot manifest can list all artifacts that belong to a file, so visibility becomes all-or-nothing during refresh. Upserts and content hashes keep retries idempotent and safe across crash restarts.
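As an illustration of the idempotent registration, the statement below shows the kind of upsert a `DbClient` implementation might issue; the table and column names are assumptions based on the records described earlier.

```rust
/// Illustrative upsert: a retry after a crash re-runs the same statement and
/// lands on the existing row because (file_ulid, artifact_type) is unique.
const UPSERT_ARTIFACT_SQL: &str = r#"
INSERT INTO index_artifacts
    (artifact_id, file_ulid, artifact_type, artifact_uri, size_bytes, created_at, metadata)
VALUES ($1, $2, $3, $4, $5, now(), $6)
ON CONFLICT (file_ulid, artifact_type)
DO UPDATE SET
    artifact_uri = EXCLUDED.artifact_uri,
    size_bytes   = EXCLUDED.size_bytes,
    metadata     = EXCLUDED.metadata;
"#;
```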

Scheduling, concurrency, and scaling

The Indexer is an actor system. A controller loop selects pending files, takes leases, and feeds workers. Workers download Parquet, build artifacts, and upload results. CPU-intensive work like Tantivy runs on a blocking thread pool, while I/O and coordination stay async. You can enforce per-tenant concurrency limits so one noisy tenant does not starve others. Horizontal scaling is simple: increase worker count or add more Indexer pods when the queue grows.
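A minimal sketch of the per-tenant limit plus blocking offload, assuming Tokio and using hypothetical names; the real controller and worker code will differ.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, Semaphore};

/// Hypothetical per-tenant limiter: each tenant gets a fixed number of
/// concurrent indexing slots so a noisy tenant cannot starve the others.
#[derive(Clone)]
struct TenantLimiter {
    permits_per_tenant: usize,
    semaphores: Arc<Mutex<HashMap<String, Arc<Semaphore>>>>,
}

impl TenantLimiter {
    fn new(permits_per_tenant: usize) -> Self {
        Self {
            permits_per_tenant,
            semaphores: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    /// Run a CPU-heavy build for one tenant on the blocking thread pool,
    /// holding that tenant's permit for the duration of the job.
    async fn run_blocking<T, F>(&self, tenant: &str, job: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let sem = {
            let mut map = self.semaphores.lock().await;
            map.entry(tenant.to_string())
                .or_insert_with(|| Arc::new(Semaphore::new(self.permits_per_tenant)))
                .clone()
        };
        // The permit is released when `_permit` drops after the job completes.
        let _permit = sem.acquire_owned().await.expect("semaphore closed");
        tokio::task::spawn_blocking(job).await.expect("blocking task panicked")
    }
}
```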

A small in-process cache can hold recent artifacts or stats to avoid repeated downloads across related jobs. This helps when dashboards hit the same time windows.

Configuration highlights

Configuration exposes a focused set of knobs to balance speed, memory, and cloud limits: worker thread count, upload concurrency, per-type settings such as the bloom false positive rate or the Tantivy memory budget, and compaction parameters that control the number of small artifacts. Tune worker threads to roughly the available CPU cores, keep upload concurrency under provider throttling limits, and size the Tantivy memory budget to protect the node. Each change is local to the Indexer and does not affect ingest or storage.
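For illustration, the knobs described above might be grouped into a config struct like the sketch below; the field names and defaults are assumptions, not the actual configuration schema.

```rust
/// Illustrative Indexer configuration mirroring the knobs described above.
struct IndexerConfig {
    /// CPU-bound build threads; keep near the available core count.
    worker_threads: usize,
    /// Concurrent uploads; keep under the provider's throttling limits.
    upload_concurrency: usize,
    /// Target false-positive rate for bloom filters (e.g. 0.01 = 1%).
    bloom_false_positive_rate: f64,
    /// Memory budget handed to the Tantivy index writer, in bytes.
    tantivy_memory_budget_bytes: usize,
    /// Merge small artifacts once this many accumulate for a dataset.
    compaction_min_artifacts: usize,
}

impl Default for IndexerConfig {
    fn default() -> Self {
        Self {
            worker_threads: 4,
            upload_concurrency: 8,
            bloom_false_positive_rate: 0.01,
            tantivy_memory_budget_bytes: 256 * 1024 * 1024,
            compaction_min_artifacts: 16,
        }
    }
}
```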

Reliability, validation, and publishing

Every artifact follows the same lifecycle: build → validate → serialize → upload → register → publish. Validation checks schema compatibility, size sanity, and content hash. Bloom and bitmap builders validate hashing or dictionary consistency. Uploads use retry with backoff and jitter, and database writes are idempotent. If any step fails, the controller keeps the file pending so a later attempt can complete the set. When you enable Snapshot publishing, the file appears “indexed” only after all artifacts are present. This keeps Search from seeing a half-indexed file.
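A small sketch of the retry-with-backoff-and-jitter pattern the uploads rely on, written as a generic async helper; the name and the exact policy are assumptions.

```rust
use std::time::Duration;

/// Hypothetical retry helper: exponential backoff with full jitter, as used
/// around artifact uploads to avoid hammering a throttled object store.
async fn retry_with_backoff<T, E, F, Fut>(max_attempts: u32, base: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Exponential backoff capped at 2^10 * base, with full jitter.
                let exp = base.as_millis() as u64 * (1u64 << attempt.min(10));
                let jittered = (exp as f64 * rand::random::<f64>()) as u64;
                tokio::time::sleep(Duration::from_millis(jittered)).await;
                attempt += 1;
            }
        }
    }
}
```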

How Search benefits

Search reads Postgres to find files in the query's time range, fetches the associated artifacts, and computes what to skip. Zone maps remove row groups whose min/max bounds rule out the predicate. Bloom or bitmap filters avoid downloading files or row groups that cannot contain the requested values. Tantivy serves text predicates without scanning every Parquet page. What remains is a smaller, well-targeted set of Parquet reads that DataFusion can execute quickly. This is how OpenWit keeps query speed steady even as data volume and object counts grow.
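As a simplified picture of the pruning decision (not Search's actual planner), the sketch below keeps only files whose time range overlaps the query window and whose membership check cannot rule out the requested value; all names are illustrative.

```rust
/// Minimal stand-in for the metadata Search reads from Postgres.
struct CandidateFile {
    file_ulid: String,
    min_ts: i64,
    max_ts: i64,
}

/// Decide which files are worth opening: keep a file only if its time range
/// overlaps the query window and the membership check (bloom or bitmap) does
/// not rule out the requested value.
fn prune_candidates<F>(
    files: &[CandidateFile],
    query_start: i64,
    query_end: i64,
    might_contain: F,
) -> Vec<&CandidateFile>
where
    F: Fn(&str) -> bool, // e.g. a bloom-filter lookup keyed by file ULID
{
    files
        .iter()
        .filter(|f| f.max_ts >= query_start && f.min_ts <= query_end)
        .filter(|f| might_contain(f.file_ulid.as_str()))
        .collect()
}
```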