Data Processing Pipeline

This page describes how data moves through OpenWit from the first request at the edge to a result returned to a client. The pipeline is built around streaming ingestion, columnar transformation and index-first search so input is accepted once, stored efficiently and queried with minimal I/O.

1. Ingestion

Producers send data over Kafka, gRPC or HTTP. The Ingestion Gateway is the front door. It authenticates the request, validates the payload against the dataset schema, normalizes field names so keys are consistent, then groups events into batches by size or time. After batching, the gateway sends a bulk protobuf to the Ingest Node over gRPC. This shapes incoming traffic into predictable units for the cluster.
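
A minimal sketch of the size-or-time batching step, assuming illustrative limits of 500 events or 2 seconds; the class and field names are hypothetical and the real gateway limits are configuration-driven.

```python
import time

class Batcher:
    """Group normalized events into batches by count or age."""

    def __init__(self, max_events=500, max_age_seconds=2.0):
        self.max_events = max_events
        self.max_age_seconds = max_age_seconds
        self.events = []
        self.opened_at = None

    def add(self, event):
        """Buffer one event; return a full batch when a limit is hit, else None."""
        if not self.events:
            self.opened_at = time.monotonic()
        self.events.append(event)
        too_big = len(self.events) >= self.max_events
        too_old = time.monotonic() - self.opened_at >= self.max_age_seconds
        return self.flush() if (too_big or too_old) else None

    def flush(self):
        """Hand the buffered events off as one batch and reset the buffer."""
        batch, self.events, self.opened_at = self.events, [], None
        return batch
```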

2. Ingest processing

The Ingest Node receives the bulk batch and applies the first guarantees inside the cluster. It deduplicates events within a time window using a hash key, so repeated events in that window are dropped rather than forwarded. It writes the batch to the short-term WAL for immediate durability, then aggregates entries into the long-term WAL. It converts the payload into an Arrow RecordBatch using a zero-copy path and forwards Arrow batches to Storage over Arrow Flight. This is where durability is secured and where in-memory data becomes columnar for the rest of the system.
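
A sketch of the windowed deduplication step, assuming the hash key is derived from a few event fields; the field names and window length here are hypothetical.

```python
import hashlib
import time

class WindowDeduplicator:
    """Drop events whose hash key was already seen inside the current window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.seen = {}  # hash key -> time first seen

    def _key(self, event):
        # Hypothetical key fields; the real key derivation is configurable.
        raw = f"{event.get('source')}|{event.get('message')}|{event.get('ts')}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def accept(self, event, now=None):
        """Return True if the event is new inside the current window."""
        now = now if now is not None else time.time()
        # Drop bookkeeping for entries whose window has expired.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t < self.window_seconds}
        key = self._key(event)
        if key in self.seen:
            return False
        self.seen[key] = now
        return True
```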

3. Storage processing

The Storage Node merges incoming Arrow batches into an active Parquet file. When the active file reaches the configured size, it becomes a stable Parquet file. Stable files are uploaded to object storage through OpenDAL. After upload, Storage updates Postgres with file path, timestamps and size, then triggers the Indexer actor to build indexes asynchronously.
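
A sketch of the active-to-stable Parquet roll, shown with pyarrow and an assumed 128 MB roll size; the real node receives batches over Arrow Flight and uploads the stable file through OpenDAL, which is not shown here.

```python
import pyarrow.parquet as pq

ROLL_BYTES = 128 * 1024 * 1024  # assumed roll size; configurable in practice

def write_until_roll(batches, path="active.parquet"):
    """Append Arrow batches to an active Parquet file until it reaches roll size."""
    writer = None
    written = 0
    for batch in batches:
        if writer is None:
            writer = pq.ParquetWriter(path, batch.schema)
        writer.write_batch(batch)
        written += batch.nbytes
        if written >= ROLL_BYTES:
            break  # the file is now considered stable
    if writer is not None:
        writer.close()
    return path  # a stable file, ready for upload to object storage
```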

4. Indexing

The Indexer reads the Arrow or Parquet view and produces per-file index artifacts: bitmap indexes for categorical columns, bloom or loom filters for membership tests, zonemaps for numeric and time pruning, and Tantivy for full text. Index files are uploaded to object storage. Index metadata is written to Postgres so each index is linked to its Parquet file and time range.
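
A sketch of one index type, a zonemap built from Parquet row-group statistics; the bitmap, bloom or loom, and Tantivy artifacts are produced separately and are not shown here.

```python
import pyarrow.parquet as pq

def build_zonemap(parquet_path, column):
    """Collect (row_group, min, max) so queries can skip non-matching row groups."""
    meta = pq.ParquetFile(parquet_path).metadata
    zones = []
    for rg in range(meta.num_row_groups):
        group = meta.row_group(rg)
        for ci in range(group.num_columns):
            col = group.column(ci)
            if col.path_in_schema != column:
                continue
            stats = col.statistics
            if stats is not None and stats.has_min_max:
                zones.append((rg, stats.min, stats.max))
    return zones
```

A query with a predicate such as ts between two bounds can then skip every row group whose (min, max) range does not overlap the requested window.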

5. Metadata Catalog

Postgres holds the catalog for files and indexes with their time ranges and versions. The catalog links ingestion, storage and indexing so a query can always discover which files to read and operators can trace any result back to its origin.
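
A sketch of the lookup the catalog enables, using a hypothetical row shape; the actual Postgres schema and column names may differ.

```python
from dataclasses import dataclass

@dataclass
class CatalogFile:
    """Hypothetical catalog row: one Parquet file with its time range."""
    path: str
    min_ts: int
    max_ts: int

def candidate_files(catalog, query_start, query_end):
    """A file is a candidate when its time range overlaps the query window."""
    return [f for f in catalog
            if f.min_ts <= query_end and f.max_ts >= query_start]
```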

6. Search and query execution

The Search Node receives SQL or text queries through the Proxy. It first consults Postgres to find the candidate Parquet files and index ranges for the requested time window, then downloads the index files it needs and either fetches Parquet from the object store or reads from cache when present. Query predicates are translated to a SQL AST and optimized in DataFusion. Indexes and time ranges are used to prune aggressively so only the needed Parquet is read. Results return to the client as Arrow RecordBatches.
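
An illustration of time and predicate pruning, shown with pyarrow.dataset rather than DataFusion itself; the column names ts and service are hypothetical, and the candidate paths come from a catalog lookup like the one above.

```python
import pyarrow.dataset as ds

def run_query(candidate_paths, start_ts, end_ts, service):
    """Read only the Parquet that can match the time window and predicate."""
    dataset = ds.dataset(candidate_paths, format="parquet")
    predicate = (
        (ds.field("ts") >= start_ts)
        & (ds.field("ts") <= end_ts)
        & (ds.field("service") == service)
    )
    # Row groups whose statistics cannot match the predicate are skipped.
    table = dataset.to_table(filter=predicate)
    return table.to_batches()  # results as Arrow RecordBatches
```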

7. Caching in the read path

Frequently accessed data resides in a tiered cache so queries avoid unnecessary downloads. Electro keeps Arrow in RAM for nanosecond-scale access, Pyro keeps Parquet on SSD for microsecond-scale access, and Cryo serves from object storage at second-level latency. The search path prefers cache before hitting the cloud.
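
A sketch of the tiered lookup order under simplified assumptions: Electro modeled as an in-process map, Pyro as files on local SSD, and Cryo as a stubbed fetch from object storage. The cache directory and promotion policy are hypothetical.

```python
import os

class TieredCache:
    """Check RAM first, then SSD, then fall back to object storage."""

    def __init__(self, ssd_dir="/var/cache/openwit"):
        self.ram = {}            # Electro tier: data kept in memory
        self.ssd_dir = ssd_dir   # Pyro tier: files kept on local SSD

    def get(self, file_id, fetch_from_object_store):
        if file_id in self.ram:                       # fastest tier
            return self.ram[file_id]
        ssd_path = os.path.join(self.ssd_dir, file_id)
        if os.path.exists(ssd_path):                  # warm tier
            with open(ssd_path, "rb") as f:
                data = f.read()
        else:                                         # cold tier: object storage
            data = fetch_from_object_store(file_id)
        self.ram[file_id] = data                      # promote for later reads
        return data
```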

8. Communication lanes used by the pipeline

OpenWit uses three channels, and each one serves a distinct purpose in the pipeline: Gossip with Chitchat carries low-frequency cluster metadata and node health; gRPC carries control commands and small RPC messages; Arrow Flight carries high-throughput Arrow columnar batches. Separating control traffic from heavy data keeps the system efficient.
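
A sketch of the heavy data lane, pushing one Arrow RecordBatch over Arrow Flight with pyarrow; the endpoint address and descriptor path are hypothetical, not OpenWit's actual ones.

```python
import pyarrow as pa
import pyarrow.flight as flight

def send_batch(batch: pa.RecordBatch, endpoint="grpc://storage-node:50051"):
    """Stream one columnar batch to a Flight server on the data lane."""
    client = flight.connect(endpoint)
    descriptor = flight.FlightDescriptor.for_path("ingest", "batch")
    writer, _ = client.do_put(descriptor, batch.schema)
    writer.write_batch(batch)
    writer.close()
```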

Reliability and failure handling

Short-term WAL guarantees no data loss if an ingest process crashes. Long-term WAL provides a warm, time-organized copy that supports recovery and background work. Object storage durability keeps Parquet and index files safe independent of compute nodes. Postgres can run in an HA setup so the catalog remains consistent. All gRPC and Arrow Flight calls retry with exponential backoff on transient failures, which keeps the pipeline moving through temporary faults.
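
A minimal sketch of retry with exponential backoff and jitter, assuming five attempts and a 200 ms base delay; the real clients wrap their gRPC and Arrow Flight calls with equivalent logic.

```python
import random
import time

def call_with_backoff(fn, attempts=5, base_delay=0.2):
    """Retry fn on transient failures, doubling the delay with added jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:  # placeholder for a transient RPC error type
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```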

Formats at each hop

The pipeline uses a format that fits the job at each step. The table below shows what travels on the wire or lands in storage at each hop so you can verify payloads during operations.

Stage                    Format or Unit
Entry → Gateway          JSON or Protobuf
Gateway → Ingest         gRPC bulk proto
Ingest → Storage         Arrow RecordBatch
Storage → Object Store   Parquet
Indexes → Object Store   Index file formats: bitmap, Tantivy, bloom or loom, zonemap
Search → Proxy           Arrow RecordBatch results

Operator checklist

  • Gateway batches are being formed and forwarded as gRPC bulk payloads to Ingest
  • Ingest shows healthy short-term and long-term WAL activity before Arrow conversion
  • Storage is rolling active to stable Parquet and uploading via OpenDAL
  • Index files appear in the object store and index rows appear in Postgres
  • Search shows pruning by time and index and returns Arrow results with minimal I/O

These checks mirror the stages above and the responsibilities described in this document.