Data Flow Architecture
This page explains how data moves through OpenWit, from the first byte at the edge to a result set returned to a client, walking through each stage of the flow in order.
Principles that shape the flow
OpenWit’s data flow is built on three ideas. First, streaming ingestion because data arrives continuously. Second, columnar transformation so memory and storage are efficient. Third, index-first search so lookups and aggregations are fast.
Stage 1: Data sources → Ingestion Gateway
Producers send data over Kafka, gRPC, or HTTP. The Ingestion Gateway is the front door. It authenticates the request, validates the payload against the dataset schema, normalizes field names for key consistency, then batches records using the configured size or time thresholds. After batching, it sends a bulk protobuf to the Ingest Node over gRPC. This keeps downstream work predictable and safe.
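Conceptually, the batching step is a small buffer that flushes on whichever threshold is crossed first and hands the full batch downstream. The Rust sketch below shows that size-or-time logic in isolation; the `Batcher` type, its thresholds, and the record type are illustrative, not the Gateway's actual code.

```rust
use std::time::{Duration, Instant};

/// Minimal size-or-time batcher sketch. Names and thresholds are
/// assumptions for illustration only.
struct Batcher<T> {
    buf: Vec<T>,
    max_records: usize,
    max_wait: Duration,
    started: Instant,
}

impl<T> Batcher<T> {
    fn new(max_records: usize, max_wait: Duration) -> Self {
        Self { buf: Vec::new(), max_records, max_wait, started: Instant::now() }
    }

    /// Push a record; return a full batch when either threshold is crossed.
    fn push(&mut self, record: T) -> Option<Vec<T>> {
        if self.buf.is_empty() {
            self.started = Instant::now();
        }
        self.buf.push(record);
        if self.buf.len() >= self.max_records || self.started.elapsed() >= self.max_wait {
            Some(std::mem::take(&mut self.buf))
        } else {
            None
        }
    }
}
```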
Stage 2: Ingest Node
The Ingest Node receives the bulk batch and applies deduplication within a time window. It writes the batch to a short-term WAL for immediate durability, aggregates entries into a long-term WAL, then converts the payload into an Arrow RecordBatch. Arrow batches leave the node over Arrow Flight toward storage. This stage is where durability is guaranteed and where data becomes columnar in memory.
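The columnar conversion amounts to pivoting row-shaped records into per-field Arrow arrays. A minimal sketch with the `arrow` crate, assuming a hypothetical two-field schema of timestamp and message; the real schema comes from the dataset definition validated at the Gateway.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Pivot row-shaped (timestamp, message) records into a columnar
/// Arrow RecordBatch. The two-field schema is an assumption.
fn rows_to_batch(rows: &[(i64, String)]) -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("timestamp_ms", DataType::Int64, false),
        Field::new("message", DataType::Utf8, false),
    ]));
    let timestamps = Int64Array::from(rows.iter().map(|(t, _)| *t).collect::<Vec<_>>());
    let messages = StringArray::from(rows.iter().map(|(_, m)| m.as_str()).collect::<Vec<_>>());
    RecordBatch::try_new(
        schema,
        vec![Arc::new(timestamps) as ArrayRef, Arc::new(messages) as ArrayRef],
    )
}
```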
Stage 3: Storage Node
The Storage Node merges incoming Arrow batches into an active Parquet file. When the active file reaches the target size it becomes a stable Parquet file. Stable files are uploaded to the object store using OpenDAL. This node is responsible for turning columnar memory batches into durable columnar files and sending them to cloud storage.
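At this hop the unit of work changes from in-memory batches to an immutable file. A rough sketch of the roll-and-upload step, assuming the `parquet` and `opendal` crates and an already-configured `Operator`; the object path, error handling, and helper name are illustrative, not the Storage Node's actual code.

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

/// Serialize Arrow batches into an in-memory Parquet file and hand the
/// bytes to an OpenDAL operator for upload.
async fn flush_to_object_store(
    op: &opendal::Operator,
    batches: &[RecordBatch],
    path: &str,
) -> anyhow::Result<()> {
    let Some(first) = batches.first() else { return Ok(()) };
    let mut buf = Vec::new();
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(&mut buf, first.schema(), Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?; // finalize the Parquet footer
    op.write(path, buf).await?; // upload the stable file
    Ok(())
}
```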
Stage 4: Indexer
After upload, the Indexer builds per-file index artifacts. Supported types include bitmap indexes, bloom or loom filters, zonemaps, and Tantivy indexes for full-text search. The Indexer uploads these files to the object store, then updates the metadata catalog so every index is linked to its Parquet file and time range. Indexes are used later to skip work during queries.
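A zonemap is the simplest of these artifacts: per file, it records the min and max of a column so the planner can skip files whose range cannot match a predicate. A hedged sketch using the arrow compute kernels; the `ZoneMap` struct and the Int64 column handling are assumptions, not OpenWit's on-disk index format.

```rust
use arrow::array::Int64Array;
use arrow::compute::{max, min};
use arrow::record_batch::RecordBatch;

/// One zonemap entry: the min/max of a numeric column for a single file,
/// used later to prune files during query planning. Illustrative only.
struct ZoneMap {
    column: String,
    min: i64,
    max: i64,
}

fn build_zonemap(batch: &RecordBatch, column: &str) -> Option<ZoneMap> {
    let array = batch
        .column_by_name(column)?
        .as_any()
        .downcast_ref::<Int64Array>()?;
    Some(ZoneMap {
        column: column.to_string(),
        min: min(array)?,
        max: max(array)?,
    })
}
```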
Stage 5: Metadata catalog
All file and index details are written to Postgres. Records include file paths, time ranges and versions. The catalog allows the system to trace any result back to its exact files and lets the query planner prune the search space by time and index.
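Catalog writes are ordinary relational inserts. A minimal sketch with `sqlx`, assuming a hypothetical `catalog_files` table and column names; OpenWit's actual catalog schema may differ.

```rust
use sqlx::PgPool;

/// Record one stable Parquet file in the catalog. Table and column
/// names here are assumptions for illustration.
async fn register_file(
    pool: &PgPool,
    path: &str,
    ts_min: i64,
    ts_max: i64,
    version: i64,
) -> Result<(), sqlx::Error> {
    sqlx::query(
        "INSERT INTO catalog_files (file_path, ts_min, ts_max, version)
         VALUES ($1, $2, $3, $4)",
    )
    .bind(path)
    .bind(ts_min)
    .bind(ts_max)
    .bind(version)
    .execute(pool)
    .await?;
    Ok(())
}
```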
Stage 6: Search
The Search Node receives SQL or text queries through the Proxy. It queries Postgres for candidate files in the time range, fetches the required index files, uses the indexes to prune irrelevant Parquet files, then downloads only what is needed. The plan executes in DataFusion and returns an Arrow result to the client. If the data is already cached, the node reads from the cache instead of the object store. This gives OLAP-speed analytics with search-style flexibility.
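Once pruning has produced a short list of Parquet files, execution is a standard DataFusion session. A sketch under the assumption that the surviving files are registered as a table named `logs`; the file path and the SQL statement are illustrative.

```rust
use datafusion::prelude::*;

/// Register pruned Parquet files and execute a query, returning Arrow
/// RecordBatches for the client. Table name, path, and SQL are assumptions.
async fn run_query(
) -> datafusion::error::Result<Vec<datafusion::arrow::record_batch::RecordBatch>> {
    let ctx = SessionContext::new();
    // Only files that survived index pruning would be registered here.
    ctx.register_parquet("logs", "data/logs-0001.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT level, count(*) FROM logs GROUP BY level").await?;
    df.collect().await
}
```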
Communication Fabric Across the Flow
OpenWit uses three channels so each message travels on the right lane:
- Gossip (Chitchat) for low-frequency cluster metadata and node health
- gRPC for control commands, coordination and small RPC messages
- Arrow Flight for high-throughput movement of Arrow columnar batches
Separating these paths keeps coordination light and data transport fast.
Formats and Units at Each Hop
OpenWit uses the format that fits each stage. The table below shows what travels on the wire or lands in storage at each hop.
| Stage | Format or Unit |
|---|---|
| Source → Gateway | JSON or Protobuf |
| Gateway → Ingest | gRPC bulk proto |
| Ingest → Storage | Arrow RecordBatch |
| Storage → Object Store | Parquet |
| Indexer → Object Store | Index files |
| Search → Proxy | Arrow RecordBatch |
Putting the pieces together
- Sources send to the Gateway which authenticates, validates, normalizes and batches, then sends a bulk proto over gRPC.
- The Ingest Node deduplicates, writes short-term then long-term WAL, converts to Arrow and streams over Arrow Flight.
- The Storage Node rolls active Parquet into stable Parquet and uploads via OpenDAL.
- The Indexer builds bitmap, bloom or loom, zonemap and Tantivy artifacts and uploads them, then records them in Postgres.
- The Search Node uses catalog records and indexes to prune, fetches only needed Parquet and executes the plan with DataFusion, then returns Arrow results.