Data Flow Architecture
This page explains how data moves through OpenWit, from the first byte at the edge to a result set returned to a client, walking through each stage of the flow in order.
Principles that shape the flow
OpenWit’s data flow is built on three ideas. First, streaming ingestion because data arrives continuously. Second, columnar transformation so memory and storage are efficient. Third, index-first search so lookups and aggregations are fast.
Stage 1: Data sources → Ingestion Gateway
Producers send data over Kafka, gRPC, or HTTP. The Ingestion Gateway is the front door. It authenticates the request, validates the payload against the dataset schema, normalizes field names for key consistency, then batches records using the configured size or time thresholds. After batching, it sends a bulk protobuf to the Ingest Node over gRPC. This keeps downstream work predictable and safe.
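Conceptually, the batching step is a small buffer that flushes on whichever threshold is crossed first and hands the full batch downstream. The Rust sketch below shows that size-or-time logic in isolation; the `Batcher` type, its thresholds, and the record type are illustrative, not the Gateway's actual code.

```rust
use std::time::{Duration, Instant};

/// Minimal size-or-time batcher sketch. Names and thresholds are
/// assumptions for illustration only.
struct Batcher<T> {
    buf: Vec<T>,
    max_records: usize,
    max_wait: Duration,
    started: Instant,
}

impl<T> Batcher<T> {
    fn new(max_records: usize, max_wait: Duration) -> Self {
        Self { buf: Vec::new(), max_records, max_wait, started: Instant::now() }
    }

    /// Push a record; return a full batch when either threshold is crossed.
    fn push(&mut self, record: T) -> Option<Vec<T>> {
        if self.buf.is_empty() {
            self.started = Instant::now();
        }
        self.buf.push(record);
        if self.buf.len() >= self.max_records || self.started.elapsed() >= self.max_wait {
            Some(std::mem::take(&mut self.buf))
        } else {
            None
        }
    }
}
```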
Stage 2: Ingest Node
The Ingest Node receives the bulk batch and applies deduplication within a time window. It writes the batch to a short-term WAL for immediate durability, aggregates entries into a long-term WAL, then converts the payload into an Arrow RecordBatch. Arrow batches leave the node over Arrow Flight toward storage. This stage is where durability is guaranteed and where data becomes columnar in memory.
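The columnar conversion amounts to pivoting row-shaped records into per-field Arrow arrays. A minimal sketch with the `arrow` crate, assuming a hypothetical two-field schema of timestamp and message; the real schema comes from the dataset definition validated at the Gateway.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Pivot row-shaped (timestamp, message) records into a columnar
/// Arrow RecordBatch. The two-field schema is an assumption.
fn rows_to_batch(rows: &[(i64, String)]) -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("timestamp_ms", DataType::Int64, false),
        Field::new("message", DataType::Utf8, false),
    ]));
    let timestamps = Int64Array::from(rows.iter().map(|(t, _)| *t).collect::<Vec<_>>());
    let messages = StringArray::from(rows.iter().map(|(_, m)| m.as_str()).collect::<Vec<_>>());
    RecordBatch::try_new(
        schema,
        vec![Arc::new(timestamps) as ArrayRef, Arc::new(messages) as ArrayRef],
    )
}
```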
Stage 3: Storage Node
The Storage Node merges incoming Arrow batches into an active Parquet file. When the active file reaches the target size it becomes a stable Parquet file. Stable files are uploaded to the object store using OpenDAL. This node is responsible for turning columnar memory batches into durable columnar files and sending them to cloud storage.
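At this hop the unit of work changes from in-memory batches to an immutable file. A rough sketch of the roll-and-upload step, assuming the `parquet` and `opendal` crates and an already-configured `Operator`; the object path, error handling, and helper name are illustrative, not the Storage Node's actual code.

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

/// Serialize Arrow batches into an in-memory Parquet file and hand the
/// bytes to an OpenDAL operator for upload.
async fn flush_to_object_store(
    op: &opendal::Operator,
    batches: &[RecordBatch],
    path: &str,
) -> anyhow::Result<()> {
    let Some(first) = batches.first() else { return Ok(()) };
    let mut buf = Vec::new();
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(&mut buf, first.schema(), Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?; // finalize the Parquet footer
    op.write(path, buf).await?; // upload the stable file
    Ok(())
}
```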
Stage 4: Indexer
After upload, the Indexer builds per-file index artifacts. Supported types include bitmap indexes, bloom or loom filters, zonemaps, and Tantivy indexes for full-text search. The Indexer uploads these files to the object store, then updates the metadata catalog so every index is linked to its Parquet file and time range. Indexes are used later to skip work during queries.
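A zonemap is the simplest of these artifacts: per file, it records the min and max of a column so the planner can skip files whose range cannot match a predicate. A hedged sketch using the arrow compute kernels; the `ZoneMap` struct and the Int64 column handling are assumptions, not OpenWit's on-disk index format.

```rust
use arrow::array::Int64Array;
use arrow::compute::{max, min};
use arrow::record_batch::RecordBatch;

/// One zonemap entry: the min/max of a numeric column for a single file,
/// used later to prune files during query planning. Illustrative only.
struct ZoneMap {
    column: String,
    min: i64,
    max: i64,
}

fn build_zonemap(batch: &RecordBatch, column: &str) -> Option<ZoneMap> {
    let array = batch
        .column_by_name(column)?
        .as_any()
        .downcast_ref::<Int64Array>()?;
    Some(ZoneMap {
        column: column.to_string(),
        min: min(array)?,
        max: max(array)?,
    })
}
```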
Stage 5: Metadata catalog
All file and index details are written to Postgres. Records include file paths, time ranges and versions. The catalog allows the system to trace any result back to its exact files and lets the query planner prune the search space by time and index.
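Catalog writes are ordinary relational inserts. A minimal sketch with `sqlx`, assuming a hypothetical `catalog_files` table and column names; OpenWit's actual catalog schema may differ.

```rust
use sqlx::PgPool;

/// Record one stable Parquet file in the catalog. Table and column
/// names here are assumptions for illustration.
async fn register_file(
    pool: &PgPool,
    path: &str,
    ts_min: i64,
    ts_max: i64,
    version: i64,
) -> Result<(), sqlx::Error> {
    sqlx::query(
        "INSERT INTO catalog_files (file_path, ts_min, ts_max, version)
         VALUES ($1, $2, $3, $4)",
    )
    .bind(path)
    .bind(ts_min)
    .bind(ts_max)
    .bind(version)
    .execute(pool)
    .await?;
    Ok(())
}
```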
Stage 6: Search
The Search Node receives SQL or text queries through the Proxy. It queries Postgres for candidate files in the time range, fetches the required index files, uses the indexes to prune irrelevant Parquet files, then downloads only what is needed. The plan executes in DataFusion and returns an Arrow result to the client. If the data is already cached, the node reads from the cache instead of the object store. This gives OLAP-speed analytics with search-style flexibility.
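Once pruning has produced a short list of Parquet files, execution is a standard DataFusion session. A sketch under the assumption that the surviving files are registered as a table named `logs`; the file path and the SQL statement are illustrative.

```rust
use datafusion::prelude::*;

/// Register pruned Parquet files and execute a query, returning Arrow
/// RecordBatches for the client. Table name, path, and SQL are assumptions.
async fn run_query(
) -> datafusion::error::Result<Vec<datafusion::arrow::record_batch::RecordBatch>> {
    let ctx = SessionContext::new();
    // Only files that survived index pruning would be registered here.
    ctx.register_parquet("logs", "data/logs-0001.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT level, count(*) FROM logs GROUP BY level").await?;
    df.collect().await
}
```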
Communication Fabric Across the Flow
OpenWit uses three channels so each message travels on the right lane:
- Gossip (Chitchat) for low-frequency cluster metadata and node health
- gRPC for control commands, coordination and small RPC messages
- Arrow Flight for high-throughput movement of Arrow columnar batches
Separating these paths keeps coordination light and data transport fast.
Formats and Units at Each Hop
OpenWit uses the format that fits each stage. The table below shows what travels on the wire or lands in storage at each hop.
| Stage | Format or Unit |
|---|---|
| Source → Gateway | JSON or Protobuf |
| Gateway → Ingest | gRPC bulk proto |
| Ingest → Storage | Arrow RecordBatch |
| Storage → Object Store | Parquet |
| Indexer → Object Store | Index files |
| Search → Proxy | Arrow RecordBatch |
Putting the pieces together
- Sources send to the Gateway which authenticates, validates, normalizes and batches, then sends a bulk proto over gRPC.
- The Ingest Node deduplicates, writes short-term then long-term WAL, converts to Arrow and streams over Arrow Flight.
- The Storage Node rolls active Parquet into stable Parquet and uploads via OpenDAL.
- The Indexer builds bitmap, bloom or loom, zonemap and Tantivy artifacts and uploads them, then records them in Postgres.
- The Search Node uses catalog records and indexes to prune, fetches only needed Parquet and executes the plan with DataFusion, then returns Arrow results.