Key Concepts
This page builds a shared vocabulary for OpenWit. It explains the ideas that show up across the system so you can map features to how data is ingested, stored and queried.
Core Components at a Glance
Each node type plays a focused role in the cluster and exposes clear interfaces. The table below summarizes each role's function and the primary interfaces it uses.
| Component | Function | Key Interfaces |
|---|---|---|
| Control Node | Cluster brain: orchestrates health, routing and discovery | gRPC control plane, Gossip |
| Ingest Node | Receives data, deduplicates, writes WAL, forwards Arrow batches | gRPC bulk ingest, Arrow Flight |
| Kafka/gRPC/HTTP Gateways | Entry adapters for producers | HTTP or OTLP APIs |
| Storage Node | Converts Arrow batches to Parquet, uploads via OpenDAL | Arrow Flight, gRPC, Object store |
| Indexer | Builds bitmap, bloom, zonemap, Tantivy indexes, uploads to object store | gRPC, Object store |
| Proxy Node | Client-facing router that sends queries to warm nodes | HTTP or gRPC |
| Cache Node | Holds hot or warm data in RAM, disk or object tiers | Arrow Flight, gRPC |
| Search Node | Executes DataFusion and Tantivy queries | gRPC, Arrow Flight |
Batch-oriented Ingestion
OpenWit accepts data as structured batches rather than one row at a time. A batch is represented in memory as an Arrow RecordBatch. Batching reduces per event overhead, improves throughput, and makes compression efficient. This is the unit that the system validates, writes to WAL, converts to Arrow, and forwards to storage.
What to remember: think in batches, not rows. Your producers should group events so the gateway can pass compact batches into ingest.
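For orientation, here is a minimal sketch of building an Arrow RecordBatch with the Rust arrow crate. The timestamp and message fields are illustrative, not OpenWit's actual schema.

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Group a handful of events into one columnar batch (illustrative schema).
fn events_to_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("timestamp", DataType::Int64, false),
        Field::new("message", DataType::Utf8, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1_710_000_000, 1_710_000_001])),
            Arc::new(StringArray::from(vec!["login ok", "disk full"])),
        ],
    )
}

fn main() -> Result<(), ArrowError> {
    let batch = events_to_batch()?;
    println!("rows: {}, columns: {}", batch.num_rows(), batch.num_columns());
    Ok(())
}
```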
Dual WALs
OpenWit uses two WAL layers with different purposes. The short-term WAL guarantees immediate durability and safe acknowledgments to senders. The long-term WAL aggregates by time, typically daily, which supports recovery, indexing and retention work. Together they protect incoming data and make later background processing reliable.
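As a rough illustration of the split, the sketch below appends to a short-term WAL and fsyncs before the sender is acknowledged, then derives a daily bucket for the long-term WAL. The file names and layout are hypothetical, not OpenWit's on-disk format.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;

fn main() -> std::io::Result<()> {
    fs::create_dir_all("wal")?;

    // Short-term WAL: append and fsync so the ack back to the sender is safe.
    let mut short_term = OpenOptions::new()
        .create(true)
        .append(true)
        .open("wal/short-term.wal")?;
    short_term.write_all(b"batch-0001 payload\n")?;
    short_term.sync_data()?;

    // Long-term WAL: aggregated by time, typically one bucket per day,
    // which later drives recovery, indexing and retention work.
    let long_term_bucket = format!("wal/long-term/{}.wal", "2024-01-15");
    println!("would roll batch-0001 into {long_term_bucket}");
    Ok(())
}
```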
Actor-based Concurrency
Each subsystem runs as an independent actor that communicates over async message passing. This model gives high concurrency without blocking and isolates work units so components stay responsive under load. In practice you see actors in ingest, storage and indexing.
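A bare-bones version of the pattern, assuming Tokio channels for the async message passing; OpenWit's actual actor types and messages will differ.

```rust
use tokio::sync::mpsc;

// Messages an ingest-style actor might handle (illustrative only).
enum IngestMsg {
    Batch(Vec<u8>), // opaque payload standing in for an Arrow batch
    Shutdown,
}

// The actor owns its state and processes one message at a time,
// so no locks are needed and other actors are never blocked.
async fn ingest_actor(mut rx: mpsc::Receiver<IngestMsg>) {
    while let Some(msg) = rx.recv().await {
        match msg {
            IngestMsg::Batch(bytes) => {
                // validate, deduplicate, write WAL, forward downstream ...
                println!("handling batch of {} bytes", bytes.len());
            }
            IngestMsg::Shutdown => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1024);
    let handle = tokio::spawn(ingest_actor(rx));
    tx.send(IngestMsg::Batch(vec![0u8; 64])).await.unwrap();
    tx.send(IngestMsg::Shutdown).await.unwrap();
    handle.await.unwrap();
}
```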
Zero-copy Data Movement
Data moves across components by reference using Arrow buffers. Passing pointers instead of copying payloads avoids extra memory churn between the network and compute layers, which preserves throughput as batches travel from ingest to storage and onward.
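The arrow crate makes this concrete: slicing or cloning a RecordBatch shares the underlying buffers by reference count instead of copying them. A small self-contained illustration with a throwaway single-column schema:

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    let view = batch.slice(0, 2); // zero-copy view over the first two rows
    let handoff = batch.clone();  // clones Arc'd buffer handles, not the payload

    assert_eq!(view.num_rows(), 2);
    assert_eq!(handoff.num_rows(), batch.num_rows());
    Ok(())
}
```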
Object Storage Tiering
Data flows down a hot → warm → cold path. RAM holds in-flight columnar data, local disk holds hot working files, and the object store is the durable cold tier. This layout keeps costs predictable and still supports fast reads for active data.
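A toy lookup cascade makes the ordering explicit; the types here are purely illustrative, not OpenWit's cache API.

```rust
// Illustrative tier cascade only; OpenWit's cache node defines its own types.
#[derive(Debug)]
enum Tier {
    Ram,         // in-flight columnar data
    Disk,        // hot working files
    ObjectStore, // durable cold tier
}

fn locate(batch_id: &str, in_ram: &[&str], on_disk: &[&str]) -> Tier {
    if in_ram.contains(&batch_id) {
        Tier::Ram
    } else if on_disk.contains(&batch_id) {
        Tier::Disk
    } else {
        Tier::ObjectStore
    }
}

fn main() {
    let tier = locate("batch-42", &["batch-7"], &["batch-42"]);
    println!("batch-42 is served from the {tier:?} tier");
}
```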
Unified Metadata
Postgres is the single source of truth for batch, file and index metadata. The catalog keeps the cluster consistent so search can prune correctly and operators can trace any result back to the exact files.
Core metadata tables
| Table | Purpose |
|---|---|
| batches | Tracks ingested batches such as size, time range, WAL status |
| files | Tracks Parquet files such as location, timestamps, upload status |
| indexes | Tracks index files such as type, file path, time range |
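As a hedged sketch of how the catalog is consulted, the query below prunes the batches table to a time window with sqlx. The column names are extrapolated from the table above and, like the connection string, are assumptions rather than the real schema.

```rust
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Placeholder DSN; point this at the catalog database.
    let pool = PgPoolOptions::new()
        .max_connections(4)
        .connect("postgres://openwit:openwit@localhost/openwit")
        .await?;

    // Keep only batches whose time range overlaps the query window
    // (column names are assumed for illustration).
    let rows = sqlx::query(
        "SELECT id, size_bytes, wal_status
         FROM batches
         WHERE time_range_end >= $1 AND time_range_start <= $2",
    )
    .bind(1_710_000_000_i64)
    .bind(1_710_086_400_i64)
    .fetch_all(&pool)
    .await?;

    println!("candidate batches: {}", rows.len());
    Ok(())
}
```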
Extensible Indexing
You can enable multiple index types per schema or field. Bitmap indexes help equality filters on categorical fields. Bloom filters help membership tests. Zonemaps help numeric and time range pruning. Tantivy adds full-text search over messages. Index files live in object storage and are linked to their Parquet files in metadata.
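For the full-text piece, here is a minimal Tantivy example that indexes a message field in RAM and runs a term query; the field name is illustrative, and in OpenWit the resulting index files live in object storage rather than RAM.

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema with one full-text field (illustrative).
    let mut schema_builder = Schema::builder();
    let message = schema_builder.add_text_field("message", TEXT | STORED);
    let schema = schema_builder.build();

    // In-memory index for the sketch; real index files go to object storage.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(message => "disk full on node 3"))?;
    writer.commit()?;

    // Query the indexed messages.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![message]).parse_query("disk")?;
    let hits = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("matching documents: {}", hits.len());
    Ok(())
}
```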
Config-driven Architecture
One YAML file controls the system. You configure batching, deduplication windows, WAL thresholds, Parquet file sizing, indexing mode and cache tiers in openwit-unified-control.yaml. The same image can start in monolith or distributed mode by reading this config.
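A hedged sketch of reading that file with serde: the struct mirrors the knobs named above, but the exact keys and types are assumptions rather than OpenWit's real configuration schema.

```rust
use serde::Deserialize;

// Assumed shape for illustration only; consult openwit-unified-control.yaml
// for the real keys.
#[derive(Debug, Deserialize)]
struct UnifiedConfig {
    mode: String, // "monolith" or "distributed"
    batching: Batching,
    dedup_window_secs: u64,
    wal_flush_threshold_bytes: u64,
    parquet_target_file_mb: u64,
    indexing_mode: String,
}

#[derive(Debug, Deserialize)]
struct Batching {
    max_rows: usize,
    max_bytes: usize,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("openwit-unified-control.yaml")?;
    let cfg: UnifiedConfig = serde_yaml::from_str(&raw)?;
    println!("{cfg:#?}");
    Ok(())
}
```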
Communication Layers
OpenWit uses three channels with clear responsibilities:
- Gossip (Chitchat) for low frequency cluster metadata and node health
- gRPC for control commands, coordination and small RPC messages
- Arrow Flight for high throughput transfer of Arrow columnar batches
Choosing the right lane for the right payload keeps coordination light and data movement fast.
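To make the data lane concrete: Arrow Flight carries record batches encoded in the Arrow IPC stream format over gRPC. The sketch below performs just that framing step with the arrow crate's IPC writer, leaving out the Flight client plumbing.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Frame the batch with the Arrow IPC stream format, the same encoding
    // that travels inside Arrow Flight messages.
    let mut framed = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut framed, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
    }
    println!("IPC-framed batch: {} bytes", framed.len());
    Ok(())
}
```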
Data Representation at Each Stage
Different stages use formats that fit the job. Entry adapters accept JSON or Protobuf, ingest receives a bulk proto over gRPC, the storage path carries Arrow batches, and durable files are Parquet. Search returns an Arrow RecordBatch as the result.
Stage → Format
| Stage | Format |
|---|---|
| Entry → Gateway | JSON or Protobuf |
| Gateway → Ingest | gRPC bulk proto |
| Ingest → Storage | Arrow RecordBatch |
| Storage → Object Store | Parquet |
| Indexes → Object Store | Index files such as bitmap, tantivy, bloom |
| Search → Proxy | Arrow RecordBatch results |
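The Storage → Object Store step can be sketched with the parquet crate's ArrowWriter. The local file path stands in for the OpenDAL upload, and the single-column schema is illustrative.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Write the Arrow batch out as a Parquet file; in OpenWit this file
    // would then be uploaded to the object store via OpenDAL.
    let file = File::create("batch-0001.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?; // None = default properties
    writer.write(&batch)?;
    writer.close()?; // finalizes the Parquet footer
    Ok(())
}
```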
How the Ideas Connect in the Pipeline
The ingestion gateway authenticates, validates schema, normalizes fields and batches data. The ingest node deduplicates, writes short-term then long-term WAL, converts to Arrow and forwards batches over Arrow Flight. The storage node builds active then stable Parquet and uploads with OpenDAL. The indexer builds file-level indexes and records them in metadata. The search node looks up time ranges in Postgres, fetches required indexes, prunes Parquet, then executes SQL or text queries and returns Arrow results. These steps apply the concepts above in a single flow.