Key Concepts
This page builds a shared vocabulary for OpenWit. It explains the ideas that show up across the system so you can map features to how data is ingested, stored and queried.
Core Components at a Glance
Each node type plays a focused role in the cluster and exposes clear interfaces. The table below summarizes each role's function and the primary interfaces it uses.
| Component | Function | Key Interfaces |
|---|---|---|
| Control Node | Cluster brain: orchestrates health, routing and discovery | gRPC control plane, Gossip |
| Ingest Node | Receives data, deduplicates, writes WAL, forwards Arrow batches | gRPC bulk ingest, Arrow Flight |
| Kafka/gRPC/HTTP Gateways | Entry adapters for producers | HTTP or OTLP APIs |
| Storage Node | Converts Arrow batches to Parquet, uploads via OpenDAL | Arrow Flight, gRPC, Object store |
| Indexer | Builds bitmap, bloom, zonemap, Tantivy indexes, uploads to object store | gRPC, Object store |
| Proxy Node | Client-facing router that sends queries to warm nodes | HTTP or gRPC |
| Cache Node | Holds hot or warm data in RAM, disk or object tiers | Arrow Flight, gRPC |
| Search Node | Executes DataFusion and Tantivy queries | gRPC, Arrow Flight |
Batch-oriented Ingestion
OpenWit accepts data as structured batches rather than one row at a time. A batch is represented in memory as an Arrow RecordBatch. Batching reduces per event overhead, improves throughput, and makes compression efficient. This is the unit that the system validates, writes to WAL, converts to Arrow, and forwards to storage.
What to remember: think in batches, not rows. Your producers should group events so the gateway can pass compact batches into ingest.
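For orientation, here is a minimal sketch of building an Arrow RecordBatch with the Rust arrow crate. The timestamp and message fields are illustrative, not OpenWit's actual schema.

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Group a handful of events into one columnar batch (illustrative schema).
fn events_to_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("timestamp", DataType::Int64, false),
        Field::new("message", DataType::Utf8, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1_710_000_000, 1_710_000_001])),
            Arc::new(StringArray::from(vec!["login ok", "disk full"])),
        ],
    )
}

fn main() -> Result<(), ArrowError> {
    let batch = events_to_batch()?;
    println!("rows: {}, columns: {}", batch.num_rows(), batch.num_columns());
    Ok(())
}
```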
Dual WALs
OpenWit uses two WAL layers with different purposes. The short-term WAL guarantees immediate durability and safe acknowledgments to senders. The long-term WAL aggregates by time, typically daily, which supports recovery, indexing and retention work. Together they protect incoming data and make later background processing reliable.
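As a rough illustration of the split, the sketch below appends to a short-term WAL and fsyncs before the sender is acknowledged, then derives a daily bucket for the long-term WAL. The file names and layout are hypothetical, not OpenWit's on-disk format.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;

fn main() -> std::io::Result<()> {
    fs::create_dir_all("wal")?;

    // Short-term WAL: append and fsync so the ack back to the sender is safe.
    let mut short_term = OpenOptions::new()
        .create(true)
        .append(true)
        .open("wal/short-term.wal")?;
    short_term.write_all(b"batch-0001 payload\n")?;
    short_term.sync_data()?;

    // Long-term WAL: aggregated by time, typically one bucket per day,
    // which later drives recovery, indexing and retention work.
    let long_term_bucket = format!("wal/long-term/{}.wal", "2024-01-15");
    println!("would roll batch-0001 into {long_term_bucket}");
    Ok(())
}
```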
Actor-based Concurrency
Each subsystem runs as an independent actor that communicates over async message passing. This model gives high concurrency without blocking and isolates work units so components stay responsive under load. In practice you see actors in ingest, storage and indexing.
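A bare-bones version of the pattern, assuming Tokio channels for the async message passing; OpenWit's actual actor types and messages will differ.

```rust
use tokio::sync::mpsc;

// Messages an ingest-style actor might handle (illustrative only).
enum IngestMsg {
    Batch(Vec<u8>), // opaque payload standing in for an Arrow batch
    Shutdown,
}

// The actor owns its state and processes one message at a time,
// so no locks are needed and other actors are never blocked.
async fn ingest_actor(mut rx: mpsc::Receiver<IngestMsg>) {
    while let Some(msg) = rx.recv().await {
        match msg {
            IngestMsg::Batch(bytes) => {
                // validate, deduplicate, write WAL, forward downstream ...
                println!("handling batch of {} bytes", bytes.len());
            }
            IngestMsg::Shutdown => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1024);
    let handle = tokio::spawn(ingest_actor(rx));
    tx.send(IngestMsg::Batch(vec![0u8; 64])).await.unwrap();
    tx.send(IngestMsg::Shutdown).await.unwrap();
    handle.await.unwrap();
}
```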
Zero-copy Data Movement
Data moves across components by reference using Arrow buffers. Passing pointers instead of copying payloads avoids extra memory churn between the network and compute layers, which preserves throughput as batches travel from ingest to storage and onward.
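The arrow crate makes this concrete: slicing or cloning a RecordBatch shares the underlying buffers by reference count instead of copying them. A small self-contained illustration with a throwaway single-column schema:

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    let view = batch.slice(0, 2); // zero-copy view over the first two rows
    let handoff = batch.clone();  // clones Arc'd buffer handles, not the payload

    assert_eq!(view.num_rows(), 2);
    assert_eq!(handoff.num_rows(), batch.num_rows());
    Ok(())
}
```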
Object Storage Tiering
Data flows down a hot → warm → cold path. RAM holds in-flight columnar data, local disk holds hot working files, and the object store is the durable cold tier. This layout keeps costs predictable and still supports fast reads for active data.
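A toy lookup cascade makes the ordering explicit; the types here are purely illustrative, not OpenWit's cache API.

```rust
// Illustrative tier cascade only; OpenWit's cache node defines its own types.
#[derive(Debug)]
enum Tier {
    Ram,         // in-flight columnar data
    Disk,        // hot working files
    ObjectStore, // durable cold tier
}

fn locate(batch_id: &str, in_ram: &[&str], on_disk: &[&str]) -> Tier {
    if in_ram.contains(&batch_id) {
        Tier::Ram
    } else if on_disk.contains(&batch_id) {
        Tier::Disk
    } else {
        Tier::ObjectStore
    }
}

fn main() {
    let tier = locate("batch-42", &["batch-7"], &["batch-42"]);
    println!("batch-42 is served from the {tier:?} tier");
}
```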
Unified Metadata
Postgres is the single source of truth for batch, file and index metadata. The catalog keeps the cluster consistent so search can prune correctly and operators can trace any result back to the exact files.
Core metadata tables
| Table | Purpose |
|---|---|
| batches | Tracks ingested batches such as size, time range, WAL status |
| files | Tracks Parquet files such as location, timestamps, upload status |
| indexes | Tracks index files such as type, file path, time range |
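As a hedged sketch of how the catalog is consulted, the query below prunes the batches table to a time window with sqlx. The column names are extrapolated from the table above and, like the connection string, are assumptions rather than the real schema.

```rust
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Placeholder DSN; point this at the catalog database.
    let pool = PgPoolOptions::new()
        .max_connections(4)
        .connect("postgres://openwit:openwit@localhost/openwit")
        .await?;

    // Keep only batches whose time range overlaps the query window
    // (column names are assumed for illustration).
    let rows = sqlx::query(
        "SELECT id, size_bytes, wal_status
         FROM batches
         WHERE time_range_end >= $1 AND time_range_start <= $2",
    )
    .bind(1_710_000_000_i64)
    .bind(1_710_086_400_i64)
    .fetch_all(&pool)
    .await?;

    println!("candidate batches: {}", rows.len());
    Ok(())
}
```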
Extensible Indexing
You can enable multiple index types per schema or field. Bitmap indexes help equality filters on categorical fields. Bloom filters help membership tests. Zonemaps help numeric and time range pruning. Tantivy adds full-text search over messages. Index files live in object storage and are linked to their Parquet files in metadata.
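For the full-text piece, here is a minimal Tantivy example that indexes a message field in RAM and runs a term query; the field name is illustrative, and in OpenWit the resulting index files live in object storage rather than RAM.

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema with one full-text field (illustrative).
    let mut schema_builder = Schema::builder();
    let message = schema_builder.add_text_field("message", TEXT | STORED);
    let schema = schema_builder.build();

    // In-memory index for the sketch; real index files go to object storage.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(message => "disk full on node 3"))?;
    writer.commit()?;

    // Query the indexed messages.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![message]).parse_query("disk")?;
    let hits = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("matching documents: {}", hits.len());
    Ok(())
}
```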
Config-driven Architecture
One YAML file controls the system. You configure batching, deduplication windows, WAL thresholds, Parquet file sizing, indexing mode and cache tiers in openwit-unified-control.yaml. The same image can start in monolith or distributed mode by reading this config.
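A hedged sketch of reading that file with serde: the struct mirrors the knobs named above, but the exact keys and types are assumptions rather than OpenWit's real configuration schema.

```rust
use serde::Deserialize;

// Assumed shape for illustration only; consult openwit-unified-control.yaml
// for the real keys.
#[derive(Debug, Deserialize)]
struct UnifiedConfig {
    mode: String, // "monolith" or "distributed"
    batching: Batching,
    dedup_window_secs: u64,
    wal_flush_threshold_bytes: u64,
    parquet_target_file_mb: u64,
    indexing_mode: String,
}

#[derive(Debug, Deserialize)]
struct Batching {
    max_rows: usize,
    max_bytes: usize,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("openwit-unified-control.yaml")?;
    let cfg: UnifiedConfig = serde_yaml::from_str(&raw)?;
    println!("{cfg:#?}");
    Ok(())
}
```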
Communication Layers
OpenWit uses three channels with clear responsibilities:
- Gossip (Chitchat) for low frequency cluster metadata and node health
- gRPC for control commands, coordination and small RPC messages
- Arrow Flight for high throughput transfer of Arrow columnar batches
Choosing the right lane for the right payload keeps coordination light and data movement fast.
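To make the data lane concrete: Arrow Flight carries record batches encoded in the Arrow IPC stream format over gRPC. The sketch below performs just that framing step with the arrow crate's IPC writer, leaving out the Flight client plumbing.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Frame the batch with the Arrow IPC stream format, the same encoding
    // that travels inside Arrow Flight messages.
    let mut framed = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut framed, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
    }
    println!("IPC-framed batch: {} bytes", framed.len());
    Ok(())
}
```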
Data Representation at Each Stage
Different stages use formats that fit the job. Entry adapters accept JSON or Protobuf, ingest receives a bulk proto over gRPC, the storage path carries Arrow batches, and durable files are Parquet. Search returns an Arrow RecordBatch as the result.
Stage → Format
| Stage | Format |
|---|---|
| Entry → Gateway | JSON or Protobuf |
| Gateway → Ingest | gRPC bulk proto |
| Ingest → Storage | Arrow RecordBatch |
| Storage → Object Store | Parquet |
| Indexes → Object Store | Index files such as bitmap, tantivy, bloom |
| Search → Proxy | Arrow RecordBatch results |
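The Storage → Object Store step can be sketched with the parquet crate's ArrowWriter. The local file path stands in for the OpenDAL upload, and the single-column schema is illustrative.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Write the Arrow batch out as a Parquet file; in OpenWit this file
    // would then be uploaded to the object store via OpenDAL.
    let file = File::create("batch-0001.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?; // None = default properties
    writer.write(&batch)?;
    writer.close()?; // finalizes the Parquet footer
    Ok(())
}
```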
How the Ideas Connect in the Pipeline
The ingestion gateway authenticates, validates schema, normalizes fields and batches data. The ingest node deduplicates, writes short-term then long-term WAL, converts to Arrow and forwards batches over Arrow Flight. The storage node builds active then stable Parquet and uploads with OpenDAL. The indexer builds file-level indexes and records them in metadata. The search node looks up time ranges in Postgres, fetches required indexes, prunes Parquet, then executes SQL or text queries and returns Arrow results. These steps apply the concepts above in a single flow.