Key Concepts

This page builds a shared vocabulary for OpenWit. It explains the ideas that show up across the system so you can map features to how data is ingested, stored and queried.

Core Components at a Glance

Each node type plays a focused role in the cluster and exposes clear interfaces. This table summarizes the function and the primary interfaces used by each role.

| Component | Function | Key Interfaces |
|---|---|---|
| Control Node | Cluster brain: orchestrates health, routing and discovery | gRPC control plane, Gossip |
| Ingest Node | Receives data, deduplicates, writes WAL, forwards Arrow batches | gRPC bulk ingest, Arrow Flight |
| Kafka/gRPC/HTTP Gateways | Entry adapters for producers | HTTP or OTLP APIs |
| Storage Node | Converts Arrow batches to Parquet, uploads via OpenDAL | Arrow Flight, gRPC, Object store |
| Indexer | Builds bitmap, bloom, zonemap and Tantivy indexes, uploads to object store | gRPC, Object store |
| Proxy Node | Client-facing router that sends queries to warm nodes | HTTP or gRPC |
| Cache Node | Holds hot or warm data in RAM, disk or object tiers | Arrow Flight, gRPC |
| Search Node | Executes DataFusion and Tantivy queries | gRPC, Arrow Flight |

Batch-oriented Ingestion

OpenWit accepts data as structured batches rather than one row at a time. A batch is represented in memory as an Arrow RecordBatch. Batching reduces per-event overhead, improves throughput, and makes compression efficient. This is the unit the system validates, writes to the WAL, converts to Arrow, and forwards to storage.

What to remember: think in batches, not rows. Your producers should group events so the gateway can pass compact batches into ingest.
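
A minimal sketch of building such a batch with the Rust arrow crate. The two-column log schema (timestamp plus message) and the field names are assumptions for illustration; OpenWit's actual schemas are defined elsewhere.

```rust
use std::sync::Arc;

use arrow::array::{StringArray, TimestampMicrosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;

fn build_batch() -> arrow::error::Result<RecordBatch> {
    // Hypothetical two-column schema: event timestamp plus message body.
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "timestamp",
            DataType::Timestamp(TimeUnit::Microsecond, None),
            false,
        ),
        Field::new("message", DataType::Utf8, false),
    ]));

    // Producers group events so each RecordBatch carries many rows at once.
    let timestamps = TimestampMicrosecondArray::from(vec![
        1_700_000_000_000_000i64,
        1_700_000_000_500_000,
    ]);
    let messages = StringArray::from(vec!["service started", "request handled"]);

    RecordBatch::try_new(schema, vec![Arc::new(timestamps), Arc::new(messages)])
}
```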

Dual WALs

OpenWit uses two WAL layers with different purposes. The short-term WAL guarantees immediate durability and safe acknowledgments to senders. The long-term WAL aggregates by time, typically daily, which supports recovery, indexing and retention work. Together they protect incoming data and make later background processing reliable.
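
To make the ordering concrete, here is a hedged sketch of the short-term side: append a serialized batch, fsync, and only then acknowledge the sender. The file name, record framing and the way the daily long-term WAL is aggregated are illustrative assumptions, not OpenWit's actual format.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one serialized batch to the short-term WAL and make it durable
/// before the ingest node acknowledges the sender.
fn append_then_ack(wal_dir: &Path, batch_bytes: &[u8]) -> io::Result<()> {
    let mut wal = OpenOptions::new()
        .create(true)
        .append(true)
        .open(wal_dir.join("short_term.wal"))?;

    // Length-prefix each record so recovery can re-split the file.
    wal.write_all(&(batch_bytes.len() as u64).to_le_bytes())?;
    wal.write_all(batch_bytes)?;

    // Durability barrier: only after fsync returns is it safe to ack.
    wal.sync_all()?;
    Ok(())
}
```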

Actor-based Concurrency

Each subsystem runs as an independent actor that communicates over async message passing. This model gives high concurrency without blocking and isolates work units so components stay responsive under load. In practice you see actors in ingest, storage and indexing.
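
A small sketch of the pattern using Tokio tasks and channels. The message type, channel capacity and actor name are hypothetical; OpenWit's actual actors may be built differently.

```rust
use tokio::sync::mpsc;

/// Hypothetical messages an ingest-style actor might handle.
enum IngestMsg {
    Batch(Vec<u8>),
    Flush,
}

/// Spawn an actor: a task that owns its state and reacts to messages.
/// Call this from inside a running Tokio runtime.
fn spawn_ingest_actor() -> mpsc::Sender<IngestMsg> {
    let (tx, mut rx) = mpsc::channel::<IngestMsg>(1024);
    tokio::spawn(async move {
        let mut buffered = 0usize;
        while let Some(msg) = rx.recv().await {
            match msg {
                IngestMsg::Batch(bytes) => buffered += bytes.len(),
                IngestMsg::Flush => {
                    // A real actor would hand buffered data to the WAL or
                    // storage path here without blocking other actors.
                    println!("flushing {buffered} buffered bytes");
                    buffered = 0;
                }
            }
        }
    });
    tx
}
```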

Zero-copy Data Movement

Data moves across components by reference using Arrow buffers. Passing pointers instead of copying payloads avoids extra memory churn between the network and compute layers, which preserves throughput as batches travel from ingest to storage and onward.
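
Within a process this is what Arrow gives you for free: cloning or slicing a RecordBatch shares the underlying buffers through reference counting rather than copying rows. A small sketch with the Rust arrow crate:

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let values = Int64Array::from((0i64..1_000_000).collect::<Vec<i64>>());
    let batch = RecordBatch::try_new(schema, vec![Arc::new(values)])?;

    // Clones and slices share the same Arrow buffers by reference;
    // no row data is copied when a batch is handed to another component.
    let handed_off = batch.clone();
    let window = batch.slice(0, 10);

    assert_eq!(handed_off.num_rows(), 1_000_000);
    assert_eq!(window.num_rows(), 10);
    Ok(())
}
```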

Object Storage Tiering

Data flows down a hot → warm → cold path. RAM holds in-flight columnar data, local disk holds hot working files, and the object store is the durable cold tier. This layout keeps costs predictable and still supports fast reads for active data.
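
For the cold tier, a hedged sketch of an upload through OpenDAL. The bucket, path and builder options are assumptions, and the exact builder methods vary across opendal releases; this is not OpenWit's storage code.

```rust
use opendal::services::S3;
use opendal::Operator;

#[tokio::main]
async fn main() -> opendal::Result<()> {
    // Illustrative bucket and region; credentials come from the environment.
    let builder = S3::default()
        .bucket("openwit-cold-tier")
        .region("us-east-1");

    let op = Operator::new(builder)?.finish();

    // Durable cold tier: stable Parquet files land in the object store.
    let parquet_bytes: Vec<u8> = vec![]; // placeholder for real file contents
    op.write("parquet/2024-01-01/batch-000.parquet", parquet_bytes)
        .await?;
    Ok(())
}
```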

Unified Metadata

Postgres is the single source of truth for batch, file and index metadata. The catalog keeps the cluster consistent so search can prune correctly and operators can trace any result back to the exact files.

Core metadata tables

| Table | Purpose |
|---|---|
| batches | Tracks ingested batches such as size, time range, WAL status |
| files | Tracks Parquet files such as location, timestamps, upload status |
| indexes | Tracks index files such as type, file path, time range |
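
As an illustration of how a reader of the catalog might use these tables, here is a hedged sqlx sketch that asks a hypothetical files table for Parquet files overlapping a time range. The column names and status value are assumptions, not OpenWit's actual schema.

```rust
use sqlx::postgres::PgPoolOptions;
use sqlx::Row;

/// Look up Parquet file locations whose time range overlaps the query.
async fn files_in_range(
    database_url: &str,
    start_micros: i64,
    end_micros: i64,
) -> Result<Vec<String>, sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(4)
        .connect(database_url)
        .await?;

    // Hypothetical columns: min_ts/max_ts bounds plus an upload status flag.
    let rows = sqlx::query(
        "SELECT location FROM files \
         WHERE max_ts >= $1 AND min_ts <= $2 AND upload_status = 'uploaded'",
    )
    .bind(start_micros)
    .bind(end_micros)
    .fetch_all(&pool)
    .await?;

    Ok(rows.iter().map(|r| r.get::<String, _>("location")).collect())
}
```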

Extensible Indexing

You can enable multiple index types per schema or field. Bitmap indexes help equality filters on categorical fields. Bloom filters help membership tests. Zonemaps help numeric and time-range pruning. Tantivy adds full-text search over messages. Index files live in object storage and are linked to their Parquet files in metadata.
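
Zonemap-style pruning is simple enough to show directly. This sketch keeps only files whose recorded min/max timestamps can overlap the query range; the structure is illustrative and not OpenWit's on-disk index format.

```rust
/// Hypothetical zonemap entry: per-file min/max for a timestamp column.
struct ZoneMap {
    min_ts: i64,
    max_ts: i64,
}

/// Keep only files whose time range can overlap the query; everything else
/// is skipped before any Parquet bytes are read.
fn prune<'a>(zones: &'a [(String, ZoneMap)], start: i64, end: i64) -> Vec<&'a str> {
    zones
        .iter()
        .filter(|(_, z)| z.max_ts >= start && z.min_ts <= end)
        .map(|(path, _)| path.as_str())
        .collect()
}
```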

Config-driven Architecture

One YAML file controls the system. You configure batching, deduplication windows, WAL thresholds, Parquet file sizing, indexing mode and cache tiers in openwit-unified-control.yaml. The same image can start in monolith or distributed mode by reading this config.
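
A hedged sketch of what config-driven startup can look like: deserialize a few sections of the YAML into typed structs with serde. The key names and values below are invented for illustration and are not the real openwit-unified-control.yaml schema.

```rust
use serde::Deserialize;

/// Hypothetical shape of a few config sections.
#[derive(Debug, Deserialize)]
struct UnifiedConfig {
    ingest: IngestConfig,
    storage: StorageConfig,
}

#[derive(Debug, Deserialize)]
struct IngestConfig {
    batch_max_rows: usize,
    dedup_window_secs: u64,
    wal_flush_bytes: u64,
}

#[derive(Debug, Deserialize)]
struct StorageConfig {
    parquet_target_mb: u64,
}

fn main() -> Result<(), serde_yaml::Error> {
    // Illustrative values only.
    let yaml = r#"
ingest:
  batch_max_rows: 10000
  dedup_window_secs: 300
  wal_flush_bytes: 67108864
storage:
  parquet_target_mb: 256
"#;
    let cfg: UnifiedConfig = serde_yaml::from_str(yaml)?;
    println!("{cfg:?}");
    Ok(())
}
```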

Communication Layers

OpenWit uses three channels with clear responsibilities:

  • Gossip (Chitchat) for low frequency cluster metadata and node health
  • gRPC for control commands, coordination and small RPC messages
  • Arrow Flight for high throughput transfer of Arrow columnar batches

Choosing the right lane for the right payload keeps coordination light and data movement fast.
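
To see why the data lane is separate, it helps to look at what travels on it: Arrow Flight carries batches in Arrow's IPC representation, which stays columnar end to end, while gossip and gRPC messages stay small. A sketch of encoding a RecordBatch into that form with the Rust arrow crate; the single-column batch is only an example.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Encode a RecordBatch with the Arrow IPC stream format.
fn encode(batch: &RecordBatch) -> arrow::error::Result<Vec<u8>> {
    let mut buf = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut buf, batch.schema().as_ref())?;
        writer.write(batch)?;
        writer.finish()?;
    }
    Ok(buf)
}

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int64Array::from(vec![1, 2, 3]))])?;
    let bytes = encode(&batch)?;
    println!("{} IPC bytes for {} rows", bytes.len(), batch.num_rows());
    Ok(())
}
```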

Data Representation at Each Stage

Different stages use formats that fit the job. Entry adapters accept JSON or Protobuf, ingest receives a bulk proto over gRPC, the storage path carries Arrow batches, and durable files are Parquet. Search returns an Arrow RecordBatch as the result.

Stage → Format

| Stage | Format |
|---|---|
| Entry → Gateway | JSON or Protobuf |
| Gateway → Ingest | gRPC bulk proto |
| Ingest → Storage | Arrow RecordBatch |
| Storage → Object Store | Parquet |
| Indexes → Object Store | Index files such as bitmap, tantivy, bloom |
| Search → Proxy | Arrow RecordBatch results |

How the Ideas Connect in the Pipeline

The ingestion gateway authenticates, validates schema, normalizes fields and batches data. The ingest node deduplicates, writes short-term then long-term WAL, converts to Arrow and forwards batches over Arrow Flight. The storage node builds active then stable Parquet and uploads with OpenDAL. The indexer builds file-level indexes and records them in metadata. The search node looks up time ranges in Postgres, fetches required indexes, prunes Parquet, then executes SQL or text queries and returns Arrow results. These steps apply the concepts above in a single flow.