User Guide - Overview

This page gives you a clear mental model of Openwit: what it is, when to use it, how it is organized, and how data moves through it from ingestion to query. The language is kept simple and the description complete so you can map features to your own environment.

What Openwit Is

Openwit is a Rust-native distributed OLAP database for telemetry data such as logs, traces and metrics. It combines streaming ingestion, columnar storage, search indexing, and SQL analytics in one system that scales out across nodes or runs as a compact single-node monolith.

When It Fits Best

  • Data arrives continuously from many sources
  • Queries need sub-second latency on large time-ordered datasets
  • Durability, schema consistency and cost-efficient retention matter

How Openwit Is Organized

Openwit is built from small focused roles that you can run together on one machine or scale independently in a cluster.

  • Control coordinates node health, routing and discovery
  • Ingest receives batches, deduplicates them, writes the WAL, converts to Arrow, and forwards to Storage
  • Gateways accept data over Kafka, gRPC and HTTP
  • Storage writes Parquet and uploads to object storage using OpenDAL
  • Indexer builds bitmap, bloom or loom, zonemap and Tantivy indexes
  • Search executes SQL with DataFusion and uses indexes for pruning
  • Cache holds hot data in RAM and on disk for fast reads
  • Proxy fronts client access and routes queries
  • Janitor cleans expired data and reconciles metadata

All roles communicate over gossip for lightweight cluster state, gRPC for control, and Arrow Flight for high-throughput columnar data movement.
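
To make the "run together on one machine or scale independently" idea concrete, here is a minimal sketch in Rust. The `NodeRole` and `NodeConfig` names and constructors are hypothetical, for illustration only; they are not Openwit's real types or deployment configuration.

```rust
// Hypothetical sketch: the role names mirror the list above, but these
// types are illustrative only, not Openwit's actual API or configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeRole {
    Control,
    Ingest,
    Gateway,
    Storage,
    Indexer,
    Search,
    Cache,
    Proxy,
    Janitor,
}

/// A node is a named machine plus the set of roles it runs.
struct NodeConfig {
    name: String,
    roles: Vec<NodeRole>,
}

impl NodeConfig {
    /// Compact single-node monolith: every role on one machine.
    fn monolith(name: &str) -> Self {
        use NodeRole::*;
        Self {
            name: name.to_string(),
            roles: vec![Control, Ingest, Gateway, Storage, Indexer, Search, Cache, Proxy, Janitor],
        }
    }

    /// Specialized cluster member: one role, scaled independently.
    fn dedicated(name: &str, role: NodeRole) -> Self {
        Self { name: name.to_string(), roles: vec![role] }
    }
}

fn main() {
    let dev = NodeConfig::monolith("dev-box");
    let search = NodeConfig::dedicated("search-1", NodeRole::Search);
    println!("{} runs {} roles", dev.name, dev.roles.len());
    println!("{} runs {:?}", search.name, search.roles);
}
```

The same role list backs both shapes: a development box runs everything, while a production cluster adds or removes nodes for one role at a time.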

Concepts that Shape the System

  • Batch-oriented ingestion: data enters and moves as Arrow RecordBatches for throughput and compression (see the sketch after this list).
  • Dual WALs: a short-term WAL for immediate durability and acknowledgments, and a long-term WAL aggregated by time for restore and reindexing.
  • Actor-based concurrency: each subsystem runs as an actor that communicates over async channels.
  • Zero-copy data movement: components pass Arrow buffers by reference to avoid unnecessary copying.
  • Unified metadata: Postgres tracks batches, files and indexes for consistent state.
  • Object storage tiering: RAM and local disk serve as hot layers, cloud object store holds durable cold data.
  • Config-driven behavior: one YAML controls batching, dedup, WAL thresholds, Parquet size, indexing mode and cache tiers.
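
The batch, actor, and zero-copy concepts can be seen in a few lines of ordinary Rust. This is a minimal sketch using the stock `arrow` and `tokio` crates, not Openwit's internal code: a columnar RecordBatch is built, handed to a small actor over an async channel, and the clone that crosses the channel only copies Arc references to the column buffers.

```rust
// Minimal sketch, not Openwit internals. Assumes the `arrow` and `tokio`
// crates (tokio with the "rt-multi-thread" and "macros" features).
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Batch-oriented ingestion: events travel as a columnar RecordBatch.
    let schema = Arc::new(Schema::new(vec![
        Field::new("ts_ms", DataType::Int64, false),
        Field::new("message", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(vec![1_700_000_000_000_i64, 1_700_000_000_005])),
        Arc::new(StringArray::from(vec!["service started", "request handled"])),
    ];
    let batch = RecordBatch::try_new(schema, columns)?;

    // Actor-based concurrency: a small "storage" actor owns its loop and
    // receives work over an async channel.
    let (tx, mut rx) = mpsc::channel::<RecordBatch>(16);
    let storage_actor = tokio::spawn(async move {
        while let Some(b) = rx.recv().await {
            println!("storage actor received {} rows x {} columns", b.num_rows(), b.num_columns());
        }
    });

    // Zero-copy-style handoff: cloning a RecordBatch clones Arc references
    // to the column buffers, not the data itself.
    tx.send(batch.clone()).await.expect("storage actor is running");
    drop(tx); // closing the channel lets the actor loop exit
    storage_actor.await?;
    Ok(())
}
```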

Data Flow at a High Level

  1. Ingest: Producers send data through Kafka, gRPC or HTTP. The gateway authenticates, validates schema, normalizes fields, and batches events up to the configured thresholds.
  2. Write and deduplicate: Ingest deduplicates within a time window. It writes the batch to short-term WAL, records metadata in Postgres, and aggregates to long-term WAL.
  3. Convert and transfer: Ingest converts the batch to Arrow and streams it to Storage over Arrow Flight.
  4. Materialize and upload: Storage appends to an active Parquet file and rolls to a stable file at target size. It uploads the stable file to object storage and records the file path, size and time range in Postgres.
  5. Index: Indexer builds bitmap, bloom or loom, zonemap and Tantivy indexes for that file, uploads them, and links them in metadata.
  6. Query: Proxy receives a query and routes it to Search. Search consults metadata and indexes, prunes the file list, reads only the required Parquet from cache or cloud, runs the plan with DataFusion, and returns Arrow results (see the sketch after this list).
  7. Cache and retain: Cache keeps hot data in RAM and on disk for speed. Janitor enforces TTL and reconciles catalog entries with the object store.
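
As a self-contained illustration of steps 4 and 6, the sketch below writes a RecordBatch to a local Parquet file and queries it with DataFusion SQL. It uses the stock `arrow`, `parquet` and `datafusion` crates (via DataFusion's re-exports) with made-up file, table and column names; it is not Openwit's actual pipeline, which wraps these same building blocks with WALs, metadata, indexes and object storage.

```rust
// Minimal sketch of steps 4 and 6, not Openwit's pipeline. Uses DataFusion's
// re-exported `arrow` and `parquet` crates plus `tokio` for the async runtime.
use std::fs::File;
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Step 4 (materialize): write one small batch to a local Parquet file.
    // A real deployment appends many batches and rolls files at a target size.
    let schema = Arc::new(Schema::new(vec![
        Field::new("ts_ms", DataType::Int64, false),
        Field::new("level", DataType::Utf8, false),
        Field::new("message", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(vec![1_700_000_000_000_i64, 1_700_000_060_000])),
        Arc::new(StringArray::from(vec!["INFO", "ERROR"])),
        Arc::new(StringArray::from(vec!["service started", "request failed"])),
    ];
    let batch = RecordBatch::try_new(schema.clone(), columns)?;

    let file = File::create("logs.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?; // writes the footer; the file is now readable

    // Step 6 (query): register the file and run SQL through DataFusion.
    let ctx = SessionContext::new();
    ctx.register_parquet("logs", "logs.parquet", ParquetReadOptions::default())
        .await?;
    let sql = "SELECT level, count(*) AS n \
               FROM logs \
               WHERE ts_ms >= 1700000000000 \
               GROUP BY level";
    let df = ctx.sql(sql).await?;
    df.show().await?; // prints the Arrow result batches as a table
    Ok(())
}
```

In Openwit the same pattern runs across roles: Storage owns the Parquet write and upload, Search owns the DataFusion plan, and metadata plus indexes decide which files the query needs to touch at all.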