Data Model
This page explains how OpenWit represents data from the moment it enters the system to the moment a query reads it. The model is columnar and batch-oriented. Instead of appending single rows, OpenWit accepts data in batches that map cleanly to Arrow in memory and Parquet on storage. This keeps I/O predictable and makes vectorized execution possible during queries.
Core Unit: the Batch
A batch is the atomic unit that OpenWit validates, persists, converts and indexes. Each batch is a collection of rows that share one dataset schema. The batch carries enough metadata to describe what it contains and where it belongs in time. At minimum this includes the schema ID and version, a start and end timestamp that bound the rows, the row count and byte size, and source metadata such as ingestion type, offset and tenant ID.
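A minimal sketch of that per-batch metadata as a Rust struct; the type and field names are illustrative assumptions, not OpenWit's actual definitions:

```rust
/// Illustrative per-batch metadata; all names here are assumptions.
struct BatchMetadata {
    schema_id: u64,         // dataset schema the rows conform to
    schema_version: u32,    // version of that schema
    start_ts_us: i64,       // earliest row timestamp, microseconds since epoch
    end_ts_us: i64,         // latest row timestamp, bounding the batch in time
    row_count: u64,         // number of rows in the batch
    byte_size: u64,         // encoded size of the payload
    ingestion_type: String, // source kind, e.g. "kafka", "grpc", "http"
    source_offset: i64,     // position within the source, e.g. a Kafka offset
    tenant_id: String,      // tenant that owns the data
}
```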
When a batch arrives it follows a consistent lifecycle. First it is written to the short-term WAL for immediate durability, then aggregated into the long-term WAL to support recovery and heavy background tasks. The ingested payload is converted to Arrow RecordBatches, which are appended to an active Parquet file; when the file reaches a configured threshold it is rolled and finalized as stable Parquet. Stable files are uploaded to object storage. Finally, index artifacts are generated and the batch, file and index records are written to Postgres to keep lineage complete.
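The same lifecycle, condensed into a Rust enum for reference; the stage names are assumptions chosen for illustration:

```rust
/// Illustrative lifecycle stages for a batch; stage names are assumptions.
enum BatchStage {
    ShortTermWal,  // written immediately for durability
    LongTermWal,   // aggregated for recovery and heavy background tasks
    ArrowInMemory, // payload converted to Arrow RecordBatches
    ActiveParquet, // appended into the currently open Parquet file
    StableParquet, // file rolled and finalized at a threshold
    Uploaded,      // stable file pushed to object storage
    Indexed,       // index artifacts built, records written to Postgres
}
```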
Dataset Schema
Every dataset has a fixed schema, either inferred or explicitly defined. Schemas provide consistent types so queries are stable and predictable. If data with mismatched types arrives, the gateway coerces values where coercion is allowed, or rejects the input to preserve schema integrity. This protects columnar layout quality and keeps query plans simple.
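A minimal sketch of that coerce-or-reject decision, assuming JSON input and Arrow target types; the function and its rules are illustrative, not OpenWit's actual gateway logic:

```rust
use arrow::datatypes::DataType;
use serde_json::Value;

/// Illustrative coerce-or-reject check; OpenWit's real gateway rules differ.
fn coerce(value: &Value, target: &DataType) -> Result<Value, String> {
    match (value, target) {
        // Already the right shape: pass through unchanged.
        (Value::String(_), DataType::Utf8) => Ok(value.clone()),
        (Value::Number(_), DataType::Int64 | DataType::Float64) => Ok(value.clone()),
        // An allowed coercion: a numeric string into an integer column.
        (Value::String(s), DataType::Int64) => s
            .parse::<i64>()
            .map(Value::from)
            .map_err(|_| format!("cannot coerce {s:?} to Int64")),
        // Everything else is rejected to preserve schema integrity.
        _ => Err(format!("type mismatch: {value} vs {target:?}")),
    }
}
```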
Example schema for a logs dataset:
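The sketch below renders one plausible version using Arrow's Rust API; the field names and types are illustrative assumptions, not a schema OpenWit ships:

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

/// Illustrative logs schema; field names and types are assumptions.
fn logs_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        // Timestamps bound each batch and drive time-range pruning.
        Field::new("timestamp", DataType::Timestamp(TimeUnit::Microsecond, None), false),
        Field::new("level", DataType::Utf8, false),     // categorical, bitmap-friendly
        Field::new("service", DataType::Utf8, false),   // emitting service name
        Field::new("message", DataType::Utf8, true),    // full-text searchable body
        Field::new("tenant_id", DataType::Utf8, false), // multi-tenant isolation key
    ]))
}
```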
OpenWit enforces this schema across all ingestion sources. This means Kafka, gRPC and HTTP inputs must shape their payloads to the same field names and types so the system remains consistent end to end.
Metadata Model
Postgres is the source of truth for batch, file and index metadata. The catalog links ingestion to storage to indexing so a query can always discover which files to read and operators can trace any result back to its origin. Three core tables cover the model.
Core tables
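A plausible reconstruction of those three tables; the table and column names are assumptions, not OpenWit's actual DDL:

| Table | Purpose | Key columns (illustrative) |
| --- | --- | --- |
| `batches` | One row per ingested batch: schema ID and version, time bounds, row count, source metadata | `batch_id` (PK) |
| `files` | One row per Parquet file, linked to the batch or batches it contains | `file_id` (PK), `batch_id` (FK) |
| `indexes` | One row per index artifact, linked to the Parquet file it covers | `index_id` (PK), `file_id` (FK) |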
This foreign key design provides full traceability from ingestion through storage and indexing to the query layer, which is required for pruning and for operational reconciliation.
Storage Locations
OpenWit separates logical storage into layers. Each layer has a purpose, a default location for local runs and a backing medium. This layout is configurable through YAML and makes durability and cost easy to reason about.
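An illustrative layout of those layers; the purposes and media follow from the lifecycle above, while the default paths are examples only, not OpenWit's shipped defaults:

| Layer | Purpose | Default local location (illustrative) | Medium |
| --- | --- | --- | --- |
| Short-term WAL | Immediate durability for incoming batches | `./data/wal/short` | Local disk |
| Long-term WAL | Recovery and heavy background tasks | `./data/wal/long` | Local disk |
| Active Parquet | Open file receiving Arrow batches | `./data/parquet/active` | Local disk |
| Stable Parquet | Finalized, immutable query data | Bucket and prefix from config | Object storage |
| Index artifacts | Pruning and search structures per file | Bucket and prefix from config | Object storage |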
Data flows from memory to local disk to object storage, so hot data stays fast and cold data stays durable and cost-efficient.
Indexing Model
When a stable Parquet file is ready, OpenWit creates one or more index artifacts for that file. Index types cover different query patterns: bitmap indexes help equality filters on categorical fields, zonemaps help numeric and time range pruning, Bloom or Loom filters help membership checks, and Tantivy enables full-text search over messages. Index files are uploaded to object storage and linked in metadata to the corresponding Parquet file and batch ID. These links let the search node download only the relevant index files for pruning.
Index types and use
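The mapping, as described above:

| Index type | Query pattern it serves |
| --- | --- |
| Bitmap | Equality filters on categorical fields |
| Zonemap | Numeric and time range pruning |
| Bloom / Loom filter | Membership checks |
| Tantivy | Full-text search over messages |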
Query Execution at a Glance
Queries run through two engines that work together: SQL plans execute with DataFusion, while full-text queries execute with Tantivy. The search node begins by consulting the metadata catalog in Postgres to find candidate files. It prunes by time range, fetches the index files it needs, uses those indexes to skip irrelevant Parquet files and reads only what is required. The final plan is executed and the result is returned as an Arrow columnar payload to the proxy or client. This gives OLAP-speed analytics with search-like flexibility in one system.
Execution steps (a minimal code sketch follows the list):
- Look up candidate files in Postgres using the requested time range.
- Fetch the required index files for those candidates.
- Use indexes to skip files and row groups that do not match.
- Execute the query plan in DataFusion.
- Return Arrow results to the client.
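A minimal sketch of the final two steps using DataFusion's public API; the table name, file path and query are illustrative, and the real search node would register only the object-store files that survived pruning:

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register only the Parquet files that survived metadata and index pruning.
    // The local path here is illustrative.
    ctx.register_parquet("logs", "data/logs-000123.parquet", ParquetReadOptions::default())
        .await?;

    // Execute the SQL plan; the projection and aggregation are illustrative.
    let df = ctx
        .sql("SELECT level, COUNT(*) AS n FROM logs GROUP BY level")
        .await?;

    // Results come back as Arrow RecordBatches, matching the columnar
    // payload OpenWit returns to the proxy or client.
    let batches = df.collect().await?;
    println!("{} result batch(es)", batches.len());
    Ok(())
}
```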