Storage Location

This page explains where OpenWit stores data at each stage and why the layers are separated. The goal is to make durability, performance and cost easy to reason about. OpenWit separates logical storage into distinct layers for metadata, WAL, Parquet and index files. All of these are configurable through YAML.

Storage Layers at a Glance

The table below lists each layer, what it is used for, the local default path for developer runs, and the underlying medium.

  Layer            Purpose                                         Local default            Medium
  Short-term WAL   First durable write at ingest                   ./data/wal/short         Local disk (hot)
  Long-term WAL    Daily aggregation of WAL entries                ./data/wal/long          Local disk (warm)
  Active Parquet   Rolling buffer for incoming batches             ./data/storage/active    Local disk (hot)
  Stable Parquet   Finalized files awaiting upload                 ./data/storage/stable    Local disk
  Object Store     Durable home for Parquet and index artifacts    s3://bucket/path         Cloud object storage (cold)
  Metadata Store   File and index records, time ranges, versions   local Postgres           Postgres database

How Data Moves Across the Layers

  1. Short-term WAL: When the ingest node receives a batch, it writes the batch to the short-term WAL for immediate durability. This write is part of the normal ingest flow and is the first durable hop on local disk. The short-term WAL is the hot layer and defaults to ./data/wal/short for local runs; a configuration sketch for the local layers follows this list.
  2. Long-term WAL: Batches are then aggregated into the long-term WAL. This is a warm layer that organizes WAL entries on a daily basis to preserve durability at a coarser granularity. Its default path is ./data/wal/long.
  3. Active Parquet: Arrow RecordBatches from ingest are merged into an active Parquet file on the storage node. This file acts like a rolling buffer while data is still coming in. The default path is ./data/storage/active.
  4. Stable Parquet: When the active file reaches the target size it is finalized as stable Parquet. This is the handoff point to durable cloud storage. The default local path is ./data/storage/stable.
  5. Object Store: Stable Parquet files are uploaded to object storage using OpenDAL. Supported providers include S3, Azure and GCS. In the docs and examples, object store locations appear as paths like s3://bucket/path.
  6. Indexes and catalog updates: After upload the indexer builds index files for the Parquet data and uploads those files to object storage as well. The system writes file and index records to the metadata catalog.
  7. Metadata Store: Postgres is the source of truth for file and index records along with time ranges and versions. The catalog allows search to find the correct files and enables complete lineage from ingestion to query.
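
As a concrete reference, the sketch below shows how the local layers from steps 1 through 4 might be laid out in YAML for a developer run. The section and key names (storage, wal, parquet, short_path and so on) are illustrative assumptions, not OpenWit's actual schema; only the default paths are taken from this page.

  # Hypothetical config layout for a local developer run.
  # Key names are illustrative; the paths are the documented defaults.
  storage:
    wal:
      short_path: ./data/wal/short          # step 1: first durable write at ingest
      long_path: ./data/wal/long            # step 2: daily aggregation
    parquet:
      active_path: ./data/storage/active    # step 3: rolling buffer
      stable_path: ./data/storage/stable    # step 4: finalized, ready for upload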

What Each Layer Is Responsible For

Short-term WAL

This is the first durable write for a batch and exists to provide immediate durability at ingest time. It is the hot path and uses local disk by default. The role is to secure the batch before any downstream processing proceeds.

Long-term WAL

This layer aggregates WAL data at a daily cadence to keep a warm copy that is still local but organized for background operations. It complements the short-term WAL and extends durability beyond the first write.

Active Parquet

The storage node merges incoming Arrow batches into an active Parquet file. This file is hot and changes as new batches arrive. It serves as the real-time buffer on the way to a finalized Parquet artifact.

Stable Parquet

When the active file reaches the configured target size it is finalized as a stable Parquet file. Stable files are ready for durable upload and for index builds. They live under the local ./data/storage/stable prefix until uploaded.
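
The target size itself is a configuration value. A hedged sketch of what the setting could look like, reusing the hypothetical storage.parquet section from the sketch above; the key name target_file_size_mb and the value are illustrative, not OpenWit's actual defaults:

  # Hypothetical key and value: the real name and default come from
  # OpenWit's config schema.
  storage:
    parquet:
      stable_path: ./data/storage/stable
      target_file_size_mb: 256    # finalize the active file once it reaches this size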

Object Store

Stable Parquet files are uploaded via OpenDAL to your cloud object store. Supported providers noted in the docs are S3, Azure and GCS. Object storage is the cold tier and is the durable home for Parquet and index artifacts.
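
Because uploads go through OpenDAL, an S3 destination is usually described by a bucket, an optional key prefix, a region and credentials. The sketch below uses OpenDAL's common S3 parameter names (bucket, root, region, access_key_id, secret_access_key); the object_store section name and how OpenWit nests these keys in its YAML are assumptions.

  # Sketch of an S3 destination. Parameter names follow OpenDAL's S3
  # service config; the surrounding structure is hypothetical.
  object_store:
    provider: s3
    bucket: my-openwit-bucket                    # placeholder bucket name
    root: /openwit                               # key prefix inside the bucket
    region: us-east-1
    access_key_id: ${AWS_ACCESS_KEY_ID}          # read from the environment
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}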

Metadata Store

The metadata catalog in Postgres records every Parquet file and every index artifact, along with time ranges and versions. This catalog is the single source of truth that search consults to prune by time and to resolve the exact files to read.
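
Since the catalog lives in Postgres, pointing OpenWit at it amounts to supplying a connection. A minimal sketch, assuming a metadata section and a standard Postgres connection URL; the section name, credentials and database name are placeholders:

  # Hypothetical section name; the value is a standard Postgres DSN.
  metadata:
    postgres_url: postgres://openwit:password@localhost:5432/openwit_catalog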

Cache Tiers and Where They Live

Caching is a separate concern that improves read performance by keeping hot data close to the query path.

  • Electro (RAM) for the hottest data with nanosecond access.
  • Pyro (SSD) for warm data with microsecond access.
  • Cryo (Object store) for cold data with second-level access.

These tiers are configurable and map to RAM, local SSD and the object store respectively.
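
A sketch of how the three tiers might be expressed in YAML. The tier names electro, pyro and cryo come from this page; the keys, paths and size budgets are illustrative assumptions:

  # Hypothetical cache config. Tier names are documented; everything
  # else is illustrative.
  cache:
    electro:
      medium: ram
      max_size_mb: 1024             # budget for the hottest data
    pyro:
      medium: ssd
      path: ./data/cache/pyro       # hypothetical local SSD path
      max_size_gb: 50
    cryo:
      medium: object_store          # cold reads fall through to the object store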