High Availability & Scalability
This page explains how OpenWit stays available during failures and how it scales in production. The design favors self-healing and stateless scaling, so nodes can be added or replaced without complex coordination.
Availability goals at a glance
OpenWit’s availability story combines fast failure detection, durable write paths, and consistent decisions at the control plane. The platform is designed to be self-healing and stateless at scale, which means failed workers can be replaced and traffic can be routed around them without manual intervention.
HA mechanisms and purposes
| Mechanism | Purpose |
|---|---|
| Gossip protocol (Chitchat) | Detects node liveness and shares cluster metadata |
| Postgres replication / HA | Keeps metadata redundant and consistent |
| Short-term WAL | Guarantees no data loss on an ingest crash |
| Object storage durability | Preserves Parquet and index files independent of compute |
| Control node quorum | Keeps cluster-wide decisions consistent |
| Retry and backoff | Retries gRPC and Arrow Flight calls on transient failures |
Failure detection and cluster state
Gossip protocol: Nodes exchange small heartbeats and metadata through the gossip layer (Chitchat) so the cluster can tell which peers are alive and route work to healthy ones. Gossip carries low-frequency cluster metadata and node health, and forms the base signal for self-healing.
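To make the liveness signal concrete, here is a minimal sketch of heartbeat-driven failure detection. The type names (`LivenessTracker`, `NodeId`) and the timeout value are illustrative assumptions, not OpenWit's actual internals.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical node identifier; OpenWit's real type may differ.
type NodeId = String;

/// Tracks the last heartbeat seen per node and derives liveness from it.
struct LivenessTracker {
    last_seen: HashMap<NodeId, Instant>,
    /// A node with no heartbeat within this window is considered failed.
    failure_timeout: Duration,
}

impl LivenessTracker {
    fn new(failure_timeout: Duration) -> Self {
        Self { last_seen: HashMap::new(), failure_timeout }
    }

    /// Called whenever a gossip heartbeat arrives from a peer.
    fn record_heartbeat(&mut self, node: NodeId) {
        self.last_seen.insert(node, Instant::now());
    }

    /// Nodes that missed the heartbeat window; candidates for rerouting.
    fn failed_nodes(&self) -> Vec<&NodeId> {
        let now = Instant::now();
        self.last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > self.failure_timeout)
            .map(|(node, _)| node)
            .collect()
    }
}

fn main() {
    let mut tracker = LivenessTracker::new(Duration::from_secs(10));
    tracker.record_heartbeat("ingest-1".to_string());
    // A routing layer would call failed_nodes() and steer work away from them.
    println!("failed: {:?}", tracker.failed_nodes());
}
```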
Control node quorum: The control plane relies on quorum to keep decisions consistent when nodes fail or partitions occur. Requiring a majority ensures cluster-wide decisions such as routing and orchestration stay consistent even during network turbulence.
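The core of any quorum rule is a strict-majority check. This is a minimal sketch of that check, not OpenWit's actual consensus machinery.

```rust
/// A decision commits only if a strict majority of control nodes agree.
fn has_quorum(acks: usize, cluster_size: usize) -> bool {
    acks > cluster_size / 2
}

fn main() {
    // With 5 control nodes, 3 acknowledgements form a quorum, so one
    // failed node (or a minority partition) cannot block or fork decisions.
    assert!(has_quorum(3, 5));
    assert!(!has_quorum(2, 5));
    println!("quorum checks passed");
}
```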
Durable write path and recovery
Short-term WAL: Every incoming batch is written to the short-term WAL before the system acknowledges success. This guarantees that an ingest crash does not lose data, since the WAL can be replayed on restart.
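The essential ordering is persist-then-acknowledge. The sketch below shows that pattern; the file framing and names are illustrative assumptions, not OpenWit's actual WAL layout.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

/// Minimal write-ahead log sketch: append the batch, flush it to disk,
/// and only then let the caller acknowledge the producer.
struct Wal {
    file: File,
}

impl Wal {
    fn open(path: &str) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file })
    }

    /// Durably persist one batch before acknowledging it.
    fn append(&mut self, batch: &[u8]) -> io::Result<()> {
        // Length-prefix each record so replay can re-frame the stream.
        self.file.write_all(&(batch.len() as u64).to_le_bytes())?;
        self.file.write_all(batch)?;
        // fsync: without this, a crash could lose an acknowledged batch.
        self.file.sync_all()
    }
}

fn main() -> io::Result<()> {
    let mut wal = Wal::open("ingest.wal")?;
    wal.append(b"example batch")?;
    // Only now is it safe to ack the producer; a crash after this point
    // is recovered by replaying ingest.wal.
    println!("batch acknowledged");
    Ok(())
}
```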
Object storage durability: After Parquet and index files reach a stable state, they are uploaded to the object store. Durability here is independent of compute nodes, which allows replacement or scaling without risking stored data.
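The roll-to-stable-then-upload sequence looks roughly like the sketch below. The `ObjectStore` trait, file names, and stand-in store are invented for illustration; OpenWit's real storage API differs.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical object-store client; a real deployment targets S3 or similar.
trait ObjectStore {
    fn put(&self, key: &str, bytes: &[u8]) -> io::Result<()>;
}

/// Stand-in store for the example.
struct PrintStore;

impl ObjectStore for PrintStore {
    fn put(&self, key: &str, bytes: &[u8]) -> io::Result<()> {
        println!("uploaded {} ({} bytes)", key, bytes.len());
        Ok(())
    }
}

/// Roll an active file to its stable name, then upload it. Once put()
/// succeeds, durability no longer depends on this compute node.
fn roll_and_upload(store: &impl ObjectStore, active: &Path, stable: &Path) -> io::Result<()> {
    // The rename marks the file immutable before the upload begins.
    fs::rename(active, stable)?;
    let bytes = fs::read(stable)?;
    let key = stable.file_name().unwrap().to_string_lossy();
    store.put(&key, &bytes)
}

fn main() -> io::Result<()> {
    fs::write("segment-0001.active", b"parquet bytes")?;
    roll_and_upload(
        &PrintStore,
        Path::new("segment-0001.active"),
        Path::new("segment-0001.parquet"),
    )
}
```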
Postgres replication / HA: The metadata catalog uses Postgres. Replication or an HA setup keeps file and index records redundant and consistent, so the query path can always discover which objects to read even when a database node fails.
Retry and backoff: All gRPC and Arrow Flight calls retry with exponential backoff on transient failures. This masks brief network issues and helps the system continue without operator intervention.
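A generic retry wrapper captures the idea. The attempt count and base delay below are illustrative, not OpenWit's actual retry policy, and the closure stands in for a real gRPC or Arrow Flight call.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible call with exponential backoff on transient failures.
fn retry_with_backoff<T, E, F>(mut call: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Transient failure: wait, then double the delay and retry.
                sleep(delay);
                delay *= 2;
                attempt += 1;
            }
        }
    }
}

fn main() {
    // A real caller would wrap a gRPC or Arrow Flight request here.
    let mut tries = 0;
    let result: Result<&str, &str> = retry_with_backoff(
        || {
            tries += 1;
            if tries < 3 { Err("transient") } else { Ok("ok") }
        },
        5,
    );
    println!("result after {} tries: {:?}", tries, result);
}
```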
Scalability Model
OpenWit scales horizontally per role. Each node type has a clear scaling trigger so you can add capacity where it is needed without overprovisioning other paths.
Role-based scaling strategies
| Node type | Scale when | Notes |
|---|---|---|
| Ingest | Ingestion rate increases or you add Kafka partitions or gRPC producers | Add ingest workers to match producer fan-out |
| Storage | WAL write load grows or Parquet uploads lag | Add storage workers to keep active files rolling to stable on time |
| Indexer | Indexing queue depth rises | Add indexers so index build and upload keep pace with storage |
| Search | Query concurrency or latency targets slip | Add search workers to raise parallelism |
| Cache | Cache hit ratio drops | Add cache nodes to keep hot data near the query path |
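As a rough illustration of how one of these triggers translates into capacity, the sketch below sizes the indexer pool from queue depth. The function name and throughput figures are invented for the example, not OpenWit parameters.

```rust
/// Back-of-the-envelope sizing for the indexer row of the table above:
/// enough workers to drain the backlog within a target window.
fn desired_indexers(queue_depth: u64, docs_per_worker_per_sec: u64, drain_window_secs: u64) -> u64 {
    let per_worker = docs_per_worker_per_sec * drain_window_secs;
    // Ceiling division with a floor of one worker.
    ((queue_depth + per_worker - 1) / per_worker).max(1)
}

fn main() {
    // 1.2M queued docs, 5k docs/s per worker, drain the backlog within 60s.
    println!("indexers needed: {}", desired_indexers(1_200_000, 5_000, 60));
}
```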
Control-aware routing and traffic shaping
Scaling is reinforced by control-plane features that keep traffic balanced and shard work predictably. Four mechanisms can be enabled as you grow:
- Control-plane aware load balancing so routing decisions reflect real health and capacity.
- Weighted round-robin in the Proxy so heavier nodes receive more queries and lighter nodes receive fewer (see the sketch after this list).
- Shard-aware partitioning by time so data and queries land on the right workers with minimal overlap (also covered in the sketch below).
- Distributed query execution with Ballista so SQL plans fan out across multiple workers when needed.
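Both the weighted round-robin and the time-based sharding bullets are easy to picture in code. The sketch below uses the smooth weighted round-robin variant (the scheme nginx popularized) and a simple window-modulo shard function; the struct names, weights, and window size are illustrative assumptions, not the Proxy's actual implementation.

```rust
/// Smooth weighted round-robin: each pick, every node gains its weight
/// in "current" credit; the highest-credit node wins and pays back the
/// total weight, spreading heavy nodes' turns evenly over time.
struct WeightedRoundRobin {
    nodes: Vec<(String, i64)>, // (node id, configured weight)
    current: Vec<i64>,         // running credit per node
}

impl WeightedRoundRobin {
    fn new(nodes: Vec<(String, i64)>) -> Self {
        let current = vec![0; nodes.len()];
        Self { nodes, current }
    }

    /// Pick the next node; heavier weights are chosen proportionally more.
    fn next(&mut self) -> &str {
        let total: i64 = self.nodes.iter().map(|(_, w)| w).sum();
        for (i, (_, w)) in self.nodes.iter().enumerate() {
            self.current[i] += w;
        }
        let best = (0..self.nodes.len())
            .max_by_key(|&i| self.current[i])
            .expect("at least one node");
        self.current[best] -= total;
        &self.nodes[best].0
    }
}

/// Time-based shard routing: a timestamp maps to the shard owning its
/// window, so a query over a range touches only the overlapping shards.
fn shard_for(timestamp_secs: u64, window_secs: u64, num_shards: u64) -> u64 {
    (timestamp_secs / window_secs) % num_shards
}

fn main() {
    // A node with weight 3 should receive three times the queries of weight 1.
    let mut rr = WeightedRoundRobin::new(vec![
        ("search-big".into(), 3),
        ("search-small".into(), 1),
    ]);
    for _ in 0..4 {
        println!("route query to {}", rr.next());
    }
    println!("shard: {}", shard_for(1_700_000_000, 3_600, 8));
}
```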
How HA and scale play together along the path
- Ingestion is protected by short-term WAL and can be scaled by adding ingest nodes that read more Kafka partitions or accept more gRPC producers. Gossip and control quorum notice new capacity and route accordingly.
- Storage remains available by keeping stable files in object storage and metadata in Postgres HA while you add storage nodes to remove upload lag.
- Indexing stays current by scaling indexers with queue depth. Index files land in the same durable store so failure or replacement does not risk searchability.
- Search holds its SLOs by adding workers when concurrency rises. Weighted round-robin in the Proxy and time-based sharding help keep load balanced.