Executive Summary

The ingestion pipeline is the system’s write path for logs, traces, and metrics. It accepts data at a high rate from multiple frontends, validates and normalizes records, batches them for efficient I/O, deduplicates within a time window, persists to dual WALs for durability, and then hands off accepted batches to storage in Arrow format. Flow control and backpressure keep the cluster stable under bursty load.

Entry points

OpenWit supports three ingress options so teams can plug in with what they already run:

  • Kafka via librdkafka consumers for production-scale streams
  • gRPC for OTLP or custom exporters
  • HTTP for lightweight testing or pipeline validation

All three feed a common gateway before anything enters the main system.
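
As a quick illustration of the HTTP path, the sketch below posts a single log record for pipeline validation. The endpoint path, port, and header names are assumptions for illustration, not the documented API surface; substitute whatever your deployment exposes.

```python
# Minimal HTTP ingest check (sketch); URL and headers are illustrative assumptions.
import json
import urllib.request

record = {
    "timestamp": "2024-01-01T00:00:00Z",
    "severity": "INFO",
    "body": "pipeline validation event",
    "attributes": {"service.name": "smoke-test"},
}

req = urllib.request.Request(
    "http://localhost:8080/ingest/logs",          # hypothetical endpoint
    data=json.dumps(record).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <write-token>",  # gateway validates header tokens
    },
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status)  # expect a 2xx once the gateway accepts the record
```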

Ingestion gateway

Every request first passes through a gateway that enforces write access, validates the schema, and normalizes payloads into a consistent shape. It validates header tokens, rejects malformed or inconsistent types, normalizes key names and ordering, and then batches events by size or time. Batch sizing is controlled in the YAML with ingestion.batching.batch_size and ingestion.batching.batch_timeout_ms. Batches are kept as JSON or Arrow buffers until they are serialized into gRPC messages.
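
To make the size-or-time flush rule concrete, here is a minimal sketch of a batcher driven by those two settings. Only the configuration keys come from this document; the class and constant names are illustrative.

```python
import time

# Values mirroring ingestion.batching.batch_size and
# ingestion.batching.batch_timeout_ms from the YAML config (illustrative defaults).
BATCH_SIZE = 1000
BATCH_TIMEOUT_MS = 500

class Batcher:
    """Accumulates events and flushes on count or age, whichever comes first."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.events = []
        self.opened_at = time.monotonic()

    def add(self, event):
        if not self.events:
            self.opened_at = time.monotonic()
        self.events.append(event)
        self._maybe_flush()

    def _maybe_flush(self):
        age_ms = (time.monotonic() - self.opened_at) * 1000
        if len(self.events) >= BATCH_SIZE or age_ms >= BATCH_TIMEOUT_MS:
            self.flush_fn(self.events)  # hand the batch to the gRPC sender
            self.events = []
```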

Ingest node

The gateway sends ready batches to the ingest node in bulk over gRPC. The ingest node is built to keep throughput high and latency low: it uses zero-copy handoffs to avoid duplicating buffers, runs deterministic deduplication within a configured time window (sketched after the list below), and writes to two WALs:

  • Short-term WAL written before client ACK or Kafka commit to guarantee immediate durability
  • Long-term WAL that aggregates by day for restore and indexing
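
A minimal sketch of deterministic, windowed deduplication: each record is keyed by a stable content hash, and the key is only remembered for the configured window. The hashing scheme and window length here are assumptions for illustration.

```python
import hashlib
import json
import time

DEDUP_WINDOW_SECONDS = 60  # illustrative; the real window is configured

class WindowDeduplicator:
    """Drops records whose content hash was already seen inside the window."""

    def __init__(self, window_s=DEDUP_WINDOW_SECONDS):
        self.window_s = window_s
        self.seen = {}  # content hash -> first-seen timestamp

    def _key(self, record: dict) -> str:
        # Canonical JSON keeps the hash deterministic across key ordering.
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def accept(self, record: dict) -> bool:
        now = time.monotonic()
        # Evict entries that have aged out of the window.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.window_s}
        key = self._key(record)
        if key in self.seen:
            return False  # duplicate within the window: drop it
        self.seen[key] = now
        return True
```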

A batch is acknowledged only after short-term WAL persistence. When the downstream slows, the node applies backpressure to protect the cluster.
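
The ordering constraint, persist to the short-term WAL and make it durable before acknowledging, can be sketched as follows. The file layout and function names are hypothetical; the point is the fsync-before-ACK sequence.

```python
import json
import os

def persist_then_ack(batch, wal_path, ack):
    """Append a batch to the short-term WAL, force it to disk, then ACK."""
    line = json.dumps(batch) + "\n"
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(line)
        wal.flush()
        os.fsync(wal.fileno())  # durability point: the batch is on disk
    ack(batch["batch_id"])      # only now is the client ACK / Kafka commit safe
```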

Handoff to storage

After the WAL step, the ingest node converts each batch into an Arrow RecordBatch and transmits it to a storage endpoint using Arrow Flight. Arrow Flight gives columnar, high-throughput transport with direct buffer reuse, so inter-node transfer stays efficient.
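
A minimal pyarrow sketch of this handoff: build an Arrow RecordBatch from an accepted batch and stream it to a Flight endpoint with do_put. The endpoint address and descriptor path are assumptions; only the RecordBatch-over-Arrow-Flight transport is from the document.

```python
import pyarrow as pa
import pyarrow.flight as flight

# Columns from an accepted (deduplicated, WAL-persisted) batch.
batch = pa.RecordBatch.from_arrays(
    [
        pa.array([1_700_000_000_000, 1_700_000_000_001], type=pa.int64()),
        pa.array(["INFO", "WARN"]),
        pa.array(["hello", "slow consumer"]),
    ],
    names=["timestamp", "severity", "body"],
)

client = flight.FlightClient("grpc://storage-node:8815")      # hypothetical address
descriptor = flight.FlightDescriptor.for_path("ingest/logs")  # hypothetical path

# do_put streams the columnar buffers to the storage endpoint.
writer, _ = client.do_put(descriptor, batch.schema)
writer.write_batch(batch)
writer.close()
```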

Fault tolerance and coordination

If a node fails mid-transfer, the WALs allow replay of any in-flight batches. The control plane watches ingest health over gossip and can reroute producers when a node is unresponsive. Batch IDs and WAL offsets are tracked in Postgres, which serves as the system's source of truth for metadata.
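
As an illustration of that tracking, the sketch below records a batch ID and its WAL offset in Postgres with psycopg2. The table and column names are hypothetical; the document only states that batch IDs and WAL offsets live in Postgres.

```python
import psycopg2

def record_batch_offset(conn, batch_id: str, wal_offset: int) -> None:
    """Persist the batch's WAL offset so replay can resume after a failure."""
    with conn, conn.cursor() as cur:
        cur.execute(
            # Hypothetical table; not the actual metadata schema.
            """
            INSERT INTO ingest_batches (batch_id, wal_offset, state)
            VALUES (%s, %s, 'persisted')
            ON CONFLICT (batch_id) DO UPDATE SET wal_offset = EXCLUDED.wal_offset
            """,
            (batch_id, wal_offset),
        )

# conn = psycopg2.connect("dbname=openwit_meta user=openwit")  # connection details vary
```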

Operating characteristics

The pipeline is engineered for very high message rates with bounded memory and fast recovery. It relies on zero-copy handoffs, lock-free queues, per-thread batching, and the dual WAL writer to meet durability goals without sacrificing throughput. Backpressure engages when limits are reached, so the system remains predictable under load.
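
One way to picture the backpressure behavior: stages hand batches to one another through bounded queues, and a full queue blocks or rejects producers instead of growing memory without limit. The capacity and timeout below are illustrative, not actual limits.

```python
import queue

# Bounded handoff between pipeline stages; capacity is illustrative.
handoff: "queue.Queue[dict]" = queue.Queue(maxsize=1024)

def submit_batch(batch: dict) -> bool:
    """Producer side: block briefly, then signal backpressure to the caller."""
    try:
        handoff.put(batch, timeout=0.5)  # blocks while the consumer is behind
        return True
    except queue.Full:
        return False  # caller slows down, retries, or sheds load upstream
```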

Observability

Ingestion exposes traces and metrics so you can see each stage doing its work. Typical spans cover the plan and transfer steps, and standard metrics include latency, backlog, bytes processed, rows accepted, and dedup hits. These give clear signals for tuning batch size, WAL settings, and buffer limits.
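
A minimal OpenTelemetry-style sketch of what instrumentation around one stage can look like; the span and metric names here are illustrative, not the exact names the pipeline emits.

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("openwit.ingest")
meter = metrics.get_meter("openwit.ingest")

# Illustrative instrument names; the real counters may differ.
rows_accepted = meter.create_counter("ingest_rows_accepted")
dedup_hits = meter.create_counter("ingest_dedup_hits")

def process_batch(batch, deduper):
    """Run one transfer stage under a span; deduper is any object with accept(record)."""
    with tracer.start_as_current_span("ingest.transfer"):
        accepted = [r for r in batch if deduper.accept(r)]
        rows_accepted.add(len(accepted))
        dedup_hits.add(len(batch) - len(accepted))
        return accepted
```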