Architecture Overview
OpenWit accepts telemetry through three ingress methods: Kafka, gRPC, and HTTP. All three feed the same gateway, which authenticates the caller, validates and normalizes payloads, batches records by size or time, then forwards bulk batches to the ingest node over gRPC. The ingest node deduplicates within a window, writes a short-term and a long-term WAL for durability, converts accepted batches to Arrow RecordBatch, and hands them to storage over Arrow Flight. This section explains how each method fits that shared path, how to configure it, and what to watch in production.
The common ingestion path
- Client sends data through Kafka, gRPC or HTTP.
- The gateway authenticates, enforces quotas, validates schema, normalizes field names and types, then buffers into batches until batch_size or batch_timeout_ms triggers.
- The gateway sends a bulk proto to the ingest node over gRPC.
- The ingest node deduplicates in a configured window, checks buffer ceilings, and writes both short-term and long-term WAL.
- The client is acknowledged, or the Kafka offset is committed, only after the short-term WAL write succeeds.
- The batch is converted to Arrow and sent to storage over Arrow Flight. Keep batch_size × avg_event_size safely below your gRPC message size limit with headroom.
Gateway knobs: ingestion.sources.* to enable sources, ingestion.batching.batch_size, ingestion.batching.batch_timeout_ms, plus proxy caps like proxy.http.max_payload_size_mb and proxy.grpc.max_concurrent_streams.
Buffering and dedup: processing.buffer.max_size_messages, processing.buffer.max_size_bytes, processing.buffer.flush_interval_seconds, processing.flush_size_messages, processing.dedup_enabled, processing.dedup_window_seconds. Watch buffer queue depth and memory to catch congestion early.
Durability: short-term WAL before ack, plus a long-term WAL stream for replay and indexing. processing.wal.sync_on_write trades strict durability for throughput.
When to choose which method
| Scenario | Recommended method | Why it fits |
|---|---|---|
| Large and continuous telemetry streams | Kafka | Production-grade throughput with consumer and fetch tuning. |
| Direct app to OpenWit using OTLP or a custom proto | gRPC | Accepts OpenTelemetry collectors and custom exporters with clear size and concurrency controls. |
| Local tests and pipeline canaries | HTTP | Lightweight JSON endpoint to validate the downstream path. Not recommended for sustained high throughput. |
Kafka ingestion
Purpose and fit: Kafka is the preferred entry for sustained high-volume streams. OpenWit runs librdkafka consumers that read protobuf messages from topics, then pass them to the shared gateway.
Core configuration: Enable Kafka under ingestion.sources.kafka. Then tune the Kafka block: brokers, group id, topic patterns, and the consumer profile. High-performance settings expose parallel consumers, processing threads, parallel gRPC clients, and per-thread channel buffer size.
Advanced fetch sizes increase throughput but raise per-thread memory demand. Memory optimizations such as zero copy, preallocation, and a buffer pool trade memory for speed. Batch size and timeout determine how many events are combined before forwarding. Pick the telemetry type to select parsers and index strategies.
| Key | What it controls |
|---|---|
brokers, group_id, topics | Source brokers and subscription patterns. |
consumer_type | Standard or high-performance path. |
high_performance.num_consumers, processing_threads, num_grpc_clients, channel_buffer_size | Parallelism and inter-thread capacity. |
fetch_min_bytes, fetch_max_bytes, max_partition_fetch_bytes | Throughput versus memory tradeoff. |
enable_zero_copy, preallocate_buffers, buffer_pool_size | Lower CPU copies at the cost of a higher memory footprint. |
batch_size, batch_timeout_ms | Flush triggers at the gateway and consumer layer. Start conservative, then ramp. |
telemetry_type | Hints for parser and index strategy. |
Operational guardrails: Batch bytes must fit under the gRPC message cap and inside buffer ceilings. Adjust batch_size with event size and keep a margin for Arrow metadata. The gateway can pause Kafka consumption if backpressure is raised. Offsets are committed only after the short-term WAL is persisted.
Example values from a working config: Eight consumers with sixteen processing threads and four gRPC clients per process. Fetch min 1 MiB, fetch max 50 MiB, per partition 10 MiB. Batch size 1000 with 5s timeout to stay under 4 MiB gRPC limits. Zero copy and preallocation enabled.
gRPC ingestion
Purpose and fit: The gRPC entry accepts OpenTelemetry collectors and custom exporters. It is a direct path from applications with explicit controls for concurrency, size limits, connection health, compression, and TLS.
Core configuration: Enable gRPC under ingestion.sources.grpc. Bind address, port, and runtime thread count control server capacity. Concurrency and pooling limits cap total requests and connections.
Message size limits must be set with care. Keepalive and connection age options avoid stuck connections. You can toggle OTLP signals for traces, metrics, or logs. Compression reduces bandwidth at CPU cost. TLS should be on in production.
| Key | What it controls |
|---|---|
bind, port, runtime_size | Network binding and runtime threads. |
max_concurrent_requests, connection_pool_size, worker_threads, max_concurrent_streams | Total concurrency across requests and streams. |
max_message_size, max_receive_message_size_mb, max_send_message_size_mb | Caps on message size. High values allow large batches but raise memory risk. |
keepalive_*, max_connection_* | Connection liveness and lifecycle. |
otlp_*_enabled | Which OTLP signals are accepted. |
compression_* | Network bandwidth versus CPU tradeoff. |
tls.enabled, cert_path, key_path | Transport security. Off by default in the sample configuration which is a risk in production. |
Operational guardrails: Batch bytes must stay below the gRPC message caps. Leave space for Arrow metadata that is added before Arrow Flight. If overload occurs, the gateway returns 429 or pauses upstream reads.
HTTP ingestion
Purpose and fit: The HTTP endpoint accepts JSON payloads with an OTLP-style schema for logs, metrics or traces. It is intended for canaries and pipeline validation rather than continuous high load.
Core configuration: Enable under ingestion.sources.http. Set port, max_concurrent_requests, request_timeout_ms, and max_payload_size_mb. Upstream pool knobs control idle connections and timeouts. Use conservative payload and concurrency limits unless you have headroom on memory and CPU.
Operational guardrails: High concurrency and large payload caps can quickly push gRPC and downstream limits. Keep payloads small for tests and prefer Kafka or gRPC for production throughput.
What backpressure and acks mean for all methods
- The gateway applies backpressure when buffers or downstream components reach limits. With Kafka, it can pause consumption. With gRPC or HTTP, it returns 429
- The ingest node acknowledges only after the short-term WAL is written. Long term WAL is published in parallel for replay and indexing.
Observability signals to emit
Emit counters, gauges, and spans at each lifecycle stage to see capacity and bottlenecks:
event.ingest.receivedandevent.ingest.validatedat the gatewaybuffer.sizeandbuffer_fullto track pressurebatch.createdplusbatch.size_byteshistogramwal.write.latencyandack.latencyarrow.convert.latencyandarrow.send.latencystorage.append.latency,parquet.upload.latency,index.build.latencyandindex.lag_secondsTrace spans should includeingest.receive,batch.build,wal.write,arrow.convert,arrow.send,storage.append,upload,index.build.