Architecture Overview

OpenWit accepts telemetry through three ingress methods: Kafka, gRPC, and HTTP. All three feed the same gateway, which authenticates the caller, validates and normalizes payloads, batches records by size or time, then forwards bulk batches to the ingest node over gRPC. The ingest node deduplicates within a window, writes a short-term and a long-term WAL for durability, converts accepted batches to Arrow RecordBatch, and hands them to storage over Arrow Flight. This section explains how each method fits that shared path, how to configure it, and what to watch in production.

The common ingestion path

  1. Client sends data through Kafka, gRPC or HTTP.
  2. The gateway authenticates, enforces quotas, validates the schema, normalizes field names and types, then buffers records into batches until batch_size or batch_timeout_ms triggers a flush.
  3. The gateway sends a bulk proto to the ingest node over gRPC.
  4. The ingest node deduplicates in a configured window, checks buffer ceilings, and writes both short-term and long-term WAL.
  5. The client is acknowledged, or the Kafka offset is committed, only after the short-term WAL write succeeds.
  6. The batch is converted to Arrow and sent to storage over Arrow Flight. Keep batch_size × avg_event_size safely below your gRPC message size limit, leaving headroom for the Arrow metadata added at this stage (see the sizing check after this list).
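
As a concrete check on step 6, multiply the batch size by a typical event size and compare the result with the gRPC cap. The batch size, timeout, and 4 MiB limit below reuse the Kafka example values quoted later in this section; the 2 KiB average event size and the exact YAML nesting are illustrative assumptions.

```yaml
# Batch sizing check. batch_size, batch_timeout_ms, and the 4 MiB cap come from
# the Kafka example later in this section; the 2 KiB average event size is assumed.
#
#   payload ≈ batch_size × avg_event_size
#           ≈ 1000 × 2 KiB ≈ 2 MiB
#   gRPC message cap = 4 MiB  →  roughly 2 MiB of headroom for Arrow metadata
ingestion:
  batching:
    batch_size: 1000        # events per batch
    batch_timeout_ms: 5000  # flush even if the batch is not full
```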

Gateway knobs: ingestion.sources.* to enable sources, ingestion.batching.batch_size, ingestion.batching.batch_timeout_ms, plus proxy caps like proxy.http.max_payload_size_mb and proxy.grpc.max_concurrent_streams.
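
Assuming a YAML layout, those knobs might be arranged as below. The key names are the ones listed above; the nesting and the proxy values are illustrative, not defaults.

```yaml
# Gateway knobs sketch (key names from the text; nesting and values assumed).
ingestion:
  sources:
    kafka: { enabled: true }
    grpc:  { enabled: true }
    http:  { enabled: false }    # keep off unless you are running canaries
  batching:
    batch_size: 1000             # see the sizing check above
    batch_timeout_ms: 5000
proxy:
  http:
    max_payload_size_mb: 10      # illustrative payload cap
  grpc:
    max_concurrent_streams: 100  # illustrative stream cap
```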

Buffering and dedup: processing.buffer.max_size_messages, processing.buffer.max_size_bytes, processing.buffer.flush_interval_seconds, processing.flush_size_messages, processing.dedup_enabled, processing.dedup_window_seconds. Watch buffer queue depth and memory to catch congestion early.

Durability: short-term WAL before ack, plus a long-term WAL stream for replay and indexing. processing.wal.sync_on_write controls the tradeoff between strict durability and throughput.
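
The buffering, dedup, and durability knobs from the two paragraphs above, sketched the same way; the key names are from the text, while the layout and the values are illustrative.

```yaml
# Processing sketch covering buffering, dedup, and WAL durability
# (key names from the text; layout and values illustrative).
processing:
  buffer:
    max_size_messages: 100000    # ceiling on queued messages
    max_size_bytes: 536870912    # 512 MiB ceiling on queued bytes
    flush_interval_seconds: 5    # periodic flush even under light load
  flush_size_messages: 10000     # flush once this many messages accumulate
  dedup_enabled: true
  dedup_window_seconds: 60       # duplicates inside this window are dropped
  wal:
    sync_on_write: true          # fsync before ack: strict durability, lower throughput
```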

When to choose which method

Scenario | Recommended method | Why it fits
Large and continuous telemetry streams | Kafka | Production-grade throughput with consumer and fetch tuning.
Direct app to OpenWit using OTLP or a custom proto | gRPC | Accepts OpenTelemetry collectors and custom exporters with clear size and concurrency controls.
Local tests and pipeline canaries | HTTP | Lightweight JSON endpoint to validate the downstream path. Not recommended for sustained high throughput.

Kafka ingestion

Purpose and fit: Kafka is the preferred entry for sustained high-volume streams. OpenWit runs librdkafka consumers that read protobuf messages from topics, then pass them to the shared gateway.

Core configuration: Enable Kafka under ingestion.sources.kafka. Then tune the Kafka block: brokers, group id, topic patterns, and the consumer profile. High-performance settings expose parallel consumers, processing threads, parallel gRPC clients, and per-thread channel buffer size.

Advanced fetch sizes increase throughput but raise per-thread memory demand. Memory optimizations such as zero copy, preallocation, and a buffer pool trade memory for speed. Batch size and timeout determine how many events are combined before forwarding. Pick the telemetry type to select parsers and index strategies.

Key | What it controls
brokers, group_id, topics | Source brokers and subscription patterns.
consumer_type | Standard or high-performance path.
high_performance.num_consumers, processing_threads, num_grpc_clients, channel_buffer_size | Parallelism and inter-thread capacity.
fetch_min_bytes, fetch_max_bytes, max_partition_fetch_bytes | Throughput versus memory tradeoff.
enable_zero_copy, preallocate_buffers, buffer_pool_size | Lower CPU copies at the cost of a higher memory footprint.
batch_size, batch_timeout_ms | Flush triggers at the gateway and consumer layer. Start conservative, then ramp.
telemetry_type | Hints for parser and index strategy.

Operational guardrails: Batch bytes must fit under the gRPC message cap and inside buffer ceilings. Adjust batch_size with event size and keep a margin for Arrow metadata. The gateway can pause Kafka consumption if backpressure is raised. Offsets are committed only after the short-term WAL is persisted.

Example values from a working config: Eight consumers with sixteen processing threads and four gRPC clients per process. Fetch min 1 MiB, fetch max 50 MiB, 10 MiB per partition. Batch size 1000 with a 5 s timeout to stay under the 4 MiB gRPC message limit. Zero copy and preallocation enabled.
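
Rendered as a config sketch, those example values might look like the block below. The key names are taken from the table above; the nesting, the broker and topic placeholders, and the channel_buffer_size and buffer_pool_size values are assumptions.

```yaml
# Kafka source sketch using the example values from the text.
# Key names are from the table above; nesting and placeholder values are assumed.
ingestion:
  sources:
    kafka:
      enabled: true
      brokers: ["kafka-1:9092", "kafka-2:9092"]   # placeholder brokers
      group_id: openwit-ingest                    # placeholder group id
      topics: ["telemetry.*"]                     # placeholder topic pattern
      consumer_type: high_performance
      high_performance:
        num_consumers: 8
        processing_threads: 16
        num_grpc_clients: 4
        channel_buffer_size: 10000                # illustrative; not from the text
      fetch_min_bytes: 1048576                    # 1 MiB
      fetch_max_bytes: 52428800                   # 50 MiB
      max_partition_fetch_bytes: 10485760         # 10 MiB
      enable_zero_copy: true
      preallocate_buffers: true
      buffer_pool_size: 256                       # illustrative; not from the text
      batch_size: 1000
      batch_timeout_ms: 5000
      telemetry_type: logs                        # selects parsers and index strategy
```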

gRPC ingestion

Purpose and fit: The gRPC entry accepts OpenTelemetry collectors and custom exporters. It is a direct path from applications with explicit controls for concurrency, size limits, connection health, compression, and TLS.

Core configuration: Enable gRPC under ingestion.sources.grpc. Bind address, port, and runtime thread count control server capacity. Concurrency and pooling limits cap total requests and connections.

Message size limits must be set with care. Keepalive and connection age options avoid stuck connections. You can toggle OTLP signals for traces, metrics, or logs. Compression reduces bandwidth at CPU cost. TLS should be on in production.

Key | What it controls
bind, port, runtime_size | Network binding and runtime threads.
max_concurrent_requests, connection_pool_size, worker_threads, max_concurrent_streams | Total concurrency across requests and streams.
max_message_size, max_receive_message_size_mb, max_send_message_size_mb | Caps on message size. High values allow large batches but raise memory risk.
keepalive_*, max_connection_* | Connection liveness and lifecycle.
otlp_*_enabled | Which OTLP signals are accepted.
compression_* | Network bandwidth versus CPU tradeoff.
tls.enabled, cert_path, key_path | Transport security. Off by default in the sample configuration, which is a risk in production.
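
A hedged sketch of the gRPC source, assuming the same YAML layout. The key names are from the table, with the wildcard entries (keepalive_*, max_connection_*, otlp_*_enabled, compression_*) expanded into plausible but assumed names; every value is illustrative.

```yaml
# gRPC source sketch (key names from the table above; wildcard expansions and
# all values are assumptions, not reference defaults).
ingestion:
  sources:
    grpc:
      enabled: true
      bind: "0.0.0.0"
      port: 4317                        # placeholder port
      runtime_size: 8
      max_concurrent_requests: 1000
      connection_pool_size: 64
      worker_threads: 8
      max_concurrent_streams: 100
      max_message_size: 4194304         # 4 MiB; batches plus Arrow metadata must fit under this
      max_receive_message_size_mb: 4
      max_send_message_size_mb: 4
      keepalive_time_ms: 30000          # assumed expansion of keepalive_*
      keepalive_timeout_ms: 10000
      max_connection_age_ms: 600000     # assumed expansion of max_connection_*
      otlp_traces_enabled: true
      otlp_metrics_enabled: true
      otlp_logs_enabled: true
      compression_enabled: true         # saves bandwidth, costs CPU
      tls:
        enabled: true                   # keep on in production
        cert_path: /etc/openwit/tls/server.crt
        key_path: /etc/openwit/tls/server.key
```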

Operational guardrails: Batch bytes must stay below the gRPC message caps. Leave space for Arrow metadata that is added before Arrow Flight. If overload occurs, the gateway returns 429 or pauses upstream reads.

HTTP ingestion

Purpose and fit: The HTTP endpoint accepts JSON payloads with an OTLP-style schema for logs, metrics or traces. It is intended for canaries and pipeline validation rather than continuous high load.

Core configuration: Enable under ingestion.sources.http. Set port, max_concurrent_requests, request_timeout_ms, and max_payload_size_mb. Upstream pool knobs control idle connections and timeouts. Use conservative payload and concurrency limits unless you have headroom on memory and CPU.
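
A conservative sketch of the HTTP source, again assuming a YAML layout. The key names in the top block come from the paragraph above; the upstream pool key names and all values are illustrative.

```yaml
# HTTP source sketch for tests and canaries
# (key names from the text; pool key names and all values assumed).
ingestion:
  sources:
    http:
      enabled: true
      port: 8080
      max_concurrent_requests: 50     # keep low: this path is for canaries, not production load
      request_timeout_ms: 10000
      max_payload_size_mb: 5          # small payloads stay clear of downstream gRPC caps
      upstream_pool:                  # block and key names below are assumptions
        max_idle_connections: 16
        idle_timeout_ms: 30000
```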

Operational guardrails: High concurrency and large payload caps can quickly push gRPC and downstream limits. Keep payloads small for tests and prefer Kafka or gRPC for production throughput.

What backpressure and acks mean for all methods

  • The gateway applies backpressure when buffers or downstream components reach limits. With Kafka, it can pause consumption. With gRPC or HTTP, it returns 429.
  • The ingest node acknowledges only after the short-term WAL is written. The long-term WAL is published in parallel for replay and indexing.

Observability signals to emit

Emit counters, gauges, and spans at each lifecycle stage to see capacity and bottlenecks:

  • event.ingest.received and event.ingest.validated at the gateway
  • buffer.size and buffer_full to track pressure
  • batch.created plus batch.size_bytes histogram
  • wal.write.latency and ack.latency
  • arrow.convert.latency and arrow.send.latency
  • storage.append.latency, parquet.upload.latency, index.build.latency, and index.lag_seconds

Trace spans should include ingest.receive, batch.build, wal.write, arrow.convert, arrow.send, storage.append, upload, and index.build.