Configuration
This page explains every section of the OpenWit configuration and how the fields work together. The full configuration file is broken into small parts, each followed by a short explanation.
Values such as `${POSTGRES_DSN}` or `${TLS_CERT_PATH}` are placeholders. Supply them through environment variables or a secrets manager; do not hardcode secrets.
1. Deployment and cluster basics
environment: "production"
deployment:
mode: "local"
discovery_mode: "gossip"
mandatory_nodes:
required_types:
- "control"
- "proxy"
- "ingestor"
- "storage"
- "indexer"
- "search"
- "janitor"
minimum_counts:
control: 1
proxy: 2
ingestor: 2
storage: 3
indexer: 2
search: 2
janitor: 1

Explanation
- `environment` sets the runtime posture. Use `production` for hard defaults; use `development` or `local` for monolith runs.
- `deployment.mode` chooses between a single process and a distributed cluster. `local` (monolith) is for development; `distributed` is for production, where gossip and the control plane are active. Do not use `local` in production because you lose HA.
- `discovery_mode` tells nodes how to find peers: `gossip` is decentralized, `kubernetes` uses DNS, and `static` uses explicit endpoints. On Kubernetes choose `kubernetes`; on VMs choose `gossip`.
- `mandatory_nodes.required_types` lists the roles that must exist, and `minimum_counts` states how many of each are needed for basic health. In production, keep `control` ≥ 3 for quorum and scale ingest, storage, and search by load. Counts that are too low risk unavailability; counts that are too high waste resources. A production-leaning sketch follows this list.
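As a reference point, a production-leaning override might look like the sketch below. The exact counts are assumptions; size them to your load and failure domains.

```yaml
# Illustrative production baseline (counts are assumptions, not shipped defaults)
environment: "production"
deployment:
  mode: "distributed"           # never "local" in production
  discovery_mode: "kubernetes"  # or "gossip" on VMs
  mandatory_nodes:
    minimum_counts:
      control: 3                # >= 3 for leader-election quorum
      proxy: 2
      ingestor: 2
      storage: 3
      indexer: 2
      search: 2
      janitor: 1
```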
2. Control plane, health, autoscaling
control_plane:
enabled: true
grpc_endpoint: "http://localhost:7019"
leader_election:
enabled: true
lease_duration_seconds: 30
renew_deadline_seconds: 20
retry_period_seconds: 5
monitoring:
interval_seconds: 10
thresholds:
cpu_usage_percent: 80
memory_usage_percent: 85
queue_depth_warning: 10000
queue_depth_critical: 50000
error_rate_percent: 5
optimization:
enabled: true
decision_interval_seconds: 60
targets:
max_latency_p99_ms: 1000
min_throughput_messages_per_sec: 10000
max_memory_usage_percent: 80
max_cpu_usage_percent: 75
segment_remediation:
enabled: true
check_interval_seconds: 60
max_concurrent_actions: 5
action_timeout_seconds: 300
retry_attempts: 3
backoff_multiplier: 2.0
max_retries: 3
stuck_threshold_minutes: 10
auto_remediate: true
max_remediate_per_cycle: 100
autoscaling:
enabled: true
scale_up:
cpu_threshold_percent: 70
memory_threshold_percent: 75
queue_depth_threshold: 5000
scale_down:
cpu_threshold_percent: 30
memory_threshold_percent: 40
queue_depth_threshold: 1000
min_replicas: 1
max_replicas: 20
cooldown_seconds: 300
backpressure_advanced:
max_buffer_size: 10000
high_watermark: 0.8
low_watermark: 0.3
backoff_duration_ms: 100
max_backoff_ms: 5000
retry_config:
max_attempts: 3
base_delay_ms: 1000
max_delay_ms: 30000
exponential_base: 2.0

Explanation
- `control_plane.enabled` turns on orchestration. Nodes call `grpc_endpoint` for control APIs.
- `leader_election.*` picks the active control instance. Keep the lease around 30 s so failover is neither too slow nor too twitchy.
- `monitoring.interval_seconds` is the poll cadence. The thresholds trigger warnings on CPU, memory, queue depth, and error rate. Tune them to your SLOs.
- `optimization.*` runs decisions every `decision_interval_seconds` and aims for the targets you set. These targets are the basis for scaling and placement.
- `segment_remediation.*` heals stuck segments with bounded concurrency, timeouts, and backoff. Keep concurrency small to avoid a thundering herd.
- `autoscaling.*` defines scale-up and scale-down triggers. Guard them with `min_replicas`, `max_replicas`, and a long `cooldown_seconds` so you can observe the effect of each change.
- `backpressure_advanced.*` caps internal queues and sets retry backoff. Buffers that are too small lead to frequent 429 responses; buffers that are too big risk memory pressure. The retry schedule implied by `retry_config` is worked through below.
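For the retry settings, the delay between attempts is easiest to see as a schedule. The sketch below assumes the conventional exponential-backoff interpretation (`delay = base_delay_ms * exponential_base^attempt`, capped at `max_delay_ms`); the sample values then work out as shown in the comments.

```yaml
backpressure_advanced:
  retry_config:
    max_attempts: 3        # give up and surface the error after three tries
    base_delay_ms: 1000    # wait ~1 s before retry 1
    exponential_base: 2.0  # ~2 s before retry 2, ~4 s before retry 3
    max_delay_ms: 30000    # any later retries would cap out at 30 s
```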
3. Proxy front door
proxy:
enabled: true
routing:
strategy: "round_robin"
health_check_interval_ms: 5000
max_retries: 3
retry_timeout_ms: 1000
http:
enabled: true
port: 4318
max_concurrent_requests: 10000
request_timeout_ms: 30000
max_payload_size_mb: 100
upstream_pool:
max_idle_per_host: 10
idle_timeout_seconds: 90
connection_timeout_seconds: 30
grpc:
enabled: true
port: 4317
max_concurrent_streams: 1000
channel_pool:
max_channels_per_backend: 10
keepalive_time_ms: 10000
keepalive_timeout_ms: 5000
idle_timeout_minutes: 5
kafka:
enabled: true
consumer:
group_id_prefix: "openwit-proxy"
session_timeout_ms: 30000
heartbeat_interval_ms: 3000
max_poll_records: 500
producer:
batch_size: 16384
linger_ms: 10
compression_type: "snappy"
max_in_flight_requests: 5
health_check:
enabled: true
interval_seconds: 30
timeout_seconds: 5
metadata_timeout_ms: 5000
health_reporting:
report_interval_seconds: 10
include_upstream_health: true
include_connection_stats: true
include_latency_metrics: true
weights:
cpu_weight: 0.3
memory_weight: 0.3
connection_weight: 0.4

Explanation
- `routing.strategy` decides how to spread client traffic. The weighted strategy uses the weights for CPU, memory, and connections; a sketch follows this list.
- HTTP has explicit concurrency, timeout, and payload limits plus an upstream connection pool. gRPC has stream limits and a channel pool with keepalive. Kafka pass-through has consumer, producer, and broker health-check settings.
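If you move off round robin, a weighted setup could look like the sketch below. The `weighted` strategy value and the scoring (a weighted sum of CPU, memory, and connection utilization) are assumptions based on the description above; keep the three weights summing to 1.0 so backend scores stay comparable.

```yaml
proxy:
  routing:
    strategy: "weighted"     # assumed value; the sample ships "round_robin"
  weights:
    cpu_weight: 0.3          # 30% of a backend's score comes from CPU usage
    memory_weight: 0.3       # 30% from memory usage
    connection_weight: 0.4   # 40% from open connections; weights sum to 1.0
```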
4. Ingestion: Kafka, gRPC, HTTP
ingestion:
sources:
kafka: { enabled: true }
grpc: { enabled: true }
http: { enabled: false }
kafka:
brokers: "ns1005571.ip-147-135-65.us:9092"
group_id: "ingestion-group"
topics:
- "v6.qtw.traces.*.*"
consumer_type: "standard"
high_performance:
num_consumers: 8
processing_threads: 16
num_grpc_clients: 4
channel_buffer_size: 100000
fetch_min_bytes: 1048576
fetch_max_bytes: 52428800
max_partition_fetch_bytes: 10485760
enable_zero_copy: true
preallocate_buffers: true
buffer_pool_size: 1000
topic_index_config:
default_index_position: 3
patterns:
- match: "v6.qtw.*.*.*"
index_position: 3
auto_generate:
enabled: false
prefix: "auto_"
strategy: "hash"
consumer:
session_timeout_ms: 180000
heartbeat_interval_ms: 3000
max_poll_interval_ms: 900000
max_poll_records: 50000
batching:
batch_size: 1000
batch_timeout_ms: 5000
telemetry_type: "traces"Explanation
- `sources` toggles entry points. Enable only what you need to keep the surface small.
- Kafka basics: `brokers`, `group_id`, `topics`, `consumer_type`. You can switch to `high_performance` to raise parallelism and buffer sizes.
- `high_performance.*` sets the consumer count, processing threads, gRPC clients, and channel buffer size. These lift throughput at a higher memory cost.
- `fetch_*` increases pull sizes. Large values raise RAM usage; start conservative, then ramp.
- `enable_zero_copy`, `preallocate_buffers`, and `buffer_pool_size` are memory speedups. Watch buffer lifetime and footprint.
- `topic_index_config` controls which position in the topic name supplies the index (see the sketch after this list). `auto_generate` can create names with a prefix and strategy.
- `consumer.*` sets liveness and polling limits. `batching.batch_size` is 1000 in the sample to stay under gRPC size limits; `batch_timeout_ms` forces a flush on time.
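To make the index-position fields concrete, here is how the sample pattern would apply to a hypothetical dot-separated topic name. Both the topic and the interpretation (zero-based position within the split name) are illustrative assumptions.

```yaml
topic_index_config:
  default_index_position: 3   # which dot-separated part of the topic names the index
  patterns:
    - match: "v6.qtw.*.*.*"   # e.g. hypothetical topic "v6.qtw.traces.checkout.eu1"
      index_position: 3       # parts [v6, qtw, traces, checkout, eu1] -> "checkout"
  auto_generate:
    enabled: false            # when true, generated index names use the prefix/strategy
    prefix: "auto_"
    strategy: "hash"
```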
grpc:
bind: "0.0.0.0"
port: 50051
runtime_size: 8
max_concurrent_requests: 10000
connection_pool_size: 100
max_message_size: 208715200
max_receive_message_size_mb: 200
max_send_message_size_mb: 100
keepalive_time_ms: 60000
keepalive_timeout_ms: 20000
max_connection_idle_seconds: 300
max_connection_age_seconds: 0
max_connection_age_grace_seconds: 10
otlp_traces_enabled: true
otlp_metrics_enabled: true
otlp_logs_enabled: true
worker_threads: 4
max_concurrent_streams: 1000
compression_enabled: true
accepted_compression_algorithms: ["gzip", "zstd"]
tls:
enabled: false
cert_path: "${TLS_CERT_PATH}"
key_path: "${TLS_KEY_PATH}"Explanation
- Bind address, port, and `runtime_size` control concurrency. Use `max_concurrent_requests` and `connection_pool_size` to match load.
- Size limits allow big batches. Ensure downstream can handle them; very large messages create memory spikes.
- Keepalive settings prevent idle disconnects. OTLP toggles pick signal types. Compression reduces bandwidth at some CPU cost.
- TLS is off in the sample, which is a risk in production. Turn it on and load certificates from secrets, as sketched below.
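A minimal sketch of enabling TLS on the ingestion gRPC listener, reusing the placeholder variables from the top of this page so the certificate and key never live in the config file:

```yaml
ingestion:
  grpc:
    tls:
      enabled: true
      cert_path: "${TLS_CERT_PATH}"  # resolved from the environment or a secrets manager
      key_path: "${TLS_KEY_PATH}"
```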
http:
port: 4318
max_concurrent_requests: 5000
request_timeout_ms: 30000
max_payload_size_mb: 100

Explanation
- HTTP is fine for canaries and checks. Prefer gRPC or Kafka for sustained load.
5. Processing, WAL, pipeline, backpressure
processing:
buffer:
max_size_messages: 1000000
max_size_bytes: 10737418240
flush_interval_seconds: 10
flush_size_messages: 50000
worker_threads: 16
io_threads: 8
compression: "snappy"
dedup_enabled: true
dedup_window_seconds: 60

Explanation
- Buffer caps protect memory. Flushes happen on time or size. Dedup removes duplicates inside a 60 s window. Memory needs grow with rate and window.
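To put rough numbers on the buffer and dedup costs, the comments below walk through the sample values; the traffic rate is an assumption for illustration.

```yaml
processing:
  buffer:
    max_size_messages: 1000000   # hard cap: ~1M buffered messages
    max_size_bytes: 10737418240  # hard cap: 10 GiB of buffered data
    flush_interval_seconds: 10   # flush at least every 10 s...
    flush_size_messages: 50000   # ...or as soon as 50k messages accumulate
  dedup_enabled: true
  dedup_window_seconds: 60       # e.g. at 100k msg/s, roughly 6M message keys live in the window
```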
wal:
enabled: true
dir: "./data/wal"
max_size_bytes: 53687091200
segment_size_bytes: 1073741824
sync_on_write: false
long_term_retention_hours: 168
wal_cleanup:
enabled: true
interval_minutes: 30
batch_size: 100
short_term:
delete_after_storage: true
long_term:
retention_hours: 168
delete_after_index: true

Explanation
- WAL ensures durability before ack. The sample has sync_on_write: false for throughput. Enable sync for strict durability. Cleanup removes short-term WAL after storage, long-term after indexing.
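If you need strict durability, a possible override is sketched below; the segment-count comment is simple arithmetic from the sample sizes.

```yaml
wal:
  enabled: true
  sync_on_write: true             # fsync before acknowledging; slower but strictly durable
  max_size_bytes: 53687091200     # 50 GiB total WAL budget
  segment_size_bytes: 1073741824  # 1 GiB segments -> at most ~50 segments on disk
  long_term_retention_hours: 168  # 7 days of long-term WAL
```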
pipeline:
batch_size: 10000
batch_timeout_seconds: 10
worker_threads: 16
backpressure:
enabled: true
initial_queue_size: 1000
max_queue_size: 100000
target_utilization_percent: 70.0

Explanation
- Pipeline batch and timeout should match ingestion and gRPC limits. Backpressure caps queues and targets safe utilization.
6. System metrics and LSM engine
lsm_engine:
memtable_size_limit_mb: 256
system_metrics:
collection_interval_seconds: 5
enable_cpu_metrics: true
enable_memory_metrics: true
enable_disk_metrics: true
enable_network_metrics: true

Explanation
- LSM memtable limit is small and safe for local storage paths. System metrics collect CPU, memory, disk, network every 5 s.
7. Storage, compaction, cloud backends, Parquet
storage:
backend: "local"
data_dir: "./data/storage"
ram_time_threshold_seconds: 3600
disk_time_threshold_seconds: 86400
parquet_split_size_mb: 500

Explanation
- Local backend with a data dir. Data sits in RAM for 1 h, on disk for 1 day, then is ready for cloud uploads at about 500 MB splits.
compaction:
enabled: true
target_size_mb: 100
min_files_to_compact: 5
max_files_per_task: 20
min_file_age_minutes: 5
scan_interval_minutes: 30
task_timeout_minutes: 60
stale_check_minutes: 10
parallelism: 2
memory_limit_mb: 1024
max_concurrent_reads: 5
retention_overlap_hours: 2
query_grace_period_ms: 5000
validation_enabled: true
use_hard_links: true

Explanation
- Compaction merges small files. Size, timing, and safety settings are explicit. Validation checks the output, and hard links speed up the file transition. Keep parallelism and concurrent reads low to avoid IO spikes.
local:
enabled: true
path: "./data/storage"
max_disk_usage_percent: 80
cleanup_interval_seconds: 3600

Explanation
- Local housekeeping prevents full disks. Clean every hour.
azure:
enabled: true
account_name: "mwbetaazureeastus"
container_name: "mwopenwitbeta"
access_key: "PdNPJEDb/.../o5wUw=="
prefix: "openwit/storage"
endpoint: "https://mwbetaazureeastus.blob.core.windows.net"
# connection_string: ""
# sas_token:

Explanation
- Azure is enabled with an account, container, and endpoint. The sample shows an inline access key, which should be moved to a secrets manager (see the sketch below). A connection string or SAS token can be used instead. Rotate the example credentials before any use.
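A safer shape for the same block moves the key out of the file. The `AZURE_STORAGE_ACCESS_KEY` variable name is an assumption; any environment variable or secret reference your deployment supports will do.

```yaml
azure:
  enabled: true
  account_name: "mwbetaazureeastus"
  container_name: "mwopenwitbeta"
  access_key: "${AZURE_STORAGE_ACCESS_KEY}"  # injected from env or a secrets manager
  prefix: "openwit/storage"
  endpoint: "https://mwbetaazureeastus.blob.core.windows.net"
```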
s3:
enabled: false
bucket: "${S3_BUCKETNAME}"
region: "${S3_REGION}"
access_key_id: "${AWS_ACCESS_KEY_ID}"
secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
prefix: "openwit/storage"
# endpoint: ""
gcs:
enabled: false
bucket: "${GCS_BUCKET_NAME}"
service_account_key: "${GCS_SERVICE_ACCOUNT_KEY}"
# project_id: "${GCP_PROJECT_ID}"Explanation
- S3 and GCS are present but disabled. Use env vars for secrets. Endpoint is optional for S3-compatible systems.
parquet:
row_group_size: 1000000
compression_codec: "zstd"
compression_level: 3
enable_statistics: true
data_page_size_kb: 1024
file_rotation:
file_size_mb: 200
file_duration_minutes: 60
enable_active_stable_files: true
upload_delay_minutes: 2
local_retention:
retention_days: 7
cleanup_interval_hours: 1
delete_after_upload: false

Explanation
- Parquet writer and rotation values are tuned for scan speed, compression, and manageable object size. Align `file_rotation.file_size_mb` with the cloud split size (`parquet_split_size_mb`); see the note below. Keep a short upload delay to batch arrivals. Local retention governs cleanup on the node.
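One way to align the two sizes, shown as an assumption rather than a required value, is to rotate files at the same size the storage tier uses for cloud splits:

```yaml
parquet:
  file_rotation:
    file_size_mb: 200          # assumption: set the storage tier's parquet_split_size_mb to 200 as well,
    file_duration_minutes: 60  # so each rotated file maps cleanly onto one upload split
    upload_delay_minutes: 2    # a short delay lets late arrivals join the same file
```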
8. Networking, gossip, inter-node, memory, monitoring
networking:
discovery:
mode: "static"
kubernetes:
enabled: false
service_endpoints:
control_plane: "http://localhost:7019"
storage: "http://localhost"
ingestion: "http://localhost"
indexer: "http://localhost"
search: "http://localhost:8083"
gossip:
listen_addr: "127.0.0.1:8000"
seed_nodes:
- "127.0.0.1:9092"
- "127.0.0.1:9093"
- "127.0.0.1:9094"
- "127.0.0.1:9095"
heartbeat_interval_ms: 1000
failure_detector_timeout_ms: 5000
inter_node:
connection_timeout_ms: 5000
request_timeout_ms: 30000
max_connections_per_node: 10
keepalive_interval_ms: 10000

Explanation
- `service_endpoints` are plain URLs for each role. Useful in monolith or static setups.
- Gossip seeds and heartbeats form the membership fabric. A 1 s heartbeat with a 5 s failure-detector timeout is aggressive (roughly five missed heartbeats before a peer is marked failed), so keep the network stable. A relaxed sketch follows this list.
- `inter_node.*` sets timeouts, connection caps, and keepalive. Match `max_connections_per_node` with the gRPC channel pools to avoid congestion.
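On a less stable network (for example cross-zone VMs), slightly relaxed gossip timings reduce false failure detections. The values below are assumptions to illustrate the trade-off, not recommendations from the sample:

```yaml
networking:
  gossip:
    heartbeat_interval_ms: 2000         # fewer heartbeats on the wire
    failure_detector_timeout_ms: 10000  # still ~5 missed heartbeats before a peer is marked failed
  inter_node:
    max_connections_per_node: 10        # keep in line with the proxy's gRPC channel pool size
```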
memory:
max_memory_gb: 32
monitoring:
prometheus_port: 9090
health_check_port: 8080
node_health_thresholds:
warning_cpu_percent: 70
warning_memory_percent: 75
warning_disk_percent: 80
critical_cpu_percent: 90
critical_memory_percent: 95
critical_disk_percent: 95
check_interval_ms: 1000
unhealthy_duration_ms: 5000
recovery_duration_ms: 3000

Explanation
- Global memory cap must match container limits. Overcommitting causes OOM.
- Monitoring exposes Prometheus and health ports. Thresholds and durations add hysteresis so status does not flap.
9. Metastore, indexing, search
metastore:
backend: "postgres"
sled:
path: "./data/metastore"
cache_size_mb: 512
postgres:
connection_string: "postgresql://sanjithmeta:sanjith123@localhost:5432/sanjithdb"
max_connections: 20
schema_name: "metastore"Explanation
- `backend: postgres` is recommended for production. Use TLS and store credentials in secrets (see the sketch below). Keep the connection pool small. Use managed Postgres with backups and PITR. Sled is fine for monolith runs.
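A safer shape for the Postgres block uses the `${POSTGRES_DSN}` placeholder from the top of this page; the DSN in the comment is only an example, and the `sslmode` parameter assumes standard Postgres TLS settings.

```yaml
metastore:
  backend: "postgres"
  postgres:
    # e.g. POSTGRES_DSN="postgresql://openwit:<password>@db.internal:5432/openwit?sslmode=require"
    connection_string: "${POSTGRES_DSN}"
    max_connections: 20
    schema_name: "metastore"
```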
indexing:
enabled: true
batch_size: 1000
worker_threads: 4
search:
enabled: true
max_results: 1000

Explanation
- Indexing has a worker pool and a batch size per job. It is CPU and memory heavy, so keep it asynchronous. Search caps results to protect memory and network.
10. Actor system, janitor, node-local storage, alerting, security
actors:
mailbox_size: 1000
worker_threads: 16
janitor:
cleanup_interval_seconds: 3600
retention_days: 30
storage_node:
data_dir: "./data/storage_node"
alerting:
enabled: false
security:
tls:
enabled: false

Explanation
- Actors queue pending messages in a mailbox. A mailbox that is too small throttles senders; one that is too large holds extra memory. Match worker threads to CPU cores.
- Janitor enforces lifecycle on a set interval. Retention must match compliance.
- Alerting is off in the sample, which is risky in production; enable it in real clusters. TLS is also off. Do not run without TLS or mTLS in production; a minimal sketch follows.
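A minimal sketch of turning both back on for a real cluster; only `enabled` appears in the sample, so how certificates are supplied here is an assumption (most likely the same placeholder convention used by the ingestion gRPC TLS block):

```yaml
alerting:
  enabled: true    # surface threshold breaches instead of only recording them
security:
  tls:
    enabled: true  # encrypt external and inter-node traffic
    # assumption: certificate material is provided the same way as ingestion.grpc.tls,
    # i.e. via ${TLS_CERT_PATH} / ${TLS_KEY_PATH} from a secrets manager
```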
11. Performance tuning and service ports
performance:
threads:
worker_threads: 16
blocking_threads: 32
io_threads: 8
memory:
allocator: "jemalloc"
memory_limit_gb: 32
network:
tcp_nodelay: true
tcp_keepalive_seconds: 60
socket_buffer_size_kb: 8192

Explanation
- Size worker threads near CPU cores. Use extra blocking threads for Parquet IO. jemalloc is recommended for high concurrency. Tune socket settings for stable traffic.
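As an illustration, on a 16-core node the pools might be sized as below. The ratios simply restate the sample's proportions and are assumptions, not measured recommendations:

```yaml
performance:
  threads:
    worker_threads: 16     # ~1 per core for CPU-bound work
    blocking_threads: 32   # ~2x workers to absorb blocking Parquet and object-store IO
    io_threads: 8          # ~0.5x workers for network IO
  memory:
    allocator: "jemalloc"  # fewer fragmentation stalls under high concurrency
    memory_limit_gb: 32    # keep at or below the container / cgroup limit
```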
service_ports:
control_plane: { service: 7019, gossip: 9092 }
storage: { service: 8081, arrow_flight: 9401, gossip: 9096 }
ingestion: { grpc: 50051, http: 4318, arrow_flight: 8089, gossip: 9095 }
indexer: { service: 8085, arrow_flight: 8090, gossip: 9097 }
proxy: { service: 8080, gossip: 9094 }
search: { service: 8083, grpc: 50059, gossip: 9098 }

Explanation
- Port map for each role. Avoid collisions. Expose only what the edge needs. Route internal ports inside the cluster network.
Safety Notes
- Do not run `local` mode in production, since HA is disabled.
- Keep control replicas at three or more when you move to production.
- TLS is off in the sample. Turn it on for both external and inter-node traffic.
- The Azure access key and the Postgres credentials appear inline in the sample. Replace secrets with environment variables or a secrets manager, and rotate the keys.