Monitoring & Observability

This page explains how to watch OpenWit in production. It covers what each signal means, how the signals relate to one another, and how to use them during normal operations and incident triage.

Metrics

Every node publishes Prometheus-compatible metrics at /metrics. You can scrape them directly or forward them through OpenTelemetry exporters. The subsections below list the core series per role; use them to spot load, backlog, or slow hops.
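For a quick spot check outside your scraper, you can read the endpoint directly. The sketch below is a minimal Python example, assuming a node serving metrics at localhost:9090 (substitute your node's address); it handles only simple counter and gauge lines, and the series names come from the lists that follow.

# Minimal sketch: read one node's /metrics endpoint and pick out a few series.
# The host and port are assumptions; substitute your node's metrics address.
import urllib.request

WATCHED = {
    "openwit_ingest_batches_total",
    "wal_write_latency_ms",
    "openwit_storage_upload_latency_ms",
    "openwit_query_latency_ms",
}

def scrape(url="http://localhost:9090/metrics"):
    """Return {series: value} for the watched series, ignoring label sets."""
    samples = {}
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip HELP/TYPE comment lines
            parts = line.split()
            if len(parts) < 2:
                continue
            name = parts[0].split("{", 1)[0]  # drop any label set
            if name in WATCHED:
                samples[name] = float(parts[1])
    return samples

if __name__ == "__main__":
    for series, value in scrape().items():
        print(f"{series} = {value}")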

Ingestion

  • openwit_ingest_batches_total: Total batches accepted. Expect steady growth under normal load. A flat line while producers are active usually means the gateway is blocking or Ingest is not accepting data.
  • wal_write_latency_ms: Time to persist a batch to short-term WAL. Rising values indicate storage pressure on the ingest box or a slow disk path.

Storage

  • openwit_storage_upload_latency_ms: Time to upload a stable Parquet file to object storage. Use this to see if cloud uploads are the bottleneck.
  • active_file_size_bytes: Current size of the active Parquet file. If this grows without uploads, the node is building files but not finalizing them.

Indexer

  • openwit_index_build_duration_seconds: Time to build an index artifact per file. Watch for long builds if indexing lags behind uploads.

Search

  • openwit_query_latency_ms: End-to-end query time seen by clients. Use this as your primary latency indicator.
  • cache_hit_ratio: Hit ratio for the Electro or Pyro caches. A low ratio suggests the working set is not cached.

Control

  • openwit_nodes_healthy_total: Number of healthy nodes known to control. Drops here usually precede data plane problems.
  • gossip_roundtrip_ms: Time for gossip messages to circulate. Spikes can explain healthy nodes being misreported as unhealthy or delayed rebalancing.

How to read them together

  1. Check Control first. If healthy nodes drop or gossip round-trip spikes, expect routing noise or flapping.
  2. Look at Ingestion next. If batches stop growing or WAL latency climbs, the front door is stuck or the ingest disk is slow.
  3. Move to Storage. Big active files and slow uploads point to object storage drag or bandwidth limits.
  4. Check Indexer duration if you see an indexing backlog in diagnostics.
  5. Finish with Search. If query latency rises while cache hit ratio falls, the system is reading from cold tiers.
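A minimal sketch of that reading order, assuming you already have two {series: value} snapshots taken one scrape interval apart (for example from a helper like the one shown under Metrics); the thresholds and the diagnose helper are illustrative placeholders, not something OpenWit ships.

# Minimal sketch: apply the reading order above to two metric snapshots,
# i.e. dicts of {series: value} taken one scrape interval apart.
# All thresholds are placeholders; replace them with your own baselines.

def diagnose(prev: dict, curr: dict) -> str:
    # 1. Control: fewer healthy nodes or slow gossip means cluster instability.
    if curr.get("openwit_nodes_healthy_total", 0) < prev.get("openwit_nodes_healthy_total", 0):
        return "control: healthy node count dropped"
    if curr.get("gossip_roundtrip_ms", 0) > 250:
        return "control: gossip round-trip time is spiking"
    # 2. Ingestion: a flat batch counter or slow WAL writes means the front door is stuck.
    if curr.get("openwit_ingest_batches_total", 0) <= prev.get("openwit_ingest_batches_total", 0):
        return "ingestion: batch counter is flat (check whether producers are active)"
    if curr.get("wal_write_latency_ms", 0) > 100:
        return "ingestion: WAL writes are slow"
    # 3. Storage: growing active files plus slow uploads points at the object store path.
    if (curr.get("active_file_size_bytes", 0) > prev.get("active_file_size_bytes", 0)
            and curr.get("openwit_storage_upload_latency_ms", 0) > 2000):
        return "storage: active files grow but uploads are slow"
    # 4. Indexer: long builds explain an indexing backlog.
    if curr.get("openwit_index_build_duration_seconds", 0) > 60:
        return "indexer: index builds are slow"
    # 5. Search: rising latency with a falling cache hit ratio means cold reads.
    if (curr.get("openwit_query_latency_ms", 0) > prev.get("openwit_query_latency_ms", 0)
            and curr.get("cache_hit_ratio", 1.0) < prev.get("cache_hit_ratio", 1.0)):
        return "search: queries are reading from cold tiers"
    return "no single stage stands out in these series"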

Tracing

OpenWit emits OpenTelemetry traces with spans for each batch, each upload, and each query. Trace context travels across nodes in gRPC metadata, so you see one end-to-end timeline for a request. Traces can be exported to Jaeger, Tempo, or Honeycomb. Use traces to find exactly where a request slowed down.

IngestBatch
  └─ WALWrite
      └─ ArrowSend
          └─ ParquetUpload
              └─ IndexBuild
                  └─ QueryPlan
                      └─ DataFusionExec

Start at the root span to see total time. Drill into WALWrite for ingest durability, ParquetUpload for cloud path delays, IndexBuild for indexing backlog, and DataFusionExec for execution time.
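The trace context mentioned above rides in request metadata between nodes. The sketch below, using the Python opentelemetry-api, shows how standard W3C trace context is extracted from such a carrier; the header value and the wiring are purely illustrative, not OpenWit's own code.

# Minimal sketch (opentelemetry-api): rebuild a remote span context from
# metadata carrying a standard W3C traceparent header. Values are illustrative.
from opentelemetry import trace
from opentelemetry.propagate import extract

incoming_metadata = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}

ctx = extract(incoming_metadata)  # parse the carrier into a Context
remote = trace.get_current_span(ctx).get_span_context()
if remote.is_valid:
    # Spans started under this context share the same trace id, which is
    # what stitches one end-to-end timeline across nodes.
    print("continuing trace", format(remote.trace_id, "032x"))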

How traces complement metrics

  • A spike in wal_write_latency_ms should match longer WALWrite spans.
  • Rising openwit_storage_upload_latency_ms aligns with longer ParquetUpload spans.
  • Higher openwit_query_latency_ms lines up with longer QueryPlan or DataFusionExec spans.

Logging

All nodes log structured JSON via tracing_subscriber. You can set log levels per module to info, debug, or trace. The sample below shows fields for module, event, file, and latency. Use logs to confirm state transitions that are not obvious from metrics.

Example:

{"level":"INFO","module":"storage","event":"upload_complete","file":"batch_2025_10_15.parquet","latency_ms":234}

Look for “upload_complete” to match a Storage upload. Similar events appear for indexing and query execution.

Health and diagnostics

Every role exposes two endpoints: /health answers liveness checks, and /ready answers readiness checks for traffic. The Control node aggregates cluster status from gossip, so you have a single place to see membership and health. Diagnostics cover WAL backlog, indexing lag, and upload queue depth, which point to where the pipeline is slow or blocked.
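A minimal sweep of both endpoints from Python looks like the sketch below; the host and port pairs are placeholders for your own deployment, while /health and /ready are the paths described above.

# Minimal sketch: sweep /health and /ready across roles.
# The host:port values are placeholders; /health and /ready are the real paths.
import urllib.error
import urllib.request

NODES = {
    "control": "http://control-0:8080",
    "ingest": "http://ingest-0:8080",
    "storage": "http://storage-0:8080",
    "search": "http://search-0:8080",
}

def probe(base_url: str, path: str) -> bool:
    """Return True when the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(f"{base_url}{path}", timeout=2) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

for role, base in NODES.items():
    live = probe(base, "/health")
    ready = probe(base, "/ready")
    print(f"{role}: health={'ok' if live else 'down'} ready={'ok' if ready else 'not-ready'}")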

Use case examples

  • Liveness green but readiness red on Storage: Node is up but cannot serve yet, likely finishing a file or waiting on uploads.
  • Control shows unhealthy peers while data still flows: Check gossip round-trip time and the network segment between nodes.
  • Indexing lag rises while uploads look normal: Indexer build duration will confirm if index creation is slow.

Dashboards

You can build a single Grafana (or other Prometheus-compatible) dashboard using the series listed above. Organizing panels by role mirrors the pipeline and gives a full-path view.

Panels

  • Control: openwit_nodes_healthy_total, gossip_roundtrip_ms
  • Ingestion: openwit_ingest_batches_total, wal_write_latency_ms
  • Storage: active_file_size_bytes, openwit_storage_upload_latency_ms
  • Indexer: openwit_index_build_duration_seconds
  • Search: openwit_query_latency_ms, cache_hit_ratio

These are the exact series names listed above, so panels built on them stay aligned with supported signals.

Trace view

Export traces to Jaeger, Tempo, or Honeycomb. Filter by the span names shown in the example tree to follow a request across nodes.

Logs view

In your log tool, group by module and event to follow state transitions such as upload_complete or index build events.
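If you are working from raw log files instead of a log tool, a short script can do the same grouping. The sketch below assumes one JSON record per line; the file name is a placeholder, and the field names match the sample shown earlier.

# Minimal sketch: group structured log lines by (module, event).
# The file name is a placeholder; field names match the JSON sample above.
import json
from collections import Counter

counts = Counter()
with open("openwit.log", encoding="utf-8") as f:
    for line in f:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        counts[(record.get("module"), record.get("event"))] += 1

for (module, event), n in counts.most_common():
    print(f"{module}/{event}: {n}")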

Alerts that align with these signals

Set alerts only on the series listed above and on the health endpoints; this keeps alerting aligned with supported signals. A sketch of the sustained-interval idea follows the list.

  • Readiness failing on any node for a sustained interval: use /ready.
  • wal_write_latency_ms above a steady baseline: ingest durability slowed.
  • openwit_storage_upload_latency_ms high for many files: cloud uploads are slow.
  • openwit_index_build_duration_seconds above baseline: index backlog likely forming.
  • openwit_query_latency_ms rising while cache_hit_ratio falls: hot data not in cache.
  • openwit_nodes_healthy_total dropping or gossip_roundtrip_ms spiking: cluster instability or network drag.
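The sketch below illustrates that sustained-interval idea in Python, assuming one {series: value} snapshot per scrape; the thresholds and window size are placeholders to replace with your own baselines, and in production this logic belongs in your alerting rules rather than a script.

# Minimal sketch of a sustained-interval check: flag a series only when it
# stays past its threshold for several consecutive scrapes.
# Thresholds and window size are placeholders, not recommended values.
from collections import deque

THRESHOLDS = {
    "wal_write_latency_ms": 100,
    "openwit_storage_upload_latency_ms": 2000,
    "openwit_index_build_duration_seconds": 60,
    "openwit_query_latency_ms": 500,
}
WINDOW = 5  # consecutive scrapes the condition must hold before firing

history = {name: deque(maxlen=WINDOW) for name in THRESHOLDS}

def observe(snapshot: dict) -> list[str]:
    """Feed one {series: value} snapshot; return series breaching for a full window."""
    firing = []
    for name, limit in THRESHOLDS.items():
        samples = history[name]
        samples.append(snapshot.get(name, 0) > limit)
        if len(samples) == WINDOW and all(samples):
            firing.append(name)
    return firing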

Triage playbook

Follow this sequence when users report slow queries or missing data. Each step reads a signal described above.

  1. Check Control health. If healthy nodes dropped, investigate node restarts or network issues first.
  2. Verify readiness on the reported path. Use /ready on Ingest, Storage, and Search.
  3. Look at ingestion. Confirm batches are arriving and WAL latency is normal. If not, pause producers and clear the cause.
  4. Check storage. If active file size grows without uploads, look at the object store or bandwidth to it.
  5. Check indexing. Long index build duration plus an indexing lag in diagnostics explains slow search starts.
  6. Trace a slow query. Inspect the span tree for ParquetUpload or DataFusionExec hotspots.
  7. Confirm logs. Find the JSON events around the time of the issue to correlate uploads, index builds, and query runs.