Architecture Overview

The Query Pipeline is the read path that takes a client query and returns results. It starts at the Proxy, plans on the Search node, prunes with metadata and indexes, fetches only the needed Parquet fragments through a tiered cache or from cloud storage, executes with DataFusion or Ballista, then streams Arrow RecordBatches or JSON back to the client. This page describes the components, the end-to-end sequence, the planning and pruning rules, the cache and fetch strategy, execution, and the main observability hooks.

Component map

  • Proxy: Entry point for SQL or text queries. It authenticates, applies light validation and rate limits, then forwards to the Search node. It can route by query type and by cache warmth, so hot requests go to warm nodes.
  • Search node: Orchestrates the read path. It runs the Planner, queries the Metastore, loads indexes, shapes reads, triggers fetch, and drives execution.
  • Planner: Parses SQL to AST, builds a logical plan, applies rewrites such as predicate and projection pushdown, extracts time bounds, and prepares pruning.
  • Metastore client: Queries Postgres for candidate files in a time window. The lookup is cheap because of an index on (start_ts, end_ts), and it returns row-group boundaries so reads can be precise.
  • Index loader: Loads zone maps, bloom or bitmap artifacts, and Tantivy segments from cache or cloud to support pruning.
  • Fetcher: Retrieves Parquet row groups from the cache tiers (Electro or Pyro) when available; otherwise it streams from cloud storage via OpenDAL. Cached reads are served over Arrow Flight.
  • Execution engine: DataFusion for local execution or Ballista for distributed execution across workers.
  • Full-text engine: Tantivy evaluates text predicates; the preferred integration runs it as a pre-filter ahead of the columnar scan.
  • Result assembler: Formats and streams Arrow or JSON to the client.
  • Cache node: Serves Arrow in RAM and Parquet on local disk via Arrow Flight. Tiers are named Electro for RAM and Pyro for disk. The cold tier is the object store, Cryo.
  • Control plane: Provides node health, routing decisions, and fallbacks.

Conceptual wiring

Client → Proxy → Search (Planner)
                 ├─ Metastore (Postgres) → candidate files + row groups
                 ├─ Index loader → zonemap | bloom | bitmap | tantivy
                 ├─ Fetcher → Electro (RAM) / Pyro (Disk) / Cryo (Cloud)
                 └─ Execution → DataFusion or Ballista → Result stream
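
The Metastore branch of this wiring reduces to a time-interval overlap test over candidate files. A minimal sketch in Rust, where the struct and field names are assumptions and only start_ts and end_ts come from this page:

    /// Hypothetical shapes for the Metastore step in the wiring above.
    #[derive(Debug)]
    struct CandidateFile {
        uri: String,
        start_ts: i64,                         // microseconds, matching the (start_ts, end_ts) index
        end_ts: i64,
        row_group_boundaries: Vec<(u64, u64)>, // assumed (byte_offset, byte_length) per row group
    }

    #[derive(Debug, Clone, Copy)]
    struct TimeWindow {
        start: i64,
        end: i64,
    }

    /// A file is a candidate when its time range overlaps the query window.
    fn overlaps(file: &CandidateFile, window: TimeWindow) -> bool {
        file.start_ts <= window.end && file.end_ts >= window.start
    }

    fn main() {
        let file = CandidateFile {
            uri: "s3://bucket/table/part-000.parquet".to_string(),
            start_ts: 1_700_000_000_000_000,
            end_ts: 1_700_000_600_000_000,
            row_group_boundaries: vec![(4, 8_388_608)],
        };
        let window = TimeWindow { start: 1_700_000_300_000_000, end: 1_700_000_900_000_000 };
        assert!(overlaps(&file, window));
    }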

End-to-end sequence

  1. Ingress at Proxy: Client sends SQL or a search expression to Proxy. Proxy authenticates, applies rate limiting, and forwards to the Search node. Routing can prefer warm nodes and can vary by query type.
  2. Plan build on Search: The Planner parses SQL to AST, builds a DataFusion logical plan, and applies rewrites such as predicate and projection pushdown. It extracts the query time window. If there is no explicit time filter it applies a configurable default lookback to avoid scanning unbounded history.
  3. Candidate discovery in Metastore: The Metastore query to Postgres finds Parquet files that overlap the time window. The lookup is indexed on time bounds and returns file metadata with row-group boundaries.
  4. Pruning cascade with indexes: The Planner loads needed artifacts and prunes in a fixed order: first zonemap for coarse min-max elimination, then bloom or bitmap for equality or small IN predicates, then Tantivy for full-text. This drops files and row groups before heavy reads.
  5. Read shaping: After pruning, the Planner builds FilePartRequest entries with (file_uri, row_group_indices, byte_offsets, columns_needed); a sketch of this shape follows the list. Only required columns are projected.
  6. Fetch via tiered cache or cloud: The Fetcher checks Electro first and uses Arrow Flight for cached Arrow batches. If not present, it checks Pyro for local Parquet. Otherwise, it streams from Cryo using OpenDAL with ranged reads. Fetch is parallel across files and row groups using a bounded worker pool. Optional prefetch can pull adjacent groups when locality is likely. Per-tenant and per-node limits keep disks and networks safe.
  7. Execution: The plan runs on DataFusion with projection and filter pushdown into Parquet readers. For heavy queries, the plan translates to Ballista and runs across workers that access caches or pull from the cloud. For text predicates, the preferred integration is to run Tantivy first to get file or row ids, then intersect with other filters before scan. A join-based approach exists but is less efficient.
  8. Result delivery: The pipeline assembles results and streams Arrow RecordBatches or JSON to the client. Hot results or loaded batches can be cached for reuse.
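
Step 5 names the FilePartRequest shape. A minimal sketch of it and of the read-shaping helper follows; the struct layout and helper are assumptions, and only the four field names come from the sequence above:

    /// Step 5 (read shaping) as an illustrative sketch, not the project's actual API.
    #[derive(Debug)]
    struct FilePartRequest {
        file_uri: String,
        row_group_indices: Vec<usize>,
        byte_offsets: Vec<(u64, u64)>, // assumed (offset, length) per surviving row group
        columns_needed: Vec<String>,
    }

    /// Build one request per file from the row groups that survived pruning,
    /// projecting only the columns the plan needs.
    fn shape_file_parts(
        file_uri: &str,
        row_group_boundaries: &[(u64, u64)],
        surviving_row_groups: &[usize],
        columns_needed: &[String],
    ) -> FilePartRequest {
        FilePartRequest {
            file_uri: file_uri.to_string(),
            row_group_indices: surviving_row_groups.to_vec(),
            byte_offsets: surviving_row_groups
                .iter()
                .map(|&i| row_group_boundaries[i])
                .collect(),
            columns_needed: columns_needed.to_vec(),
        }
    }

    fn main() {
        let boundaries = vec![(4, 1_048_576), (1_048_580, 1_048_576), (2_097_156, 524_288)];
        let part = shape_file_parts(
            "s3://bucket/table/part-000.parquet",
            &boundaries,
            &[0, 2], // row group 1 was pruned by the zonemap/bloom cascade
            &["ts".to_string(), "level".to_string()],
        );
        println!("{part:?}");
    }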

Query planning and optimization

The Planner performs a series of well-defined steps:

  • Parse SQL to AST using the DataFusion parser and capture predicates and projections.
  • Build and rewrite the logical plan with predicate pushdown, constant folding, and projection pushdown.
  • Extract the time range from the WHERE clause. If no filter is present, apply a default lookback to cap the history scanned.
  • Look up candidate files in Postgres for the window and get row-group boundaries.
  • Apply index-driven pruning in a cascade: zonemap → bloom/bitmap → Tantivy. Combine per-index results with Boolean set logic to shrink the file and row-group set.
  • Select row groups and shape file parts with precise byte ranges and only the needed columns.

This process ensures that the execution plan touches the smallest set of objects and rows that can answer the query.
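
The time-window extraction and default lookback can be sketched as a small guard on the plan. The function, struct, and the one-hour value below are assumptions; only the behavior, falling back to a configurable lookback when no time filter is present, comes from this page:

    /// Minimal sketch of the planner guardrail: if the query carried no explicit
    /// time predicate, cap the scan with a configurable default lookback.
    #[derive(Debug, Clone, Copy)]
    struct TimeWindow {
        start_us: i64,
        end_us: i64,
    }

    fn effective_window(
        extracted: Option<TimeWindow>, // bounds pulled from the WHERE clause, if any
        now_us: i64,
        default_lookback_secs: i64,    // e.g. planner.default_time_window_seconds
    ) -> TimeWindow {
        extracted.unwrap_or(TimeWindow {
            start_us: now_us - default_lookback_secs * 1_000_000,
            end_us: now_us,
        })
    }

    fn main() {
        let now_us = 1_700_000_000_000_000;
        // No time filter in the query: fall back to the configured lookback.
        let w = effective_window(None, now_us, 3_600);
        assert_eq!(w.end_us - w.start_us, 3_600 * 1_000_000);
        println!("{w:?}");
    }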

Index usage and integration

Zonemap artifacts store per-row-group min and max for selected columns and allow fast coarse elimination without opening Parquet pages. Bloom filters provide probabilistic membership checks for high-cardinality columns. Bitmap indexes provide exact set membership and enable fast row selection for low to mid cardinality. Tantivy supports full-text search and returns candidate documents or file references. The planner combines these using Boolean algebra to compute the final parts to read. The preferred full-text integration is to run Tantivy first to pre-filter, then intersect with other predicates.
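
The cascade amounts to successive set refinement over files and row groups. A minimal sketch with hypothetical artifact shapes; real zone maps, bloom filters, and bitmap indexes have their own on-disk formats and APIs:

    use std::collections::HashSet;

    /// Illustrative per-row-group zone map for one column.
    struct ZoneMap {
        min_max: Vec<(i64, i64)>, // (min, max) per row group
    }

    impl ZoneMap {
        /// Coarse elimination: keep row groups whose [min, max] could contain `value`.
        fn candidates_for_eq(&self, value: i64) -> HashSet<usize> {
            let mut keep = HashSet::new();
            for (i, &(min, max)) in self.min_max.iter().enumerate() {
                if min <= value && value <= max {
                    keep.insert(i);
                }
            }
            keep
        }
    }

    /// Stand-in for a bloom or bitmap membership check over surviving groups.
    fn membership_candidates(
        groups: &HashSet<usize>,
        maybe_contains: impl Fn(usize) -> bool,
    ) -> HashSet<usize> {
        groups.iter().copied().filter(|&g| maybe_contains(g)).collect()
    }

    fn main() {
        let zonemap = ZoneMap { min_max: vec![(0, 99), (100, 199), (200, 299)] };
        // Zonemap first: only row group 1 can contain id = 150.
        let after_zonemap = zonemap.candidates_for_eq(150);
        // Then bloom/bitmap (simulated here); only surviving groups are tested, so
        // the Boolean intersection is implicit. Tantivy would refine further.
        let after_bloom = membership_candidates(&after_zonemap, |_group| true);
        println!("row groups to read: {after_bloom:?}");
    }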

Tiered cache and fetch strategy

Cache tiers are explicit and ordered:

  • Electro (RAM): Arrow RecordBatches in memory for the fastest access. Served via Arrow Flight.
  • Pyro (Disk): Pre-downloaded Parquet on local disk.
  • Cryo (Cloud): Object store for cold data.

The Fetcher checks Electro, then Pyro, then Cryo. It issues ranged reads, so it does not download entire files when only parts are needed, fetches in parallel across files and groups using a bounded worker pool, and may prefetch adjacent groups when locality is likely. Per-tenant and per-node limits protect the network and disks. When data is in cache, the system favors Arrow Flight to move columnar data quickly.
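
A minimal sketch of the tier-ordered, bounded-parallel fetch using the tokio and futures crates. The tier checks are stubs, and the real Fetcher, Arrow Flight, and OpenDAL interfaces are not shown:

    use futures::stream::{self, StreamExt};

    /// The documented tier order.
    #[derive(Debug)]
    enum Tier {
        Electro, // Arrow batches in RAM, served via Arrow Flight
        Pyro,    // Parquet on local disk
        Cryo,    // object store, ranged reads via OpenDAL
    }

    // Stubbed tier lookups for illustration only.
    async fn check_electro(_part: usize) -> bool { false }
    async fn check_pyro(_part: usize) -> bool { true }

    /// Check Electro, then Pyro, then fall back to a ranged read from Cryo.
    async fn fetch_part(part_id: usize) -> (usize, Tier) {
        if check_electro(part_id).await {
            (part_id, Tier::Electro)
        } else if check_pyro(part_id).await {
            (part_id, Tier::Pyro)
        } else {
            (part_id, Tier::Cryo)
        }
    }

    #[tokio::main]
    async fn main() {
        // Bounded parallelism across files and row groups: at most 4 fetches in flight.
        let results: Vec<(usize, Tier)> = stream::iter(0..8usize)
            .map(fetch_part)
            .buffer_unordered(4)
            .collect()
            .await;
        println!("{results:?}");
    }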

Execution details

Local execution creates a DataFusion SessionContext, registers Parquet tables or listing tables that point to the selected row groups, and relies on projection and filter pushdown in the Parquet reader. For heavy queries, the Search node translates the plan to Ballista and submits it to the cluster so workers can execute in parallel. For text queries, the pre-filter approach with Tantivy is preferred; a join-based approach exists but is less efficient. As a rule of thumb, DataFusion thread count tracks CPU cores, and Ballista is used when a single node cannot fit the working set.
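
A minimal local-execution sketch with DataFusion. The table name, file path, and query are placeholders, and per-row-group selection, cache integration, and the Ballista handoff are omitted:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        // Local execution: one SessionContext with Parquet registered as a table.
        let ctx = SessionContext::new();
        ctx.register_parquet("logs", "/tmp/logs.parquet", ParquetReadOptions::default())
            .await?;

        // Projection and the filters are pushed down into the Parquet reader
        // by DataFusion's optimizer.
        let df = ctx
            .sql("SELECT ts, level, message FROM logs \
                  WHERE level = 'ERROR' AND ts >= 1700000000")
            .await?;

        // Results come back as Arrow RecordBatches (collect() buffers them here).
        let batches = df.collect().await?;
        println!("batches returned: {}", batches.len());
        Ok(())
    }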

Result assembly and delivery

Results are computed in a columnar engine and streamed back to the client as Arrow RecordBatches or JSON. The system can cache hot results or the loaded batches to improve subsequent latency for common dashboards and repeated time windows.
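
For the Arrow leg of delivery, a minimal sketch that serializes RecordBatches into the Arrow IPC stream format with arrow-rs. The schema and values are placeholders, and the JSON and Arrow Flight paths are not shown:

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::ipc::writer::StreamWriter;
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // A tiny result batch standing in for engine output.
        let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
        let col: ArrayRef = Arc::new(Int64Array::from(vec![1_700_000_000_i64, 1_700_000_001]));
        let batch = RecordBatch::try_new(schema.clone(), vec![col])?;

        // Serialize into the Arrow IPC stream framing a client would consume.
        let mut buf: Vec<u8> = Vec::new();
        {
            let mut writer = StreamWriter::try_new(&mut buf, &schema)?;
            writer.write(&batch)?;
            writer.finish()?;
        }
        println!("serialized {} bytes", buf.len());
        Ok(())
    }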

Constraints and guardrails

The design calls out three constraints that shape the pipeline:

  • Large and numerous files require that metadata lookup and pruning be cheap.
  • Indexing lag means a query may run before artifacts are present, so the planner must still answer with time-range pruning.
  • Multi-tenant spikes require fairness so one noisy tenant does not starve others. Rate limits and per-tenant concurrency caps apply at fetch.

The default lookback guardrail applies when a query has no explicit time filter, so the system avoids unbounded scans.
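
The indexing-lag constraint means pruning must degrade gracefully. A minimal sketch of that fallback, with hypothetical types; only the behavior, answering from time-range pruning alone when artifacts are missing, comes from this page:

    use std::collections::HashSet;

    /// Hypothetical pruning outcome for one time-matched file.
    struct PruneResult {
        surviving_row_groups: HashSet<usize>,
    }

    /// If index artifacts are not yet available (indexing lag), fall back to
    /// time-range pruning only: keep every row group of the candidate file.
    fn prune_with_fallback(
        total_row_groups: usize,
        index_prune: Option<HashSet<usize>>, // None => artifacts missing or still building
    ) -> PruneResult {
        let surviving_row_groups = match index_prune {
            Some(kept) => kept,
            None => (0..total_row_groups).collect(),
        };
        PruneResult { surviving_row_groups }
    }

    fn main() {
        // Artifacts not present yet: answer from the time window alone.
        let result = prune_with_fallback(6, None);
        assert_eq!(result.surviving_row_groups.len(), 6);
    }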

Observability

Each query is traced with spans such as plan_build, index_prune, fetch_parquet, execute_fusion, and serialize_result. Metrics include query latency, index hits, cache hit ratio, bytes read, and rows scanned. These signals show where time and I/O are spent and help tune cache, pruning, and concurrency.
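
A minimal sketch of the span structure with the tracing crate, using the span names listed above. Subscriber configuration and metric emission are simplified, and the real instrumentation may differ:

    use tracing::info_span;

    fn main() {
        // Stdout subscriber for the sketch; real deployments export to a tracer.
        tracing_subscriber::fmt::init();

        // One parent span per query, with a child span per stage.
        let query = info_span!("query", tenant = "t1");
        let _query_guard = query.enter();

        {
            let _s = info_span!("plan_build").entered();
            // parse, rewrite, extract the time window ...
        }
        {
            let _s = info_span!("index_prune").entered();
            // zonemap -> bloom/bitmap -> tantivy ...
        }
        {
            let _s = info_span!("fetch_parquet", bytes_read = 0_u64).entered();
            // tiered fetch ...
        }
        {
            let _s = info_span!("execute_fusion").entered();
            // DataFusion or Ballista execution ...
        }
        {
            let _s = info_span!("serialize_result", rows = 0_u64).entered();
            // Arrow or JSON streaming ...
        }
    }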

What to configure for this module

Parquet shaping (row groups + statistics): Set a large but consistent row_group_size and enable Parquet statistics so the Planner can prune early with zone maps and choose precise row groups to read. The reference configuration calls out row_group_size = 1,000,000 rows, enable_statistics = true, and defaults such as compression_codec = "zstd" and compression_level = 3.

Large row groups reduce metadata overhead, while statistics (min/max) are essential for the pruning cascade described above. Balance this against memory and downstream execution: a writer-side data_page_size_kb = 1024 increases write-time memory but can improve read-side I/O later.

Default time window (planner guardrail): If a user query has no explicit time predicate, the Planner applies a configurable default lookback to avoid scanning unbounded history.

Keep this window intentionally small for “hot” lookups so the system stays predictable under load; tune planner.default_time_window_seconds for sub-second paths when data is hot.

Fetch concurrency and fairness (bounded workers): Fetching should be parallel but bounded. The Fetcher uses a worker pool to read multiple files/row groups concurrently and ranged reads so it does not download whole files when only parts are needed.

Apply per-tenant and per-node read-concurrency limits to prevent noisy-neighbor effects and protect disks and networks. Prefetch adjacent row groups only when locality is likely. These controls keep latency stable without changing upstream ingestion or storage.

Tiered cache posture (Electro → Pyro → Cryo): Enforce the documented cache order, Electro (RAM) first via Arrow Flight, then Pyro (disk), then Cryo (object store), so the planner’s pruned parts are satisfied at the lowest possible latency. Keeping hot files in Electro and increasing cache retention for common time windows helps sub-second dashboards while reducing repeated cloud egress.

Execution parallelism (local vs distributed): For single-node runs, align DataFusion threads roughly with CPU cores and rely on projection and filter pushdown into Parquet readers. For heavy scans, translate to Ballista so workers execute in parallel across pruned file parts. This is an execution-side knob that directly affects query latency and cost without touching ingest or storage.

Keep settings coherent across stages: When you tune the query module, ensure related writer/rotation settings stay compatible so the planner continues to benefit: e.g., file_size_mb near parquet_split_size_mb to produce cloud-optimized files that scan efficiently post-pruning. These are storage knobs, but they influence read-path cost directly.
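
Gathered in one place, the knobs named in this section might be grouped as below. This is a sketch only: the struct and grouping are assumptions, the quoted values (row_group_size, enable_statistics, the compression settings, data_page_size_kb) come from the text above, and the remaining values are placeholders to tune per deployment:

    /// Illustrative grouping of the settings discussed in this section.
    #[derive(Debug)]
    struct QueryPipelineConfig {
        // Parquet shaping (writer side, read-path impact).
        row_group_size: u64,     // rows per row group
        enable_statistics: bool, // min/max stats feed zone-map pruning
        compression_codec: &'static str,
        compression_level: i32,
        data_page_size_kb: u32,

        // Planner guardrail (planner.default_time_window_seconds).
        default_time_window_seconds: u64,

        // Writer/rotation coherence that affects read cost.
        file_size_mb: u64,
        parquet_split_size_mb: u64,
    }

    impl Default for QueryPipelineConfig {
        fn default() -> Self {
            Self {
                row_group_size: 1_000_000,
                enable_statistics: true,
                compression_codec: "zstd",
                compression_level: 3,
                data_page_size_kb: 1024,
                default_time_window_seconds: 3_600, // placeholder: one hour, tune per deployment
                file_size_mb: 512,                  // placeholder: keep near parquet_split_size_mb
                parquet_split_size_mb: 512,         // placeholder
            }
        }
    }

    fn main() {
        let cfg = QueryPipelineConfig::default();
        println!("{cfg:#?}");
    }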

Why these knobs matter here: All of the above target the Query Pipeline’s three consistent goals: prune early (statistics + default window), fetch precisely (bounded concurrency, ranged reads, cache order), and execute in parallel (right-sized threads or Ballista). Adjusting them shapes latency, bytes scanned, and fairness without altering ingestion formats or storage lifecycles.