
Releases: demml/scouter

v0.25.0

26 Mar 10:34
3c509f3


v0.25.0 Release Summary

What Changed

This release fixes a race condition in the trace storage layer. The old deregister_table + register_table two-step left a window where concurrent queries would get "table not found." All TableProvider updates are now atomic swaps through a DashMap-backed custom catalog (TraceCatalogProvider). The summary engine also gets a refresh ticker so read pods pick up commits from the write pod without a restart.


Breaking Changes

No schema changes, no migration required. One minor API change: the ctx field on TraceSpanDBEngine is now private; use the new ctx() method instead. This only matters if you're constructing or testing the engine directly outside the service layer.


Changes

Trace storage: atomic TableProvider swaps

The built-in DataFusion catalog (datafusion.public) was replaced with TraceCatalogProvider, backed by a DashMap. All engines (TraceSpanDBEngine, TraceSummaryDBEngine, bifrost) now call catalog.swap(table_name, provider) instead of the deregister/register two-step.

Before:

self.ctx.deregister_table(TRACE_SPAN_TABLE_NAME)?;
self.ctx.register_table(TRACE_SPAN_TABLE_NAME, updated_table.table_provider().await?)?;

After:

let new_provider = updated_table.table_provider().await?;
self.catalog.swap(TRACE_SPAN_TABLE_NAME, new_provider);

DashMap::insert() is atomic. Concurrent readers see either the old provider or the new one, never a gap between them.

TraceSpanDBEngine and TraceSummaryDBEngine share the same TraceCatalogProvider via Arc. The span engine creates it; the summary service gets it through TraceSpanService::catalog. JOIN queries between trace_spans and trace_summaries work because both tables are in the same catalog.
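For illustration, here is a minimal sketch of the pattern (names simplified; the real TraceCatalogProvider also implements DataFusion's catalog traits so the SessionContext can resolve tables through it):

use std::sync::Arc;

use dashmap::DashMap;
use datafusion::datasource::TableProvider;

pub struct CatalogSketch {
    tables: DashMap<String, Arc<dyn TableProvider>>,
}

impl CatalogSketch {
    /// Atomically replace the provider for `name`. Readers already holding
    /// the old Arc keep using it; new lookups get the new provider.
    pub fn swap(&self, name: &str, provider: Arc<dyn TableProvider>) {
        self.tables.insert(name.to_string(), provider);
    }

    /// Lookup used during query planning. Unlike the deregister/register
    /// two-step, there is no window where the table is absent.
    pub fn get(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).map(|entry| entry.value().clone())
    }
}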

Trace storage: summary engine refresh ticker

TraceSummaryDBEngine now has a background refresh loop (same as the span engine) that calls update_incremental() on the Delta table and swaps the TableProvider when a new version is found.

The refresh interval is controlled by SCOUTER_TRACE_REFRESH_INTERVAL_SECS, which already exists in the server config. Values below 1 second are clamped up to 1 second, since tokio::time::interval panics on Duration::ZERO.
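A sketch of the clamping (assuming the interval is read as whole seconds; the function name is hypothetical):

use std::time::Duration;
use tokio::time::{interval, Interval};

fn refresh_ticker(configured_secs: u64) -> Interval {
    // tokio::time::interval panics if the period is Duration::ZERO,
    // so anything below 1 second is clamped up to 1 second.
    interval(Duration::from_secs(configured_secs.max(1)))
}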

test_distributed_refresh covers the two-pod case: writer commits summaries, reader with a 1s ticker picks them up in the next query.

DataFusion session construction: get_session_with_catalog

ObjectStore has a new get_session_with_catalog(catalog_name, schema_name) method. It sets the named catalog as the SQL default, so unqualified table names and ctx.table(name) calls resolve through it instead of datafusion.public.

get_session() is unchanged. build_session_config() is now a private helper shared by both paths.
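A sketch of what the new method plausibly does with DataFusion's session API (simplified; the real version also applies the shared build_session_config() settings):

use std::sync::Arc;

use datafusion::catalog::CatalogProvider;
use datafusion::execution::context::SessionContext;
use datafusion::prelude::SessionConfig;

fn session_with_catalog(
    catalog_name: &str,
    schema_name: &str,
    catalog: Arc<dyn CatalogProvider>,
) -> SessionContext {
    // Unqualified names like `trace_spans` now resolve through
    // catalog_name.schema_name instead of datafusion.public.
    let config = SessionConfig::new()
        .with_default_catalog_and_schema(catalog_name, schema_name);
    let ctx = SessionContext::new_with_config(config);
    ctx.register_catalog(catalog_name, catalog);
    ctx
}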

Bifrost engine: torn-write fix on refresh

The bifrost refresh path was calling table_provider() twice: once to update the write context, once for the catalog swap. If the second call failed, the write context would be updated but the catalog would not. Now there's one call, the result is shared, and the write context is only deregistered if the fetch succeeds.


Upgrading from v0.24.0

Nothing to do. The catalog wires up at server startup. SCOUTER_TRACE_REFRESH_INTERVAL_SECS already controls both the span and summary refresh rates.

v0.24.0

25 Mar 12:57
94afba6


v0.24.0 Release Summary

What Changed

This release introduces Bifrost, a Delta Lake-backed dataset storage and query system that turns Pydantic models into queryable tables. Combined with streamlined event queue infrastructure and expanded evaluation capabilities, v0.24.0 adds a production-grade data layer to Scouter for storing and querying high-volume records in AI applications.


Breaking Changes

None. No schema migrations, no database changes, no API removals. Existing drift, evaluation, and tracing functionality is unchanged.


Changes

Bifrost: Delta Lake Dataset Storage

Bifrost turns a Pydantic model into a Delta Lake table. You define the schema, push records through a gRPC queue, and query with SQL.

Write path:

  • DatasetProducer.insert(record) — serializes to JSON and sends via an unbounded channel (returns immediately, sub-microsecond latency).
  • Background batching — accumulates records until batch_size or scheduled_delay_secs triggers a flush.
  • Arrow serialization — DynamicBatchBuilder converts JSON rows to Arrow RecordBatch, injecting system columns automatically.
  • gRPC transport — batch sent to server as Arrow IPC bytes.
  • Delta Lake — server appends to table partitioned by scouter_partition_date.

Read path:

  • DatasetClient.sql(query) — sends SQL to server via gRPC.
  • DataFusion execution — full SQL support (joins, CTEs, window functions, aggregations).
  • Zero-copy delivery — results returned as Arrow IPC bytes.
  • Format conversion — call .to_arrow(), .to_polars(), .to_pandas() to convert.
  • Strict reads — DatasetClient.read() returns validated Pydantic model instances.

Schema validation (schema-on-write):

  • Pydantic JSON Schema → Arrow schema conversion, fingerprinted.
  • Fingerprint checked on every batch write.
  • Schema mismatch caught before data lands.
  • System columns injected automatically: scouter_created_at (microsecond timestamp), scouter_partition_date (Date32), scouter_batch_id (UUID v7).

Supported types:

  • Primitives: str, int, float, bool, datetime, date
  • Collections: Optional[T], List[T] (nested supported)
  • Enums: Enum → Dictionary(Int16, Utf8)
  • Nested models: BaseModel → Struct(...) (recursive, up to 32 levels)

Clients:

  • Bifrost — unified read/write (long-lived, call shutdown() on exit)
  • DatasetProducer — write only (background queue, call shutdown() on exit)
  • DatasetClient — read only (stateless queries bound to a table via TableConfig)

All clients use gRPC transport configured via GrpcConfig. See Bifrost docs for examples.

Event Queue Refactor

Queue infrastructure refactored for clarity and maintainability:

  • Queue traits and implementations reorganized in scouter-events.
  • DatasetQueue added for high-throughput dataset inserts.
  • Existing Kafka, RabbitMQ, and Redis adapters unchanged.

gRPC API Expansion

New gRPC endpoints for dataset operations:

  • CreateDataset — register a table with fingerprint validation.
  • InsertBatch — append Arrow IPC bytes to Delta Lake.
  • QueryDataset — execute SQL and return Arrow IPC results.
  • ReadDataset — read records matching a filter.

Protobuf definitions updated in scouter.grpc.v1.proto.

Evaluation Improvements

Agent assertions:

  • TraceAssertionTask added — assertions on OpenTelemetry spans fetched from Delta Lake.
  • trace_id added to agent assertion context (enables cross-span evaluation).

Test coverage:

  • New test_agent_assertion.py with 84 lines of test cases.
  • New test_eval_orchestrator.py with 173 lines covering offline eval orchestration.
  • Trace evaluator test expanded.

Documentation

A new Bifrost docs section has been added; see the Bifrost docs for usage examples.


Upgrading from v0.23.0

No action required.

  • Server: Standard build and deployment. No database migrations. Bifrost uses object storage (local, S3, GCS, Azure) configured via SCOUTER_STORAGE_URI.
  • Python client: Standard rebuild with make setup.project (rebuilds Rust extension).
  • Existing workflows: Drift, evaluation, and tracing work exactly as before.

To use Bifrost, define a Pydantic schema, create a TableConfig, and use Bifrost or DatasetProducer/DatasetClient to write and query data.

v0.23.0

18 Mar 17:46
9f2b77c


v0.23.0 Release Summary

What Changed

This release adds distributed Delta Lake support for the trace storage engine. In multi-pod deployments, reader pods now automatically pick up new data committed by writer pods without a restart. A new configurable refresh interval controls how often each pod refreshes its in-memory Delta table snapshot from shared object storage.


Breaking Changes

None. No schema changes, no migration required. The new SCOUTER_TRACE_REFRESH_INTERVAL_SECS env var defaults to 10 and requires no action for existing deployments.


Changes

Distributed trace storage: cross-pod Delta Lake refresh

Previously, each pod's TraceSpanDBEngine loaded the Delta table snapshot once at startup. In a multi-pod deployment sharing GCS/S3, reader pods would never see data committed by the writer pod until they restarted.

The engine's actor loop now runs a periodic refresh_table() tick alongside its existing command and compaction handlers. On each tick it calls update_incremental() on a cloned DeltaTable. If the version advanced, it deregisters and re-registers the DataFusion SessionContext table provider so subsequent queries return fresh results. If the incremental update fails (empty table, transient network error), the clone is discarded and the original table state is preserved.
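In sketch form (using the method names from the notes on deltalake's DeltaTable; error handling condensed):

use deltalake::DeltaTable;

/// Returns true when a newer version was loaded and the caller should
/// re-register the DataFusion table provider.
async fn refresh(table: &mut DeltaTable) -> bool {
    // Update a clone so a failed refresh cannot corrupt the live snapshot.
    let mut candidate = table.clone();
    match candidate.update_incremental(None).await {
        Ok(()) if candidate.version() > table.version() => {
            *table = candidate;
            true
        }
        // No new commits, or a transient failure (empty table, network):
        // the clone is discarded and the original table state is preserved.
        _ => false,
    }
}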

Key details:

| Setting | Default | Env var |
| --- | --- | --- |
| Refresh interval | 10 seconds | SCOUTER_TRACE_REFRESH_INTERVAL_SECS |

Set lower (e.g. 5) for faster cross-pod visibility at the cost of more object-store LIST calls. Set higher to reduce overhead when read latency is not critical.

The refresh runs independently on every pod — unlike compaction, there is no control-table mutual exclusion. Each pod refreshes its own in-memory snapshot.

Trace engine: safer incremental updates

All update_incremental calls in the engine (compaction, writes, optimizations) now call update_datafusion_session() after updating the table, ensuring the DataFusion SessionContext always has the correct object store registered. This fixes a class of bugs where DeltaScan::scan() could fail to resolve file URLs after a table update in cloud-backed deployments.

CI: release workflow fix

The release workflow tag comparison step now uses ${{ github.ref_name }} instead of $GITHUB_REF_NAME, fixing an interpolation issue where the tag name was not correctly resolved in the version check step.


Upgrading from v0.22.0

No action required. The refresh interval defaults to 10 seconds. To tune it, set SCOUTER_TRACE_REFRESH_INTERVAL_SECS in your environment.

v0.22.0

18 Mar 10:34
a5a0db3


v0.22.0 Release Summary

What Changed

This release ships the offline agent evaluation framework: EvalScenario, EvalScenarios, EvalRunner, and EvalOrchestrator let you define named test cases, run them against a live queue + tracer, and get structured pass/fail metrics with per-scenario breakdowns. The release also integrates AgentAssertionTask into the full offline eval pipeline and adds in-process span capture so scenarios can assert on traces without a running server.


Breaking Changes

Database migration required. A new column is added to scouter.genai_eval_record:

ALTER TABLE scouter.genai_eval_record
    ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];

The migration runs automatically on server startup via sqlx.


Changes

Offline Evaluation: EvalScenario / EvalScenarios / EvalRunner

Three new types form the core of the offline scenario framework:

EvalScenario — a single named test case. Holds a query string, optional context dict, tasks (assertion, LLM judge, trace, or agent), and a pass_threshold (float 0–1, default 1.0). Each scenario gets a stable UUID7 ID on construction.

from scouter.evaluate import EvalScenario, AgentAssertionTask, AgentAssertion

scenario = EvalScenario(
    name="tool_use_check",
    query="Search for recent AI papers",
    tasks=[
        AgentAssertionTask(
            id="search_called",
            assertion=AgentAssertion.tool_called("web_search"),
        )
    ],
    pass_threshold=1.0,
)

EvalScenarios — a collection of EvalScenario objects with internal state (datasets, contexts) populated by EvalRunner.collect_scenario_data() and results populated by EvalRunner.evaluate(). Not intended to be constructed manually; produced by EvalRunner.

EvalRunner — stateful engine that owns scenario definitions and GenAIEvalProfile references (as Arcs, same pattern as ScouterQueue). Two-phase API:

  1. collect_scenario_data(queue, tracer) — drains records and spans captured during scenario execution and associates them with scenarios by scenario_id tag.
  2. evaluate() — runs all tasks, returns ScenarioEvalResults.

Optionally call compare(baseline: ScenarioEvalResults) to produce a ScenarioComparisonResults with per-scenario deltas.

New result types:

| Type | Purpose |
| --- | --- |
| EvalMetrics | Aggregate pass rates: overall_pass_rate, scenario_pass_rate, dataset_pass_rates (per-alias) |
| ScenarioResult | Pass/fail + task results for one scenario |
| ScenarioEvalResults | All ScenarioResults + aggregate EvalMetrics |
| ScenarioDelta | Δ pass rate between two runs for one scenario |
| ScenarioComparisonResults | Full comparison across all scenarios with regression/improvement classification |

Offline Evaluation: EvalOrchestrator (Python)

EvalOrchestrator is a high-level Python wrapper that manages the full capture lifecycle so callers don't have to sequence enable_capture / disable_capture / collect_scenario_data manually.

from scouter.evaluate import EvalOrchestrator, EvalScenario, EvalScenarios

orchestrator = EvalOrchestrator(
    scenarios=EvalScenarios(scenarios=[...]),
    queue=queue,
    tracer=tracer,
    profiles={"agent": profile},
)

results: ScenarioEvalResults = orchestrator.run(agent_fn=my_agent)

agent_fn is Callable[[str], str] — takes a query, returns a response string. The orchestrator:

  1. Enables queue capture + local span capture.
  2. Iterates scenarios, sets scouter.eval.scenario_id in OTel baggage, calls agent_fn(scenario.query).
  3. Disables capture, calls EvalRunner.collect_scenario_data(), then evaluate().
  4. Returns ScenarioEvalResults.

Subclass EvalOrchestrator and override execute_scenario() to handle non-string responses or add lifecycle hooks.


AgentAssertionTask: full pipeline integration

AgentAssertionTask was previously standalone (via execute_agent_assertion_tasks()). It is now fully wired into the EvalDataset pipeline:

  • EvaluationTask::AgentAssertion variant added.
  • EvaluationTaskType::AgentAssertion variant added (serializes as "AgentAssertion").
  • TaskConfig::AgentAssertion deserializes from stored task JSON.
  • AssertionTasks.agent: Vec<AgentAssertionTask> — tasks are routed to this bucket when building datasets from TasksFile.
  • EvaluationTask::AgentAssertion participates in depends_on resolution.

New supporting types:

TokenUsage — structured token count from LLM responses. Fields: input_tokens, output_tokens, total_tokens (all Optional[int]). Exposed to Python as a #[pyclass].

AgentContextBuilder (Rust-internal) — normalizes vendor-specific LLM response formats into a standard structure before assertion evaluation. Auto-detects format:

  1. Pre-normalized (Scouter standard shape)
  2. OpenAI — choices[].message.tool_calls, usage, model
  3. Anthropic — content[] with ToolUseBlock, usage, model
  4. Google/Gemini — candidates[].content.parts[] with function_call
  5. Fallback tree walk

Path limits enforced: max 512 chars per path, max 32 segments.


In-process span capture

A new local capture mode lets tests assert on trace spans without a running Scouter server or Delta Lake backend. Spans are buffered in memory instead of forwarded to the transport.

Buffer capacity: 20,000 spans (CAPTURE_BUFFER_MAX). Writes beyond this limit are dropped with a warning.

Rust API (Tracer):

tracer.enable_local_capture()?;
// ... instrumented code ...
let spans = tracer.drain_local_spans()?;
let by_trace = tracer.get_local_spans_by_trace_ids(vec!["abc123...".into()])?;
tracer.disable_local_capture()?; // discards buffer

Python API (ScouterInstrumentor):

instrumentor.enable_local_capture()
# ... instrumented code ...
spans: list[TraceSpanRecord] = instrumentor.drain_local_spans()
spans_filtered = instrumentor.get_local_spans_by_trace_ids(["abc123..."])
instrumentor.disable_local_capture()

Module-level aliases also available: enable_local_span_capture, disable_local_span_capture, drain_local_span_capture.

disable_local_capture logs a warning and discards buffered spans if any remain.


EvalRecord: tags and trace_id stamping

Tags: EvalRecord now carries a tags: list[str] field in key=value format. Tags are persisted to PostgreSQL and returned in all query paths (get, paginated, archive).

record = EvalRecord(context={"response": "..."})
record.add_tag("environment", "staging")
record.add_tag("model", "gpt-4o")
# record.tags == ["environment=staging", "model=gpt-4o"]

trace_id at construction: EvalRecord(trace_id="<hex>") is now accepted. Previously trace_id could only be set after construction.

Automatic stamping in QueueBus — when an EvalRecord is inserted via ScouterQueue and has no trace_id, the bus checks the active OTel span context. If a valid span is active, its trace ID is stamped onto the record (both the Rust struct and the Python-side object). The Python object is updated via a mutable borrow; a warning is logged if the cast fails.

Scenario tag auto-injection — if OTel baggage contains scouter.eval.scenario_id, the bus appends "scouter.eval.scenario_id=<value>" to record.tags automatically. Tag values are validated: alphanumeric, hyphens, underscores, max 128 chars. Invalid values are dropped with a warning.
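A condensed sketch of both steps (QueueBus internals simplified; the record struct is a stand-in, and the exact opentelemetry accessor signatures vary by crate version):

use opentelemetry::baggage::BaggageExt;
use opentelemetry::trace::TraceContextExt;
use opentelemetry::Context;

struct RecordSketch {
    trace_id: Option<String>,
    tags: Vec<String>,
}

fn stamp(record: &mut RecordSketch) {
    let cx = Context::current();

    // Stamp the active trace ID if the record has none.
    let span_cx = cx.span().span_context().clone();
    if record.trace_id.is_none() && span_cx.is_valid() {
        record.trace_id = Some(span_cx.trace_id().to_string());
    }

    // Inject the scenario tag from OTel baggage, validating the value.
    if let Some(value) = cx.baggage().get("scouter.eval.scenario_id") {
        let v = value.to_string();
        if v.len() <= 128
            && v.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
        {
            record.tags.push(format!("scouter.eval.scenario_id={v}"));
        } // invalid values are dropped (the real bus logs a warning)
    }
}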


ScouterQueue: offline record capture

New methods for offline use (mirroring the local span capture API):

queue.enable_capture()   # buffer EvalRecords in memory in addition to sending
queue.disable_capture()  # stop buffering and discard buffered records
queue.drain_records("alias")    # drain records from one queue by alias
queue.drain_all_records()       # drain from all queues, keyed by alias

Capture is off by default. Enabling it has negligible overhead; records are still forwarded to the normal transport.

GenAIEvalProfile references are now stored as Arc<GenAIEvalProfile> inside ScouterQueue.profiles, so EvalScenarios can share ownership without cloning.

shutdown() now releases the GIL during the 250ms wait periods, preventing deadlocks in multi-threaded Python programs.


Server: debug trace endpoint

A new diagnostic route returns the 10 most recent traces from the past 24 hours:

GET /scouter/trace/debug/recent

Returns the same TracePaginationResponse as the paginated trace query. Intended for local debugging and health verification; not authenticated differently from other trace routes.


Upgrading from v0.21.1

  1. Apply the database migration. It runs automatically on server startup. If you run migrations manually, execute:

    ALTER TABLE scouter.genai_eval_record
        ADD COLUMN IF NOT EXISTS tags TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[];
  2. No other action required. All API changes are additive. Existing EvalRecord, ScouterQueue, and tracing usage continues to work without modification.

v0.21.1

12 Mar 20:22
4e3ec57


v0.21.1 Release Summary

What Changed

This release renames the GenAI evaluation result types to remove the GenAI prefix. The names GenAIEvalResults, GenAIEvalSet, GenAIEvalTaskResult, and GenAIEvalResultSet have been replaced with shorter, framework-agnostic equivalents. No behavior or schema changes are included.


Breaking Changes

All four GenAI eval result types have been renamed. Any code importing or referencing these types must be updated.

| Old name | New name |
| --- | --- |
| GenAIEvalResults | EvalResults |
| GenAIEvalSet | EvalSet |
| GenAIEvalTaskResult | EvalTaskResult |
| GenAIEvalResultSet | EvalResultSet |

This affects:

  • Python imports from scouter or scouter.evaluate
  • Type annotations referencing these classes
  • Any serialized JSON being deserialized back into these types via model_validate_json (type name is not embedded in the JSON payload, so existing stored results are unaffected)

Changes

GenAI evaluation — type renaming

The GenAI prefix has been dropped from four result container types. The change is purely nominal — fields, methods, and return values are identical. The rename touches the Rust crates (scouter_types, scouter_evaluate, scouter_dataframe, scouter_client, scouter_sql), PyO3 bindings, Python stubs, and the public scouter.evaluate module.

Drifter.compute_drift() return type annotation updated from GenAIEvalResultSet to EvalResultSet. EvalDataset.evaluate() return type updated from GenAIEvalResults to EvalResults.

Tracing — storage architecture documentation

Added py-scouter/docs/docs/tracing/storage-architecture.md documenting the Delta Lake + DataFusion write/query pipeline, component reference table, and the dual-actor buffer/engine design.


Upgrading from v0.21.0

  1. Replace all imports of the renamed types:

    # Before
    from scouter.evaluate import (
        GenAIEvalResults,
        GenAIEvalSet,
        GenAIEvalTaskResult,
        GenAIEvalResultSet,
    )
    
    # After
    from scouter.evaluate import (
        EvalResults,
        EvalSet,
        EvalTaskResult,
        EvalResultSet,
    )

    The top-level scouter namespace export has changed the same way:

    # Before
    from scouter import GenAIEvalResults
    # After
    from scouter import EvalResults
  2. Update any type annotations that reference the old names.

  3. No database migrations, no environment variable changes, no serialization format changes.

v0.21.0

12 Mar 17:00


v0.21.0 Release Summary

What Changed

v0.21.0 adds a read-side caching layer over cloud object stores and tunes the DataFusion SessionContext for higher-concurrency GCS/S3 workloads. A bug where the trace summary table skipped vacuum after compaction is fixed. An internal record type rename is propagated across all crates.


Breaking Changes

None. No schema changes, no migration required.


Changes

Object store caching layer (CachingStore)

A new CachingStore<T: ObjectStore> wrapper in scouter_dataframe caches head() responses and small get_range() reads (≤2 MB) from cloud object stores.

After Delta Lake Z-ORDER compaction, Parquet files are immutable — the same path always returns the same bytes. DataFusion issues repeated HEAD + footer range reads on every query. Without caching, each read is a separate cloud round-trip (~30–60 ms on GCS). CachingStore eliminates these by serving repeated reads from an in-process mini_moka cache.

Cache configuration:

| Setting | Default | Env var |
| --- | --- | --- |
| Max cache size | 64 MB | SCOUTER_OBJECT_CACHE_MB |
| TTL | 1 hour | — |
| Max cacheable range read | 2 MB | — |

All mutating and streaming operations (put, delete, list, get for large ranges) pass through to the inner store uncached.
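A sketch of the read-path decision, keyed by path plus byte range (the real CachingStore implements the full object_store ObjectStore trait and delegates everything else to the inner store):

use std::ops::Range;
use std::time::Duration;

use bytes::Bytes;
use mini_moka::sync::Cache;

const MAX_CACHEABLE_RANGE: usize = 2 * 1024 * 1024; // 2 MB

struct RangeCache {
    ranges: Cache<(String, Range<usize>), Bytes>,
}

impl RangeCache {
    fn new(max_bytes: u64) -> Self {
        let ranges = Cache::builder()
            .max_capacity(max_bytes) // bounded by SCOUTER_OBJECT_CACHE_MB
            .weigher(|_k: &(String, Range<usize>), v: &Bytes| v.len() as u32)
            .time_to_live(Duration::from_secs(3600)) // 1 hour TTL
            .build();
        Self { ranges }
    }

    /// Serve a small, immutable range read from cache; fall back to the
    /// cloud round-trip (stubbed here as a closure) on a miss.
    fn get_range(&self, path: &str, range: Range<usize>, fetch: impl FnOnce() -> Bytes) -> Bytes {
        let key = (path.to_string(), range);
        if let Some(hit) = self.ranges.get(&key) {
            return hit; // repeated footer/HEAD reads: no network call
        }
        let bytes = fetch();
        if bytes.len() <= MAX_CACHEABLE_RANGE {
            self.ranges.insert(key, bytes.clone());
        }
        bytes
    }
}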

DataFusion SessionContext tuning

The shared SessionContext used for trace queries now includes explicit read-path and write-path settings:

| Setting | Old | New | Why |
| --- | --- | --- | --- |
| metadata_size_hint | 512 KB | 1 MB | Captures bloom filter + footer + column indexes in one GCS round-trip instead of the default multi-step chain |
| bloom_filter_on_read | default | true | Activates bloom filters on trace_id and entity_id to skip non-matching row groups before decoding |
| schema_force_view_types | default | true | Zero-copy Utf8View/BinaryView; prevents DataFusion from downgrading these on read-back from Parquet |
| meta_fetch_concurrency | 32 | 64 | Parallel HEAD stats during Delta log replay; matches pool_max_idle_per_host |
| maximum_parallel_row_group_writers | default | 4 | Concurrent row group encoding during compaction and flush |
| maximum_buffered_record_batches_per_stream | default | 8 | Smooths bursty reads from GCS |
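In code, the equivalent typed-config settings look roughly like this (field paths as of recent DataFusion releases; verify against the pinned version):

use datafusion::prelude::SessionConfig;

fn tuned_config() -> SessionConfig {
    let mut config = SessionConfig::new();
    let opts = config.options_mut();
    opts.execution.parquet.metadata_size_hint = Some(1024 * 1024); // 1 MB
    opts.execution.parquet.bloom_filter_on_read = true;
    opts.execution.parquet.schema_force_view_types = true;
    opts.execution.meta_fetch_concurrency = 64;
    opts.execution.parquet.maximum_parallel_row_group_writers = 4;
    opts.execution.parquet.maximum_buffered_record_batches_per_stream = 8;
    config
}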

Connection pool tuning

Cloud object store HTTP client settings updated:

| Setting | Old | New |
| --- | --- | --- |
| pool_max_idle_per_host | 16 | 64 |
| pool_idle_timeout | 90s | 120s |
| Request timeout | — | 30s |
| Connect timeout | — | 5s |

Bug fix: vacuum missing after summary optimize

TraceSummaryDBEngine::run_maintenance() called optimize_table() but not vacuum_table() afterward. Compaction tombstones old Parquet files; without an immediate vacuum those files remain on storage until the next scheduled vacuum cycle.

Fixed to vacuum immediately after a successful optimize:

Ok(()) => {
    if let Err(e) = self.vacuum_table(0).await {
        error!("Post-optimize vacuum failed: {}", e);
    }
    // release task ...
}

This matches the existing behavior in TraceSpanDBEngine.

Internal record type rename (PR #221)

Internal record type renamed across scouter_client, scouter_drift, scouter_evaluate, scouter_events, scouter_server, and py-scouter. No public API change for Python users — stub files updated.


Upgrading from v0.20.0

No action required. All changes are additive or internal.

SCOUTER_OBJECT_CACHE_MB is optional. The default (64 MB) is appropriate for most deployments. Increase it if you have many concurrent readers querying large numbers of Parquet files.

v0.20.0

11 Mar 19:52
aa7ba7d


v0.20.0 Release Summary

What Changed

v0.20.0 removes PostgreSQL from the trace/span pipeline entirely. All trace ingestion, storage, querying, and maintenance now runs through DataFusion and Delta Lake only. This release also adds a distributed coordination layer for multi-pod compaction, a pre-aggregated trace summary table for fast listing, cloud storage fixes for GCS/S3/Azure, and three new tuning env vars.


Breaking Changes

The PostgreSQL trace schema is no longer used. Any tooling, migrations, or queries that read trace data from PostgreSQL will stop working after upgrading. The scouter_sql aggregator is now a thin forwarding layer; all trace reads and writes go through Delta Lake via scouter_dataframe.


Changes

Traces now fully on Delta Lake + DataFusion

PostgreSQL has been removed from the trace read/write path. The architecture is now:

gRPC / HTTP ingest → in-memory buffer (actor) → Delta Lake (span table + summary table)
                                                         ↑
                                              DataFusion query engine

The scouter_sql aggregator retains its interface for compatibility but no longer writes span data to PostgreSQL.

Trace summary table

A new trace_summaries Delta Lake table stores one row per trace with pre-computed fields:

| Column | Type | Description |
| --- | --- | --- |
| trace_id | FixedSizeBinary(16) | Trace identifier |
| service_name | Dictionary(Int32, Utf8) | Service that produced the root span |
| root_operation | Utf8 | Name of the root span |
| start_time / end_time | Timestamp(µs, UTC) | Trace wall-clock bounds |
| duration_ms | Int64 | End-to-end latency in milliseconds |
| span_count | Int64 | Total spans in the trace |
| error_count | Int64 | Spans with error status |
| search_blob | Utf8 | Concatenated attribute text for full-text search |
| entity_ids | List<Utf8> | Application entity IDs attached to the trace |
| queue_ids | List<Utf8> | Queue message IDs attached to the trace |

This table is partitioned by partition_date (Date32). Listing traces and applying filters no longer requires scanning the full span table.

Distributed compaction control table

A new _scouter_control Delta Lake table coordinates compaction, retention, and vacuum tasks across pods. Each task (summary_optimize, etc.) has a single row with idle/processing status, a pod_id, and a next_run_at timestamp. Locks older than 30 minutes are automatically reclaimed.

This prevents multiple pods from running conflicting Z-ORDER optimize operations simultaneously against shared object storage.
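A sketch of the claim rule (status, pod_id, and next_run_at come from the description above; locked_at is an assumed timestamp column, and the real implementation enforces this through conditional Delta commits):

use chrono::{DateTime, Duration, Utc};

struct ControlRow {
    status: String, // "idle" | "processing"
    pod_id: String, // owner recorded for observability
    locked_at: DateTime<Utc>,
    next_run_at: DateTime<Utc>,
}

fn can_claim(row: &ControlRow, now: DateTime<Utc>) -> bool {
    // Claim when the task is idle and due, or reclaim when another pod
    // has held the lock for more than 30 minutes (e.g. it crashed).
    let stale = now - row.locked_at > Duration::minutes(30);
    (row.status == "idle" && now >= row.next_run_at)
        || (row.status == "processing" && stale)
}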

New attribute search UDF

A custom DataFusion scalar UDF (match_attr_expr) enables full-text attribute search against the search_blob column. This replaces SQL LIKE patterns that required per-attribute column scans.

// DataFusion query predicate
match_attr_expr(col("search_blob"), lit("user_id=abc123"))
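The UDF's core matching logic, sketched (assumption: search_blob holds newline-separated key=value tokens; the actual delimiter and UDF registration live in scouter_dataframe):

fn match_attr(search_blob: &str, needle: &str) -> bool {
    // Exact token match rather than raw substring search, so
    // "user_id=abc" does not also match "user_id=abc123".
    search_blob.lines().any(|token| token == needle)
}

#[test]
fn matches_exact_tokens_only() {
    let blob = "user_id=abc123\nregion=us-east1";
    assert!(match_attr(blob, "user_id=abc123"));
    assert!(!match_attr(blob, "user_id=abc"));
}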

New trace query API routes

Two new HTTP endpoints were added to scouter-server:

  • GET /traces/:trace_id/spans — returns all spans for a specific trace ID
  • POST /traces/spans/filter — returns spans matching TraceFilters (service name, time range, attribute values, entity ID, etc.)

Typed DataFusion predicates for Parquet pruning

Query helpers ts_lit() and date_lit() emit typed Timestamp(Microsecond, UTC) and Date32 literals. These match column types exactly, enabling Parquet row-group min/max pruning and partition directory skipping without type coercion overhead.
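Sketched helpers (names from the notes; the bodies are an assumption built on standard DataFusion literals, and the timezone argument type varies slightly across DataFusion versions):

use std::sync::Arc;

use datafusion::common::ScalarValue;
use datafusion::prelude::{lit, Expr};

fn ts_lit(micros: i64) -> Expr {
    // Matches Timestamp(Microsecond, UTC) columns exactly, so row-group
    // min/max pruning applies without a runtime cast.
    lit(ScalarValue::TimestampMicrosecond(Some(micros), Some(Arc::from("UTC"))))
}

fn date_lit(days_since_epoch: i32) -> Expr {
    // Matches Date32 partition columns, enabling directory skipping.
    lit(ScalarValue::Date32(Some(days_since_epoch)))
}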

Cloud storage fixes

  • GCS / S3 / Azure: storage_root() now correctly extracts only the bucket name from URIs like gs://my-bucket/path/to/prefix. Previously returned the full path after stripping the scheme prefix, causing object store initialization failures.
  • Azure: Fixed path construction for Delta table locations.
  • PassthroughLogStoreFactory added for cloud log store registration when using GCS.

Span schema changes

Columns removed from the span table:

  • root_span_id — derivable from the summary table
  • depth, span_order, path — unused by query layer

Columns added:

  • search_blob — concatenated attribute text for UDF-based search
  • queue_ids — list of queue message IDs

New configuration env vars

| Variable | Default | Description |
| --- | --- | --- |
| SCOUTER_TRACE_COMPACTION_INTERVAL_HOURS | 24 | How often Delta Lake Z-ORDER optimize runs for trace tables |
| SCOUTER_TRACE_FLUSH_INTERVAL_SECS | 5 | How often the in-memory span buffer flushes to Delta Lake |
| SCOUTER_TRACE_BUFFER_SIZE | 10000 | Span buffer capacity before a forced flush |

Larger SCOUTER_TRACE_BUFFER_SIZE values reduce the number of small Parquet files written to cloud storage but increase the window of data that could be lost on a crash.
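A sketch of the two flush triggers (actor loop condensed; flush stands in for the Delta Lake append):

use tokio::sync::mpsc::Receiver;
use tokio::time::Interval;

async fn buffer_loop<T>(
    mut rx: Receiver<T>,
    mut flush_tick: Interval, // SCOUTER_TRACE_FLUSH_INTERVAL_SECS
    buffer_size: usize,       // SCOUTER_TRACE_BUFFER_SIZE
    flush: impl Fn(&mut Vec<T>),
) {
    let mut buffer: Vec<T> = Vec::with_capacity(buffer_size);
    loop {
        tokio::select! {
            maybe_span = rx.recv() => match maybe_span {
                Some(span) => {
                    buffer.push(span);
                    if buffer.len() >= buffer_size {
                        flush(&mut buffer); // forced flush at capacity
                    }
                }
                None => break, // senders dropped; final flush below
            },
            _ = flush_tick.tick() => {
                if !buffer.is_empty() {
                    flush(&mut buffer); // periodic flush
                }
            }
        }
    }
    flush(&mut buffer);
}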


Upgrading from v0.19.0

  1. Remove any direct PostgreSQL queries against trace tables. These tables may still exist but are no longer written to.
  2. Set SCOUTER_STORAGE_URI to a writable location (local path, s3://, gs://, or az://). This was required in v0.19.0 for spans and is now required for summaries and the control table as well.
  3. On first startup, the server creates the trace_summaries and _scouter_control Delta tables automatically. No migration script is needed.
  4. If running multiple server replicas, all replicas must share the same SCOUTER_STORAGE_URI. The control table coordinates cross-pod compaction; replicas pointing at different storage paths will not coordinate.

v0.19.0

04 Mar 01:52
a323b3b


What's Changed

Full Changelog: v0.18.0...v0.19.0

v0.18.0

03 Mar 02:10
b82477a


What's Changed

v0.18.0 Release Notes

Server Deployment Modes

  • The server binary now accepts a --mode flag to run HTTP only, gRPC only, or both (default). This enables independent horizontal scaling of each protocol — deploy gRPC close to high-throughput queue producers and HTTP separately for the REST API.

Full Changelog: v0.17.1...v0.18.0

v0.17.1

01 Mar 20:04
9cd3367


  • Patch to fix wheel testing