Snapshot: Add optional rosbag recording for time-window capture #120

@mfaferek93

Summary

Currently, snapshots capture only a single message per topic at the moment of fault confirmation, stored as JSON in SQLite. This limits debugging because nothing shows what happened before the fault occurred.

Add optional rosbag2 integration to capture a time window of topic data (e.g., 5 seconds before the fault + 1 second after), enabling "black box"-style recording for post-mortem analysis.


Proposed solution (optional)

Tiered configuration approach

Level 1 (current, unchanged): JSON snapshots - single message per topic, stored in SQLite.

Level 2 (new, opt-in): Simple rosbag - enable with rosbag.enabled: true (under snapshots); every other rosbag key falls back to its default.

Level 3 (new, advanced): Custom rosbag config - full control over duration, topics, format, storage.
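
The tiers could be modelled as one config object whose defaults cover Level 2 and whose fields can be overridden for Level 3. The sketch below is only an illustration: the `RosbagConfig` class and `load_rosbag_config` helper are hypothetical names, and the default values are copied from the configuration schema in this issue.

```python
from dataclasses import dataclass, field, fields
from typing import List, Union


@dataclass
class RosbagConfig:
    """Hypothetical container; defaults mirror the proposed schema."""
    enabled: bool = False
    duration_sec: float = 5.0
    duration_after_sec: float = 1.0
    topics: Union[str, List[str]] = "config"
    include_topics: List[str] = field(default_factory=list)
    exclude_topics: List[str] = field(default_factory=list)
    lazy_start: bool = False
    format: str = "sqlite3"
    storage_path: str = ""
    max_bag_size_mb: int = 50
    max_total_storage_mb: int = 500
    auto_cleanup: bool = True


def load_rosbag_config(raw: dict) -> RosbagConfig:
    """Overlay user-supplied keys on the defaults and reject unknown keys."""
    known = {f.name for f in fields(RosbagConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown rosbag keys: {sorted(unknown)}")
    cfg = RosbagConfig(**raw)
    if cfg.format not in ("sqlite3", "mcap"):
        raise ValueError(f"unsupported format: {cfg.format!r}")
    return cfg


level1 = load_rosbag_config({})                 # rosbag disabled, JSON only
level2 = load_rosbag_config({"enabled": True})  # rosbag with defaults
level3 = load_rosbag_config(                    # rosbag with custom values
    {"enabled": True, "duration_sec": 10.0, "format": "mcap"}
)
```

Rejecting unknown keys is a design choice made here so that a typo in a key name fails loudly instead of silently falling back to a default.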

Configuration schema

snapshots:
  enabled: true

  # === Existing JSON config (unchanged) ===
  default_topics: ["/odom", "/cmd_vel"]
  config_file: "snapshots.yaml"

  # === New rosbag config ===
  rosbag:
    enabled: false                 # opt-in

    # Time window
    duration_sec: 5.0              # seconds before fault (default 5s)
    duration_after_sec: 1.0        # seconds after CONFIRMED

    # Topics: "config" (reuse JSON config) | "all" | [explicit list]
    topics: "config"
    include_topics: []             # add to resolved list
    exclude_topics: []             # remove from list

    # Performance tuning
    lazy_start: false              # true = start buffer only on PREFAILED

    # Storage
    format: "sqlite3"              # "sqlite3" | "mcap"
    storage_path: ""               # empty = temp dir
    max_bag_size_mb: 50
    max_total_storage_mb: 500
    auto_cleanup: true             # delete bag when fault CLEARED
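
One way the topics / include_topics / exclude_topics keys might combine is sketched below. `resolve_topics` is a hypothetical helper, and two behaviours are assumptions not stated in this issue: exclude_topics wins over include_topics, and "all" means whatever topic list discovery returns at resolution time.

```python
from typing import Iterable, List, Union


def resolve_topics(
    topics: Union[str, List[str]],
    json_config_topics: Iterable[str],
    discovered_topics: Iterable[str],
    include_topics: Iterable[str] = (),
    exclude_topics: Iterable[str] = (),
) -> List[str]:
    """Return the ordered, de-duplicated list of topics to record."""
    if topics == "config":
        base = list(json_config_topics)    # reuse the JSON snapshot topics
    elif topics == "all":
        base = list(discovered_topics)     # assumption: supplied by discovery
    elif isinstance(topics, list):
        base = list(topics)                # explicit list
    else:
        raise ValueError(f"invalid topics value: {topics!r}")

    # Add include_topics, dropping duplicates while keeping first-seen order.
    merged = list(dict.fromkeys([*base, *include_topics]))

    # Assumption: exclusion is applied last, so it overrides inclusion.
    excluded = set(exclude_topics)
    return [t for t in merged if t not in excluded]


resolved = resolve_topics(
    "config",
    json_config_topics=["/odom", "/cmd_vel"],
    discovered_topics=[],
    include_topics=["/imu"],
    exclude_topics=["/cmd_vel"],
)
```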

Architecture

  NORMAL → Ring buffer running (lazy_start: false)
                ↓
  PREFAILED → Continue buffering (start buffering here if lazy_start: true)
                ↓
  CONFIRMED → 1. JSON snapshot (existing)
              2. Flush ring buffer to .mcap/.db3
              3. Record duration_after_sec more
              4. Close bag, store path in DB
                ↓
  CLEARED → auto_cleanup: true → delete bag file
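
The flow above can be sketched as a time-bounded ring buffer plus a small amount of state. This is a pure-Python illustration under stated assumptions, not the rosbag2-based implementation: `TimeWindowBuffer` is a hypothetical class, the "bag" is an in-memory list rather than a .mcap/.db3 file, and lazy_start and cleanup on CLEARED are not modelled.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple

Message = Tuple[float, str, Any]  # (timestamp_sec, topic, payload)


class TimeWindowBuffer:
    """Keeps the last duration_sec of messages, then records a post-fault tail."""

    def __init__(self, duration_sec: float = 5.0, duration_after_sec: float = 1.0):
        self.duration_sec = duration_sec
        self.duration_after_sec = duration_after_sec
        self._buffer: Deque[Message] = deque()
        self._confirmed_at: Optional[float] = None
        self._bag: List[Message] = []   # stand-in for the bag writer
        self.closed = False

    def on_message(self, stamp: float, topic: str, payload: Any) -> None:
        if self.closed:
            return
        if self._confirmed_at is None:
            # NORMAL / PREFAILED: buffer and evict anything older than the window.
            self._buffer.append((stamp, topic, payload))
            while self._buffer and stamp - self._buffer[0][0] > self.duration_sec:
                self._buffer.popleft()
        elif stamp <= self._confirmed_at + self.duration_after_sec:
            # CONFIRMED: keep recording until the post-fault window ends.
            self._bag.append((stamp, topic, payload))
        else:
            # First message past the post-fault window closes the bag.
            self.closed = True

    def on_confirmed(self, stamp: float) -> None:
        """Flush the pre-fault window into the bag and start the post-fault tail."""
        self._confirmed_at = stamp
        self._bag = [m for m in self._buffer if stamp - m[0] <= self.duration_sec]
        self._buffer.clear()

    @property
    def bag(self) -> List[Message]:
        return list(self._bag)


buf = TimeWindowBuffer(duration_sec=5.0, duration_after_sec=1.0)
for t in range(10):                      # 1 Hz messages at t = 0..9
    buf.on_message(float(t), "/odom", t)
buf.on_confirmed(9.5)                    # fault confirmed at t = 9.5
for t in (10.0, 10.4, 11.0):
    buf.on_message(t, "/odom", t)
stamps = [m[0] for m in buf.bag]
```

In this run the bag ends up with the messages stamped 5 through 9 (the 5 seconds before confirmation) plus 10.0 and 10.4 (inside the 1-second tail); the message at 11.0 falls outside the window and closes the bag.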

REST API extension

GET /api/v1/faults/{code}/snapshots
Response includes:

{
  "topics": { ... },
  "rosbag": {
    "available": true,
    "duration_sec": 6.0,
    "size_bytes": 2456789,
    "download_url": "/api/v1/faults/{code}/snapshots/bag"
  }
}

GET /api/v1/faults/{code}/snapshots/bag
→ Returns bag file download
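
A possible way to build the "rosbag" part of the response is sketched below. `rosbag_section` is a hypothetical helper and rests on assumptions this issue does not confirm: the bag is a single file on disk, the reported duration_sec is the sum of the pre-fault and post-fault windows (consistent with 5.0 + 1.0 = 6.0 in the example), and a missing bag is reported as "available": false with no other fields.

```python
import os
from typing import Optional


def rosbag_section(
    fault_code: str,
    bag_path: Optional[str],
    duration_sec: float,
    duration_after_sec: float,
) -> dict:
    """Build the 'rosbag' object for GET /api/v1/faults/{code}/snapshots."""
    if not bag_path or not os.path.isfile(bag_path):
        # Assumption: no bag recorded, or already removed by auto_cleanup.
        return {"available": False}
    return {
        "available": True,
        "duration_sec": duration_sec + duration_after_sec,
        "size_bytes": os.path.getsize(bag_path),
        "download_url": f"/api/v1/faults/{fault_code}/snapshots/bag",
    }
```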


Additional context (optional)

Key design decisions

  • lazy_start defaults to false: basic configs go from PREFAILED to CONFIRMED instantly, so a lazily started buffer would contain no pre-fault data
  • duration_sec defaults to 5.0: balances usefulness against RAM usage
  • format defaults to sqlite3: easier to inspect during development; MCAP is available as an optimization
  • topics defaults to "config": reuses the existing JSON topic config, so no extra setup is needed

Risk mitigations

  • RAM explosion with "all" topics: document a warning and recommend exclude_topics for high-bandwidth topics such as cameras
  • Storage explosion: bounded by max_bag_size_mb, max_total_storage_mb, and auto_cleanup
  • Always-on overhead: the lazy_start: true option defers buffering until PREFAILED on resource-constrained systems
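
The max_total_storage_mb limit could be enforced with a pass over the storage directory after each bag is closed. The sketch below assumes an oldest-first eviction policy and one file per bag; neither is specified in this issue, and `enforce_storage_limit` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def enforce_storage_limit(storage_dir: Path, max_total_storage_mb: float) -> List[Path]:
    """Delete the oldest bag files until the directory fits the budget.

    Returns the paths that were removed.
    """
    limit_bytes = max_total_storage_mb * 1024 * 1024
    # Oldest first, by modification time (assumed eviction policy).
    bags = sorted(
        (p for p in storage_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    total = sum(p.stat().st_size for p in bags)

    removed: List[Path] = []
    for bag in bags:
        if total <= limit_bytes:
            break
        total -= bag.stat().st_size
        bag.unlink()
        removed.append(bag)
    return removed
```

A real implementation would probably also need to skip bags that belong to faults still under investigation; that decision is left open here.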

Dependencies

  • rosbag2_cpp for ring buffer and writing
  • rosbag2_storage_mcap (optional) for MCAP format

Labels: enhancement