Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
268 changes: 268 additions & 0 deletions docs/tutorials/snapshots.rst
Original file line number Diff line number Diff line change
Expand Up @@ -230,8 +230,276 @@ Troubleshooting
- Check topic resolution priority (fault_specific > patterns > default)
- Verify regex patterns in config file are correct

Rosbag Capture (Time-Window Recording)
--------------------------------------

In addition to JSON snapshots, you can enable **rosbag capture** for "black box"
style recording. This continuously buffers messages in memory and flushes them
to a bag file when a fault is confirmed.

**Key differences from JSON snapshots:**

.. list-table::
:widths: 25 35 40
:header-rows: 1

* - Feature
- JSON Snapshots
- Rosbag Capture
* - Data format
- JSON (human-readable)
- Binary (native ROS 2)
* - Time coverage
- Point-in-time (at confirmation)
- Time window (before + after fault)
* - Message fidelity
- Converted to JSON
- Original serialization preserved
* - Playback
- N/A
- ``ros2 bag play``
* - Default
- Enabled
- Disabled

Enabling Rosbag Capture
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
-p snapshots.rosbag.enabled:=true \
-p snapshots.rosbag.duration_sec:=5.0 \
-p snapshots.rosbag.duration_after_sec:=1.0

This captures 5 seconds of data **before** the fault and 1 second **after**.

Rosbag Configuration Options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
:widths: 35 15 50
:header-rows: 1

* - Parameter
- Default
- Description
* - ``snapshots.rosbag.enabled``
- ``false``
- Enable rosbag capture. When enabled, the system continuously buffers
messages in memory and writes them to a bag file when faults are confirmed.
* - ``snapshots.rosbag.duration_sec``
- ``5.0``
- Ring buffer duration in seconds. This determines how much history is
preserved before the fault confirmation. Larger values provide more
context but consume more memory.
* - ``snapshots.rosbag.duration_after_sec``
- ``1.0``
- Post-fault recording duration. After a fault is confirmed, recording
continues for this many seconds to capture immediate system response.
* - ``snapshots.rosbag.topics``
- ``"config"``
- Topic selection mode:

- ``"config"`` - Use same topics as JSON snapshots (from config file)
- ``"all"`` or ``"auto"`` - Auto-discover and record all available topics
- ``"explicit"`` - Use only topics from ``include_topics`` list
* - ``snapshots.rosbag.include_topics``
- ``[]``
- Explicit list of topics to record (only used when ``topics: "explicit"``).
Example: ``["/odom", "/joint_states", "/cmd_vel"]``
* - ``snapshots.rosbag.exclude_topics``
- ``[]``
- Topics to exclude from recording (applies to all modes). Useful for
filtering high-bandwidth topics like camera images.
* - ``snapshots.rosbag.format``
- ``"sqlite3"``
- Bag storage format: ``"sqlite3"`` (default, widely compatible) or
``"mcap"`` (more efficient compression, requires plugin).
* - ``snapshots.rosbag.storage_path``
- ``""``
- Directory for bag files. Empty string uses system temp directory
(``/tmp``). Bags are named ``fault_{code}_{timestamp}/``.
* - ``snapshots.rosbag.auto_cleanup``
- ``true``
- Automatically delete bag files when faults are cleared. Set to ``false``
to retain bags for manual analysis.
* - ``snapshots.rosbag.lazy_start``
- ``false``
- Controls when the ring buffer starts recording. See diagram below.
* - ``snapshots.rosbag.max_bag_size_mb``
- ``50``
- Maximum size per bag file in MB. When exceeded, rosbag2 creates
additional segment files.
* - ``snapshots.rosbag.max_total_storage_mb``
- ``500``
- Total storage limit for all bag files. Oldest bags are automatically
deleted when this limit is exceeded.

Understanding lazy_start Mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``lazy_start`` parameter controls when the ring buffer starts recording:

**lazy_start: false (default)** - Recording starts immediately at node startup.
Best for development and when you need maximum context for any fault.

.. uml::

@startuml
skinparam backgroundColor transparent
participant "Ring Buffer" as RB
participant "Fault Manager" as FM

== Node Startup ==
RB -> RB : Start recording
note right of RB #LightGreen : Buffer active\n(continuous)

... time passes (messages buffered) ...

== Fault Detected ==
FM -> FM : PREFAILED
note right of RB #LightGreen : duration_sec\nof data buffered

FM -> FM : CONFIRMED
FM -> RB : Flush buffer
RB -> RB : Write pre-fault data
note right of RB #LightBlue : Post-fault\nrecording

... duration_after_sec ...

RB -> RB : Save bag file
note right of RB #LightGreen : Resume\nbuffering
@enduml

**lazy_start: true** - Recording only starts when a fault enters PREFAILED state.
Saves resources but may miss context if fault confirms before buffer fills.

.. uml::

@startuml
skinparam backgroundColor transparent
participant "Ring Buffer" as RB
participant "Fault Manager" as FM

== Node Startup ==
note right of RB #LightGray : Buffer inactive\n(saving resources)

... time passes (no recording) ...

== Fault Detected ==
FM -> FM : PREFAILED
FM -> RB : Start buffer
note right of RB #LightGreen : Recording\nstarts now

... buffer filling (may be < duration_sec) ...

FM -> FM : CONFIRMED
FM -> RB : Flush buffer
RB -> RB : Write pre-fault data
note right of RB #LightBlue : Post-fault\nrecording

... duration_after_sec ...

RB -> RB : Save bag file
note right of RB #LightGray : Buffer\ninactive
@enduml

**When to use lazy_start: true:**

- Production systems with limited resources
- When faults have reliable PREFAILED → CONFIRMED progression
- Systems where most faults are debounced (enter PREFAILED first)

**When to use lazy_start: false:**

- Development and debugging
- When faults may skip PREFAILED state (severity 3 = CRITICAL)
- When maximum fault context is more important than resource usage

.. note::

The ``"mcap"`` format requires ``rosbag2_storage_mcap`` to be installed.
If not available, use ``"sqlite3"`` (default).

.. code-block:: bash

# Install MCAP support (optional)
sudo apt install ros-${ROS_DISTRO}-rosbag2-storage-mcap

Downloading Rosbag Files
^^^^^^^^^^^^^^^^^^^^^^^^

**Via REST API:**

.. code-block:: bash

# Download bag archive (.tar.gz containing full bag directory)
curl -O -J http://localhost:8080/api/v1/faults/MOTOR_OVERHEAT/snapshots/bag

# Extract the archive
tar -xzf fault_MOTOR_OVERHEAT_20260124_153045.tar.gz

# Play back the bag
ros2 bag play fault_MOTOR_OVERHEAT_1735830000/

The REST API returns a compressed tar.gz archive containing the complete bag
directory structure (``metadata.yaml`` and all storage segments). This allows
direct playback with ``ros2 bag play`` after extraction.

**Via ROS 2 service:**

.. code-block:: bash

ros2 service call /fault_manager/get_rosbag ros2_medkit_msgs/srv/GetRosbag \
"{fault_code: 'MOTOR_OVERHEAT'}"

Example: Production Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For production use with conservative resource usage:

.. code-block:: yaml

# config/snapshots.yaml
rosbag:
enabled: true
duration_sec: 3.0
duration_after_sec: 0.5
topics: "config" # Use same topics as JSON snapshots
lazy_start: true # Save resources until fault detected
format: "sqlite3"
max_bag_size_mb: 25
max_total_storage_mb: 200
auto_cleanup: true

# Exclude high-bandwidth topics
# exclude_topics:
# - /camera/image_raw
# - /pointcloud

Example: Debugging Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For development with maximum context:

.. code-block:: yaml

rosbag:
enabled: true
duration_sec: 10.0 # 10 seconds before fault
duration_after_sec: 2.0 # 2 seconds after
topics: "config"
lazy_start: false # Always recording
format: "sqlite3"
storage_path: "/var/log/ros2_medkit/rosbags"
max_bag_size_mb: 100
max_total_storage_mb: 1000
auto_cleanup: false # Keep bags for analysis

See Also
--------

- :doc:`../requirements/specs/faults` - Fault API requirements
- `Gateway README <https://github.com/selfpatch/ros2_medkit/blob/main/src/ros2_medkit_gateway/README.md>`_ - REST API reference
- `config/snapshots.yaml <https://github.com/selfpatch/ros2_medkit/blob/main/src/ros2_medkit_fault_manager/config/snapshots.yaml>`_ - Full configuration reference
19 changes: 19 additions & 0 deletions src/ros2_medkit_fault_manager/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -39,13 +39,17 @@ find_package(SQLite3 REQUIRED)
find_package(nlohmann_json REQUIRED)
# yaml-cpp is required as transitive dependency from ros2_medkit_serialization
find_package(yaml-cpp REQUIRED)
# rosbag2 for time-window snapshot recording
find_package(rosbag2_cpp REQUIRED)
find_package(rosbag2_storage REQUIRED)

# Library target (shared between executable and tests)
add_library(fault_manager_lib STATIC
src/fault_manager_node.cpp
src/fault_storage.cpp
src/sqlite_fault_storage.cpp
src/snapshot_capture.cpp
src/rosbag_capture.cpp
src/correlation/types.cpp
src/correlation/config_parser.cpp
src/correlation/pattern_matcher.cpp
Expand All @@ -61,6 +65,8 @@ ament_target_dependencies(fault_manager_lib PUBLIC
rclcpp
ros2_medkit_msgs
ros2_medkit_serialization
rosbag2_cpp
rosbag2_storage
)

target_link_libraries(fault_manager_lib PUBLIC
Expand Down Expand Up @@ -123,6 +129,11 @@ if(BUILD_TESTING)
target_link_libraries(test_snapshot_capture fault_manager_lib)
ament_target_dependencies(test_snapshot_capture rclcpp ros2_medkit_msgs)

# Rosbag capture tests
ament_add_gtest(test_rosbag_capture test/test_rosbag_capture.cpp)
target_link_libraries(test_rosbag_capture fault_manager_lib)
ament_target_dependencies(test_rosbag_capture rclcpp ros2_medkit_msgs)

# Correlation config parser tests
ament_add_gtest(test_correlation_config_parser test/test_correlation_config_parser.cpp)
target_link_libraries(test_correlation_config_parser fault_manager_lib)
Expand All @@ -143,6 +154,8 @@ if(BUILD_TESTING)
target_link_options(test_sqlite_storage PRIVATE --coverage)
target_compile_options(test_snapshot_capture PRIVATE --coverage -O0 -g)
target_link_options(test_snapshot_capture PRIVATE --coverage)
target_compile_options(test_rosbag_capture PRIVATE --coverage -O0 -g)
target_link_options(test_rosbag_capture PRIVATE --coverage)
target_compile_options(test_correlation_config_parser PRIVATE --coverage -O0 -g)
target_link_options(test_correlation_config_parser PRIVATE --coverage)
target_compile_options(test_pattern_matcher PRIVATE --coverage -O0 -g)
Expand All @@ -161,6 +174,12 @@ if(BUILD_TESTING)
TARGET test_integration
TIMEOUT 60
)

add_launch_test(
test/test_rosbag_integration.test.py
TARGET test_rosbag_integration
TIMEOUT 90
)
endif()

ament_package()
Loading