feat: Linux introspection plugins (procfs, systemd, container)#263

Draft
bburda wants to merge 28 commits into main from feature/linux-introspection
bburda (Collaborator) commented Mar 13, 2026

Pull Request

Summary

Add three Linux introspection plugins that enrich gateway discovery with OS-level metadata. Each plugin implements IntrospectionProvider and registers vendor REST endpoints on Apps and Components:

  • procfs (libprocfs_introspection.so) - reads /proc for process info (PID, RSS, CPU ticks, threads, exe path, cmdline)
  • systemd (libsystemd_introspection.so) - maps ROS 2 nodes to systemd units via sd_pid_get_unit(), queries properties via sd-bus
  • container (libcontainer_introspection.so) - detects Docker/podman/containerd via cgroup path analysis, reads cgroup v2 resource limits

Also includes:

  • Shared static library libmedkit_linux_utils.a with proc_reader, cgroup_reader, and PidCache (TTL-based, thread-safe)
  • PluginContext::get_child_apps() for Component-level aggregation
  • Gateway header exports for downstream plugin packages
  • libsystemd-dev added to devcontainer Dockerfile
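
To make the procfs bullet above more concrete, here is a minimal sketch of how the PID, CPU ticks, thread count, and RSS could be pulled from a `/proc/<pid>/stat` line. It is not the code from this PR: `StatFields` and `parse_stat_line` are hypothetical names, and the field positions (14 = utime, 15 = stime, 20 = num_threads, 24 = rss in pages) follow the proc(5) man page. The process name is located via the first `(` and the last `)` because it may itself contain spaces or parentheses.

```cpp
#include <cstdint>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical subset of the fields found in /proc/<pid>/stat.
struct StatFields {
  int pid = 0;
  std::string comm;
  char state = '?';
  std::uint64_t utime_ticks = 0;  // field 14
  std::uint64_t stime_ticks = 0;  // field 15
  long num_threads = 0;           // field 20
  long rss_pages = 0;             // field 24 (pages, not bytes)
};

std::optional<StatFields> parse_stat_line(const std::string& line) {
  // comm is wrapped in parentheses and may contain spaces or ')' itself,
  // so split on the first '(' and the last ')'.
  const auto open = line.find('(');
  const auto close = line.rfind(')');
  if (open == std::string::npos || close == std::string::npos || close < open) {
    return std::nullopt;
  }

  StatFields out;
  out.comm = line.substr(open + 1, close - open - 1);

  // Everything after ')' is whitespace-separated, starting at field 3.
  std::istringstream rest(line.substr(close + 1));
  std::vector<std::string> f;
  for (std::string tok; rest >> tok;) {
    f.push_back(tok);
  }
  if (f.size() < 22) {  // need up to field 24, i.e. index 21
    return std::nullopt;
  }

  try {
    out.pid = std::stoi(line.substr(0, open));
    out.state = f[0][0];                   // field 3
    out.utime_ticks = std::stoull(f[11]);  // field 14
    out.stime_ticks = std::stoull(f[12]);  // field 15
    out.num_threads = std::stol(f[17]);    // field 20
    out.rss_pages = std::stol(f[21]);      // field 24
  } catch (const std::exception&) {
    return std::nullopt;  // non-numeric or out-of-range field
  }
  return out;
}
```

Converting `rss_pages` to bytes would additionally need the page size (for example from `sysconf(_SC_PAGESIZE)`), and CPU ticks need `sysconf(_SC_CLK_TCK)` to become seconds.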

Issue


Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

Unit tests (31 new, all in ros2_medkit_linux_introspection):

  • test_proc_reader - real /proc/self + synthetic /proc in tmpdir (4 tests)
  • test_cgroup_reader - container ID extraction, runtime detection, resource limits (10 tests)
  • test_pid_cache - TTL refresh, auto-refresh, missing nodes, empty/nonexistent proc dirs (6 tests)
  • test_procfs_plugin - JSON serialization (1 test)
  • test_systemd_plugin - JSON serialization, graceful skip (2 tests)
  • test_container_plugin - JSON serialization, not-containerized skip (2 tests)

Integration tests (launch_testing):

  • test_procfs_introspection - live PID mapping, resource usage, Component aggregation, capabilities, 404
  • test_combined_introspection - procfs + container route isolation (200 + 404 coexistence on host)

Docker integration tests (standalone pytest):

  • test_systemd_introspection - unit info, restart count, watchdog, aggregation
  • test_container_introspection - container ID, runtime, memory/CPU limits, aggregation

Full suite: 1302 unit tests pass, 2066 lint tests pass, 0 failures.


Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

bburda added 28 commits March 13, 2026 08:36
…nodes section

The param names in the launch/YAML/CLI examples were already correct
(discovery.mode, discovery.manifest_path), so no changes needed there.

Replaced the "Handling Unmanifested Nodes" section which documented the
nonexistent config.unmanifested_nodes parameter (with ignore/warn/error/
include_as_orphan policies) with "Controlling Gap-Fill in Hybrid Mode"
documenting the actual discovery.merge_pipeline.gap_fill.* parameters.

Added a note block to the Runtime Linking section explaining the layered
merge pipeline architecture.
…info format

- Add missing endpoint categories to handle_root: logs, bulk-data,
  cyclic-subscriptions, updates (conditional), DELETE /faults (global)
- Remove ghost snapshot endpoints (listed but never registered)
- Add missing capabilities: logs, bulk_data, cyclic_subscriptions, updates
- Fix hardcoded version "0.1.0" -> "0.3.0" in handle_root and version-info
- Change version-info response key from "sovd_info" to "items" (SOVD standard)
- Add bulk-data, logs, cyclic-subscriptions URIs to entity capability responses
- Update rest.rst: fix Server Capabilities example format, remove phantom
  /manifest/status, document DELETE /{entity}/faults, update SOVD compliance
  section, add areas/functions resource collection notes
- Update tests, integration tests, and Postman collection for sovd_info->items
Code fixes:
- Remove areas/functions bulk-data from handle_root (validation rejects them)
- Rename test HandleVersionInfoContainsSovdInfoArray -> HandleVersionInfoContainsItemsArray
- Fix test_root_endpoint_includes_snapshots: verify legacy snapshot endpoints are NOT listed

Docs fixes:
- rest.rst: fix self -> href in area/component list examples
- rest.rst: remove /bulk-data from areas and functions resource collections
- plugin-system.rst: remove LogProvider include/export from UpdateProvider example
- plugin-system.rst: clarify IntrospectionProvider metadata is plugin-internal
- discovery-options.rst: fix Field Groups table (status, metadata fields)
- discovery-options.rst: fix health endpoint JSON to match MergeReport::to_json()
- manifest-discovery.rst: fix gap-fill disabled description
- rest.rst: add /version-info example response, remove stale `area` field
  from components list example
- rest.rst: document sovd_info -> items rename in CHANGELOG as breaking change
- discovery-options.rst: add local TOC, note case-sensitivity for policy
  values, fix strategy name "HybridDiscoveryStrategy" -> "hybrid"
- manifest-discovery.rst, migration-to-manifest.rst: fix jq commands
  (.[] -> .items[])
- CHANGELOG.rst: add Breaking Changes section and new 0.3.0 features
- test_plugin_vendor_extensions.test.py: add @verifies REQ_INTEROP_003
refresh_cache() was calling discover_topic_components() for both
RUNTIME_ONLY and MANIFEST_ONLY modes. In manifest_only mode this added
synthetic components from the runtime ROS 2 graph, violating the intent
of "only manifest entities."

Invert the condition so only RUNTIME_ONLY merges topic components.
MANIFEST_ONLY and HYBRID both use discover_components() directly.
…ion test

Rename 3 unit tests referencing old "SovdEntry" naming to "ItemsEntry"
to match the sovd_info->items rename in handle_version_info.

Add regression test verifying that topic-based components do not leak
into the entity cache in manifest_only discovery mode (validates the
fix in gateway_node.cpp discover_components).
…tions

Enable resource collections (data, operations, configurations, faults,
logs, bulk-data) on areas and functions. SOVD defines these only for
apps/components - this is a pragmatic ros2_medkit extension.

Add log routes for areas (namespace prefix match) and functions
(aggregate from hosted apps). Update capability responses to include
logs and bulk-data URIs. Fix entity_capabilities.cpp to match actual
route registrations.
Document ros2_medkit's pragmatic approach to SOVD - we extend the spec
where ROS 2 use cases benefit (resource collections on areas/functions,
x-medkit vendor extensions). Add resource collection support matrix.

Fix incorrect claims about areas supporting same collections as
components. Add changelog entries for area/function log endpoints.
…t, docs

Code fixes:
- Fix faults sampler to scope by entity type (AREA: namespace, FUNCTION:
  host FQNs, COMPONENT: app FQNs) matching REST handler behavior
- Fix logs sampler to use prefix/exact matching per entity type, matching
  log_handlers.cpp scoping logic
- Hoist duplicated severity/context parameter validation in log_handlers
- Add area/function bulk-data endpoints to handle_root endpoint list

Docs fixes:
- Fix SOVD Compliance RST heading level (~~~ -> --- for h2)
- Update Logs Endpoints section to mention areas and functions
- Restore See Also cross-references (authentication, server config)
- Fix em dashes to hyphens in log configuration section
In manifest_only discovery mode, Apps never get bound_fqn set because
runtime_linker only runs in hybrid mode. This caused handlers, samplers,
and configuration aggregation to silently return empty results for all
App-based lookups (logs, faults, configurations).

Add App::effective_fqn() that prefers bound_fqn when available, falling
back to deriving the FQN from ros_binding (namespace_pattern + node_name).
Replace all direct bound_fqn accesses in handler_context, log_handlers,
fault_handlers, gateway_node samplers, thread_safe_entity_cache config
aggregation, and plugin_context with effective_fqn() calls.

Update test_bulk_data_api for areas now returning 200 (entity capabilities
extended) and test_scenario_discovery_manifest timeout for log aggregation.
…ering

- Fix effective_fqn() to prepend "/" when namespace_pattern is empty,
  ensuring valid ROS 2 FQNs for fault filtering and bulk-data scoping
- Reject glob patterns (containing "*") in effective_fqn() to prevent
  garbage FQNs from namespace patterns like "**" or "prefix*"
- Add BULK_DATA and CYCLIC_SUBSCRIPTIONS to CapabilityBuilder enum and
  include them in caps vectors for all entity types in discovery handlers
- Extract filter_faults_by_fqns() helper to eliminate duplication between
  FUNCTION and COMPONENT fault filtering blocks in gateway_node.cpp
- Add EXPECT_FALSE(is_aggregated(BULK_DATA)) assertions for AREA/FUNCTION
- Add effective_fqn() unit tests covering empty namespace, wildcards, globs
- Add comments explaining unconditional bulk-data/cyclic endpoints in
  handle_root (depend on fault_manager, not optional plugins)
- Update bulk-data handler with get_source_filters() for function aggregation
- Fix integration test docstrings for bulk-data entity type coverage
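
The `effective_fqn()` behaviour described in the two commits above (prefer `bound_fqn`, otherwise derive from `ros_binding`, prepend "/" for an empty namespace, reject globs) can be sketched roughly as follows. The `App` and `RosBinding` structs here are simplified, hypothetical stand-ins for the gateway's types, and returning an empty string for a rejected pattern is an assumption; the commit messages do not say what the real method returns in that case.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins for the gateway's manifest types.
struct RosBinding {
  std::string namespace_pattern;
  std::string node_name;
};

struct App {
  std::optional<std::string> bound_fqn;   // set by the runtime linker (hybrid mode)
  std::optional<RosBinding> ros_binding;  // declared in the manifest

  // Prefer the runtime-bound FQN; otherwise derive one from the manifest binding.
  // Assumption: an empty string signals "no usable FQN".
  std::string effective_fqn() const {
    if (bound_fqn && !bound_fqn->empty()) {
      return *bound_fqn;
    }
    if (!ros_binding || ros_binding->node_name.empty()) {
      return "";
    }
    const std::string& ns = ros_binding->namespace_pattern;
    if (ns.find('*') != std::string::npos) {
      return "";  // glob patterns such as "**" or "prefix*" cannot yield a concrete FQN
    }
    if (ns.empty() || ns == "/") {
      return "/" + ros_binding->node_name;  // root namespace
    }
    std::string fqn = ns;
    if (fqn.front() != '/') {
      fqn.insert(fqn.begin(), '/');
    }
    if (fqn.back() != '/') {
      fqn.push_back('/');
    }
    return fqn + ros_binding->node_name;
  }
};
```

With this shape, callers in manifest_only mode get a usable FQN even though `bound_fqn` is never populated.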
…ages

Add install(DIRECTORY include/) and ament_export_include_directories()
so external packages can find_package(ros2_medkit_gateway) and get
plugin interface headers (GatewayPlugin, IntrospectionProvider, etc.)
and vendored tl::expected.
…egation

Add method to enumerate child Apps for a Component via entity cache.
Needed by introspection plugins for Component-level vendor endpoints.
Add new ROS 2 package for Linux introspection plugins with:
- Static utility library (medkit_linux_utils)
- Three MODULE plugin targets (procfs, systemd, container)
- Stub source files and tests
- CMake config with fPIC, ccache, linting
- read_process_info: parse /proc/{pid}/stat, status, cmdline, exe
- find_pid_for_node: scan /proc for ROS 2 nodes by __node:= and __ns:= args
- Tests use both real /proc/self and synthetic /proc in tmpdir
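
As an illustration of the `__node:=` / `__ns:=` matching mentioned above, the sketch below splits the NUL-separated contents of `/proc/<pid>/cmdline` and looks for the two remap arguments. The signatures and the `RosNodeArgs` struct are hypothetical and not taken from this PR; the sketch also assumes that a process without `__ns:=` runs in the root namespace and that scanning all arguments (rather than only those after `--ros-args`) is acceptable.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result type: node name plus namespace (root by default).
struct RosNodeArgs {
  std::string node_name;
  std::string ns = "/";
};

// /proc/<pid>/cmdline stores argv as NUL-separated strings.
std::vector<std::string> split_cmdline(std::string_view raw) {
  std::vector<std::string> args;
  std::size_t start = 0;
  while (start < raw.size()) {
    std::size_t end = raw.find('\0', start);
    if (end == std::string_view::npos) {
      end = raw.size();
    }
    args.emplace_back(raw.substr(start, end - start));
    start = end + 1;
  }
  return args;
}

// Returns the node identity if a "__node:=<name>" argument is present.
std::optional<RosNodeArgs> parse_ros_args(const std::vector<std::string>& args) {
  constexpr std::string_view kNode = "__node:=";
  constexpr std::string_view kNs = "__ns:=";
  RosNodeArgs out;
  bool found_node = false;
  for (const auto& a : args) {
    const std::string_view arg(a);
    if (arg.substr(0, kNode.size()) == kNode) {
      out.node_name = std::string(arg.substr(kNode.size()));
      found_node = !out.node_name.empty();
    } else if (arg.substr(0, kNs.size()) == kNs) {
      out.ns = std::string(arg.substr(kNs.size()));
    }
  }
  if (!found_node) {
    return std::nullopt;
  }
  return out;
}
```

A scanner would call this for every numeric directory under `/proc` and compare the resulting namespace and name against the node it is looking for.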
- extract_container_id: Docker, podman, containerd cgroup path patterns
- detect_runtime: identify container runtime from cgroup path
- is_containerized: check if PID runs in a container
- read_cgroup_info: cgroup v2 resource limits (memory.max, cpu.max)
- All tests use synthetic cgroup filesystem in tmpdir
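
A rough sketch of the cgroup path analysis listed above: the container ID is taken to be a run of exactly 64 hex characters in the cgroup path, and the runtime is guessed from well-known path markers (`docker`, `libpod-` for podman, `cri-containerd-` for containerd). These patterns are common conventions and an assumption here; the exact patterns and signatures used by this PR's `extract_container_id` and `detect_runtime` are not shown in the commit messages.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Returns the first run of exactly 64 hex characters, if any.
std::optional<std::string> extract_container_id(const std::string& cgroup_path) {
  constexpr std::size_t kIdLength = 64;
  std::size_t run = 0;
  for (std::size_t i = 0; i <= cgroup_path.size(); ++i) {
    const bool is_hex =
        i < cgroup_path.size() &&
        std::isxdigit(static_cast<unsigned char>(cgroup_path[i])) != 0;
    if (is_hex) {
      ++run;
      continue;
    }
    // A hex run just ended (or the string ended): check its length.
    if (run == kIdLength) {
      return cgroup_path.substr(i - kIdLength, kIdLength);
    }
    run = 0;
  }
  return std::nullopt;
}

// Guesses the runtime from markers in the path; "unknown" if none match.
std::string detect_runtime(const std::string& cgroup_path) {
  if (cgroup_path.find("libpod-") != std::string::npos) {
    return "podman";
  }
  if (cgroup_path.find("cri-containerd-") != std::string::npos) {
    return "containerd";
  }
  if (cgroup_path.find("docker") != std::string::npos) {
    return "docker";
  }
  return "unknown";
}
```

For example, a path such as `/system.slice/docker-<id>.scope` would yield the ID and `"docker"`, while a plain user session scope yields no ID.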
- Scans /proc for ROS 2 nodes by parsing cmdline args
- Thread-safe with shared_mutex (concurrent reads, exclusive refresh)
- Auto-refresh on TTL expiry during lookup
- Tests with synthetic /proc in tmpdir
- Move parse_ros_args out of anonymous namespace for PidCache access
- Change TTL type to steady_clock::duration for sub-second precision
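
The locking scheme described above (concurrent reads, exclusive refresh, auto-refresh on TTL expiry) might look roughly like the sketch below. `PidCacheSketch` is a hypothetical class, not the PR's `PidCache`: the `/proc` scan is injected as a callable so the example stays self-contained, and the cache maps a node FQN to a PID.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <utility>

class PidCacheSketch {
 public:
  using Clock = std::chrono::steady_clock;
  using Snapshot = std::map<std::string, int>;  // node FQN -> PID
  using Scanner = std::function<Snapshot()>;    // stand-in for the /proc scan

  PidCacheSketch(Clock::duration ttl, Scanner scanner)
      : ttl_(ttl), scanner_(std::move(scanner)) {}

  std::optional<int> lookup(const std::string& node_fqn) {
    {
      // Fast path: many readers may hold the shared lock at once.
      std::shared_lock<std::shared_mutex> read_lock(mutex_);
      if (is_fresh()) {
        return find(node_fqn);
      }
    }
    // Slow path: refresh under the exclusive lock.
    std::unique_lock<std::shared_mutex> write_lock(mutex_);
    if (!is_fresh()) {  // re-check: another thread may have refreshed already
      pids_ = scanner_();
      last_refresh_ = Clock::now();
      has_data_ = true;
    }
    return find(node_fqn);
  }

 private:
  // Both helpers expect the caller to hold mutex_ (shared or exclusive).
  bool is_fresh() const {
    return has_data_ && (Clock::now() - last_refresh_) < ttl_;
  }
  std::optional<int> find(const std::string& node_fqn) const {
    const auto it = pids_.find(node_fqn);
    if (it == pids_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

  std::shared_mutex mutex_;
  Clock::duration ttl_;
  Scanner scanner_;
  Snapshot pids_;
  Clock::time_point last_refresh_{};
  bool has_data_ = false;
};
```

Because `std::shared_mutex` is neither copyable nor movable, a class like this cannot be move-assigned, which is consistent with the later commit that stores the cache behind a `unique_ptr`.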
- IntrospectionProvider: detects containerized Apps via cgroup path analysis
- Supports Docker, podman, containerd runtime detection
- Reads cgroup v2 resource limits (memory.max, cpu.max)
- Vendor routes: GET /apps/{id}/x-medkit-container, GET /components/{id}/x-medkit-container
- 404 when node not containerized, Component aggregation by container ID
- PidCache stored as unique_ptr to avoid shared_mutex move-assignment issue
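
For the cgroup v2 limits mentioned above, the two files have simple text formats: `memory.max` holds either `max` (unlimited) or a byte count, and `cpu.max` holds `<quota> <period>` in microseconds, where the quota may also be `max`. The sketch below parses the file contents only; the function names are hypothetical, and expressing the CPU limit as a core count (quota divided by period) is an illustrative choice rather than the format the plugin necessarily reports.

```cpp
#include <cstdint>
#include <exception>
#include <optional>
#include <sstream>
#include <string>

namespace detail {
// True if the token is a non-empty string of decimal digits.
inline bool is_all_digits(const std::string& tok) {
  return !tok.empty() && tok.find_first_not_of("0123456789") == std::string::npos;
}
}  // namespace detail

// memory.max: "max" means unlimited (nullopt); otherwise a byte count.
std::optional<std::uint64_t> parse_memory_max(const std::string& content) {
  std::istringstream in(content);
  std::string tok;
  if (!(in >> tok) || !detail::is_all_digits(tok)) {
    return std::nullopt;
  }
  try {
    return std::stoull(tok);
  } catch (const std::exception&) {
    return std::nullopt;  // value does not fit in 64 bits
  }
}

// cpu.max: "<quota> <period>" in microseconds; quota "max" means unlimited.
// Returns the limit as a number of CPU cores (quota / period).
std::optional<double> parse_cpu_max_cores(const std::string& content) {
  std::istringstream in(content);
  std::string quota;
  std::string period;
  if (!(in >> quota >> period) || !detail::is_all_digits(quota) ||
      !detail::is_all_digits(period)) {
    return std::nullopt;
  }
  try {
    const auto q = std::stoull(quota);
    const auto p = std::stoull(period);
    if (p == 0) {
      return std::nullopt;
    }
    return static_cast<double>(q) / static_cast<double>(p);
  } catch (const std::exception&) {
    return std::nullopt;
  }
}
```

A container started with a 512 MiB memory limit and half a CPU would typically show `536870912` in `memory.max` and `50000 100000` in `cpu.max`.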
- IntrospectionProvider: maps Apps to PIDs, returns ProcessInfo metadata
- Vendor routes: GET /apps/{id}/x-medkit-procfs, GET /components/{id}/x-medkit-procfs
- Component aggregation: unique processes with node_ids lists
- PidCache for efficient /proc scanning with configurable TTL
- 503 on PID lookup failure (transient - node may have crashed)
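
The Component aggregation above ("unique processes with node_ids lists") amounts to grouping a Component's child Apps by PID, since several nodes can share one process. A minimal sketch, assuming the PID lookup has already happened and using hypothetical names (`ProcessGroup`, `group_by_pid`):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aggregated entry: one process and the Apps (nodes) it hosts.
struct ProcessGroup {
  int pid;
  std::vector<std::string> node_ids;
};

// Groups (app id, pid) pairs so each PID appears once with all of its app ids.
// Output is ordered by PID; ids keep their input order within a group.
std::vector<ProcessGroup> group_by_pid(
    const std::vector<std::pair<std::string, int>>& app_pids) {
  std::map<int, std::vector<std::string>> by_pid;
  for (const auto& [app_id, pid] : app_pids) {
    by_pid[pid].push_back(app_id);
  }
  std::vector<ProcessGroup> groups;
  groups.reserve(by_pid.size());
  for (auto& [pid, ids] : by_pid) {
    groups.push_back(ProcessGroup{pid, std::move(ids)});
  }
  return groups;
}
```

The container and systemd plugins describe the same idea with a different key (container ID and unit name, respectively).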
- IntrospectionProvider: maps Apps to systemd units via sd_pid_get_unit
- Queries unit properties via sd-bus (ActiveState, SubState, NRestarts, WatchdogUSec)
- Vendor routes: GET /apps/{id}/x-medkit-systemd, GET /components/{id}/x-medkit-systemd
- 404 when node not in a systemd unit, Component aggregation by unit name
- Configuration, API reference with curl examples, requirements
- Troubleshooting for PID lookup, permissions, systemd access
- Added to tutorials toctree after plugin-system
- procfs: PID mapping, resource usage, Component aggregation, 404 on nonexistent, capabilities
- combined: procfs + container route isolation (200 + 404 coexistence on host)
- systemd: unit info, restart count, watchdog, non-unit 404
- container: container ID, runtime detection, resource limits from cgroup
- Dockerfile.systemd with systemd as PID 1, unit files for demo nodes
- Dockerfile.container for container detection with resource limits
- Runner script for CI integration
Required for building the systemd_introspection plugin which uses
sd-bus API (sd_pid_get_unit, sd_bus_open_system, etc.).
bburda self-assigned this Mar 13, 2026

Development

Successfully merging this pull request may close these issues.

Discovery: Plugin system for platform-specific introspection