A high-performance C++ video and audio processing pipeline designed to strip out the latency of traditional machine-learning abstractions. This repository demonstrates a highly concurrent, zero-copy architecture that binds YUV hardware decoders directly to ONNX Runtime computer vision models and Whisper ASR.
By removing Python-layer overhead and implementing custom multi-threaded worker pooling, `std::unique_ptr`-based polymorphic inference strategies, and SEI NAL unit muxing, this processing node scales inference throughput on edge/local devices.
- Sequential Zero-Shot Triage: Dynamically chain specialized inference engines. Lightning-fast YOLO layers locate "objectness" regions of interest, then sub-matrices are extracted zero-copy so the massive open-set Vision Transformer (Grounding DINO) runs only on cropped frames, saving upwards of 95% of execution time.
- Universal Multiplexed FFmpeg Pipeline: Binds natively to `srt://`, `v4l2://`, or `.mp4` endpoints without temporary segmented disk caching.
- H.264 SEI Inference Injection: Embeds frame-level JSON telemetry directly inside H.264 video streams as Unregistered User Data (UUID-tagged SEI NAL units), keeping inference bounds and metrics perfectly frame-synced regardless of UDP packet drops over global internet routing.
- Continuous ASR Batching (Whisper.cpp): Extracts Russian/Ukrainian 16 kHz audio from FFmpeg and pushes it into an ultra-low-latency continuous inference queue, streaming subtitle fragments back into the multiplexed SEI data output.
- SOLID Object-Oriented Strategies: Decoupled design patterns allow plugging any arbitrary ONNX execution engine in via the `IInferenceStrategy` interface and running it asynchronously via the `FrameProcessingCommand` worker pools (see the sketch after this list).
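As a rough sketch of that strategy/command split: `IInferenceStrategy` and `FrameProcessingCommand` are the names used above, but the member signatures and the `Detection` struct below are illustrative assumptions, not the repository's exact API.

```cpp
#include <memory>
#include <utility>
#include <vector>
#include <opencv2/core.hpp>

// Illustrative detection record (not the repository's exact type).
struct Detection {
    cv::Rect box;
    int class_id;
    float score;
};

class IInferenceStrategy {
public:
    virtual ~IInferenceStrategy() = default;
    // Run one model pass on a decoded frame and return its detections.
    virtual std::vector<Detection> infer(const cv::Mat& frame) = 0;
};

// A self-contained unit of work a worker-pool thread can execute
// independently of the decoder thread.
class FrameProcessingCommand {
public:
    // The strategy is owned elsewhere (e.g. via std::unique_ptr in the
    // processor); the command only borrows it for one frame.
    FrameProcessingCommand(IInferenceStrategy* strategy, cv::Mat frame)
        : strategy_(strategy), frame_(std::move(frame)) {}

    std::vector<Detection> execute() const { return strategy_->infer(frame_); }

private:
    IInferenceStrategy* strategy_;
    cv::Mat frame_;  // cv::Mat is reference-counted, so this copy is shallow
};
```

Each decoded frame can then be wrapped in a command and queued onto the worker pool, regardless of which ONNX model backs the strategy.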
And here are the metrics from running the massive Grounding DINO vision transformer with dynamic INT8 quantization:
```
=== Video Processing Metrics ===
Hardware Concurrency: 20 Cores
Inference Workers: 2 Threads
IntraOp Threads/Worker: 10
Optimal Threads/Worker: 5
Inference Backend: ONNXRuntime CPU (FP32)
Frame Size: 960x540
Tensor Resolution: 800x800
Total Time: 23245 ms
Frames Decoded: 11
Frames Inferred: 10
Frames Encoded: 10
Average FPS: 0.4302
Average Time to Frame (T2F): 2.34479 ms
Average Time to Conversion (TTC): 0.647118 ms
Average Time to Inference (TTI): 4628.88 ms
================================
```
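For context on these numbers: with 2 inference workers, 10 frames at ~4.63 s of DINO inference each (~46 s of compute) complete in ~23 s of wall-clock time, which is where the ~0.43 FPS figure comes from.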
- FFmpeg (`libavcodec-dev`, `libavformat-dev`, `libavutil-dev`, `libavdevice-dev`, `libswscale-dev`, `libswresample-dev`)
- OpenCV (Core & Imgproc)
- ONNX Runtime (Vanilla C++ Backend)
- Whisper.cpp (GGML Binary bindings)
- Nlohmann JSON (Included)
Ensure you have CMake installed and the dependencies properly linked in your system paths.
```bash
mkdir build
cd build
cmake ..
make -j$(nproc)
```

A Bash validation suite is provided to cycle through multiple pipeline configuration combinations, automatically validating hardware compatibility, OpenCV `sws_scale` mapping, native FFmpeg encoding bounds, and Zero-Shot logic matrices.
```bash
chmod +x tests/verify_pipeline.sh
./tests/verify_pipeline.sh
```

Tested configurations:
- Parallel Stream: V4L2 Webcam -> MP4 Storage File
- Zero-Shot Triage Mode: V4L2 Webcam -> MP4 Storage File
- Transport Protocol: V4L2 Webcam -> `srt://` Encapsulation Stream
The VideoProcessor operates from a unified configuration matrix inside `config.json`. By default, pushing 1920x1080 frames continuously through an immense Transformer backbone like GroundingDINO is computationally prohibitive.
To bridge this, we implemented Zero-Shot Triage utilizing YOLO's decoupled target heads. When `"mode": "sequential_triage"` and `"triage_activation_mode": "object"` are passed:
- YOLO evaluates the full-resolution frame continuously at high speed.
- If `triage_class_id` (e.g. `0` for Person) is detected above the `triage_threshold`, the exact Region of Interest (ROI) is cropped.
- Only the cropped sub-image is injected into GroundingDINO, saving 90%+ of the compute, after which the coordinates are un-scaled back into global frame space for the final encoded `result.mp4` (see the sketch below).
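A minimal sketch of that crop/un-scale step using OpenCV: the 800x800 DINO input size comes from the "Tensor Resolution" metric above, while the function and type names are assumptions for illustration.

```cpp
#include <functional>
#include <vector>
#include <opencv2/imgproc.hpp>

// Illustrative box type; not the repository's exact detection struct.
struct Box { cv::Rect rect; float score; };

std::vector<Box> run_dino_on_roi(const cv::Mat& frame, const cv::Rect& yolo_roi,
                                 const std::function<std::vector<Box>(const cv::Mat&)>& dino) {
    // Clamp the YOLO ROI to the frame and take a zero-copy view of the sub-matrix.
    cv::Rect roi = yolo_roi & cv::Rect(0, 0, frame.cols, frame.rows);
    cv::Mat crop = frame(roi);

    // Resize only the crop to the DINO tensor resolution.
    cv::Mat dino_input;
    cv::resize(crop, dino_input, cv::Size(800, 800));

    // DINO returns boxes in crop-local coordinates; un-scale them back
    // into the coordinates of the full decoded frame.
    std::vector<Box> out = dino(dino_input);
    const float sx = roi.width / 800.0f;
    const float sy = roi.height / 800.0f;
    for (auto& d : out) {
        d.rect = cv::Rect(roi.x + static_cast<int>(d.rect.x * sx),
                          roi.y + static_cast<int>(d.rect.y * sy),
                          static_cast<int>(d.rect.width * sx),
                          static_cast<int>(d.rect.height * sy));
    }
    return out;
}
```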
You can modify these mechanics inside config.json:
```json
{
  "stream": {
    "input": "/dev/video0",        // FFmpeg V4L2 Path or srt://
    "output": "out_dir/result.mp4"
  },
  "inference": {
    "mode": "sequential_triage",   // OR "parallel"
    "confidence_threshold": 0.35,
    "triage_activation_mode": "object",
    "triage_class_id": 0,
    "triage_threshold": 0.45,
    "ui": {
      "draw_boxes": true,
      "box_color": [0, 255, 0]
    }
  }
}
```

To easily measure and contrast the high-throughput performance of the ONNX multi-threaded C++ backend against standard Python environments, we provide isolated benchmarking scripts located inside `pybenchcompare/`.
These Python scripts emulate what the C++ pipeline computes:
- `bench_yolo.py`: YOLO segmentation with simulated frame mask-overlay region-filling throughput.
- `bench_dino.py`: `transformers`-based GroundingDINO latency mapping matching the Zero-Shot evaluation arrays.
- `bench_whisper.py`: `openai-whisper`-based 30-second continuous audio tensor throughput logic.
Execute the suite directly in a standard Python environment (e.g. `python pybenchcompare/bench_yolo.py`). The resulting ms/frame output can be compared directly against the `metrics.json` generated by this C++ processor.
The entire pipeline is driven by an external config.json rather than hardcoded C++ parameters or sprawling CLI arguments.
```json
{
"pipeline": {
"enable_yolo": true,
"enable_dino": true,
"enable_audio_asr": true,
"use_optimization": true,
"check_frames_limit": -1
},
"models": {
"yolo_path": "model/yolov8n-seg.onnx",
"dino_path": "model/groundingdino_int8.onnx",
"asr_path": "model/ggml-base.bin",
"dino_prompt": "person . bag .",
"asr_language": "ru",
"confidence_threshold": 0.35
},
"inference": {
"mode": "sequential_triage",
"triage_activation_mode": "objectness",
"triage_class_id": 0,
"triage_threshold": 0.45,
"triage_min_area_percent": 0.05,
"draw_boxes": true,
"draw_masks": true,
"box_color": [0, 255, 0],
"mask_color": [0, 0, 255]
}
}
```

If `inference.mode` is set to `sequential_triage`, DINO is bypassed completely unless YOLO detects a high-value signal. You can trigger DINO using `objectness` mode (requiring a custom exported YOLO ONNX graph to read raw confidence) or `object` mode (triggering on specific YOLO classes, e.g. `0` for Person). See docs/yolo_objectness_export.md for details on modifying Ultralytics to export the objectness tensor.
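A minimal sketch of how such a triage gate can be expressed: the two activation modes and the threshold/class parameters come from the config above, while the detection fields and function name below are assumptions.

```cpp
#include <algorithm>
#include <vector>

// Illustrative YOLO output record; field names are assumptions.
struct YoloDetection { int class_id; float confidence; float objectness; };

enum class TriageMode { Object, Objectness };

// Decide whether a frame should be escalated from YOLO to GroundingDINO.
bool should_trigger_dino(const std::vector<YoloDetection>& dets, TriageMode mode,
                         int triage_class_id, float triage_threshold) {
    return std::any_of(dets.begin(), dets.end(), [&](const YoloDetection& d) {
        if (mode == TriageMode::Objectness)
            return d.objectness >= triage_threshold;   // raw "objectness" score
        return d.class_id == triage_class_id &&        // e.g. 0 == Person
               d.confidence >= triage_threshold;
    });
}
```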
The processor primarily expects a config.json and a .env file detailing system network configurations and hardware parameters.
```bash
./build/video_processor config.json
```

Legacy CLI Overrides:
If needed, you can bypass the stream configuration in `config.json` using the legacy positional arguments:

```bash
./build/video_processor <input_uri> <output_uri>
```

SRT Streaming Example:

```bash
./build/video_processor "srt://127.0.0.1:8888?mode=listener" "srt://127.0.0.1:9999?mode=caller"
```

Webcam Object Detection:

```bash
./build/video_processor "/dev/video0" "output.mp4"
```

This repository isn't a standalone toy: it is a dedicated testbed. It is actively used to rigorously stress-test specific computer vision models, engine builds, and sequential triage patterns before they are promoted to the FogAI core. If a strategy (like Zero-Copy hardware mapping or SEI multiplexing) can't survive here at scale, it has no business being inside an industrial autonomous nervous system.