NickZt/video-yolo-dash-processor

Unified Media Inference Processor

A high-performance C++ video and audio processing pipeline designed to strip out the latency of traditional machine-learning abstractions. This repository demonstrates a highly concurrent, zero-copy architecture that binds YUV hardware decoders directly to ONNX Runtime computer vision models and Whisper ASR.

By bypassing Python abstractions and implementing custom multi-threaded worker pooling, std::unique_ptr-based polymorphic inference strategies, and SEI NAL unit muxing, this processing node delivers substantially higher throughput on edge and local devices.

Key Features

  • Sequential Zero-Shot Triage: Dynamically chain specialized inference engines. A fast YOLO pass locates "objectness" regions of interest, then zero-copy sub-matrices of those regions are fed to a large open-set vision transformer (Grounding DINO), cutting execution time on cropped frames by upwards of 95%.
  • Universal Multiplexed FFmpeg Pipeline: Binds directly to srt://, v4l2://, or .mp4 endpoints without temporary segmented disk caching.
  • H.264 SEI Inference Injection: Embeds frame-level JSON telemetry inside the H.264 video stream as unregistered user data (UUID-tagged SEI NAL units), keeping inference bounds and metrics frame-synced regardless of UDP packet drops over global internet routing.
  • Continuous ASR Batching (Whisper.cpp): Extracts Russian/Ukrainian 16 kHz audio via FFmpeg and pushes it through a low-latency continuous inference queue, streaming subtitle fragments back into the multiplexed SEI output.
  • SOLID Object-Oriented Strategies: A decoupled design lets you plug any ONNX-format execution engine in via the IInferenceStrategy interface and run it asynchronously on the FrameProcessingCommand worker pools.

Sample metrics from running the large Grounding DINO vision transformer with dynamic INT8 quantization:

=== Video Processing Metrics ===
Hardware Concurrency: 20 Cores
Inference Workers: 2 Threads
IntraOp Threads/Worker: 10
Optimal Threads/Worker: 5
Inference Backend: ONNXRuntime CPU (FP32)
Frame Size: 960x540
Tensor Resolution: 800x800
Total Time: 23245 ms
Frames Decoded: 11
Frames Inferred: 10
Frames Encoded: 10
Average FPS: 0.4302
Average Time to Frame (T2F): 2.34479 ms
Average Time to Conversion (TTC): 0.647118 ms
Average Time to Inference (TTI): 4628.88 ms
================================

Dependencies

  • FFmpeg (libavcodec-dev, libavformat-dev, libavutil-dev, libavdevice-dev, libswscale-dev, libswresample-dev)
  • OpenCV (Core & Imgproc)
  • ONNX Runtime (Vanilla C++ Backend)
  • Whisper.cpp (GGML Binary bindings)
  • Nlohmann JSON (Included)

Building the Project

Ensure you have CMake installed and the dependencies properly linked in your system paths.

mkdir build
cd build
cmake ..
make -j$(nproc)

5. Verify the Build (Automated Test Suite)

A Bash validation suite is provided that cycles through multiple pipeline configuration combinations, automatically validating hardware compatibility, sws_scale conversion mapping, native FFmpeg encoding bounds, and the zero-shot triage logic.

chmod +x tests/verify_pipeline.sh
./tests/verify_pipeline.sh

Tested configurations:

  1. Parallel Stream: V4L2 Webcam -> MP4 Storage File
  2. Zero-Shot Triage Mode: V4L2 Webcam -> MP4 Storage File
  3. Transport Protocol: V4L2 Webcam -> srt:// Encapsulation Stream

6. Execution Modes (Zero-Shot Triage)

The VideoProcessor is driven by a unified configuration matrix inside config.json. By default, processing 1920x1080 frames continuously through an immense transformer backbone like Grounding DINO is computationally prohibitive.

To bridge this, we implemented Zero-Shot Triage utilizing YOLO's decoupled target heads. When "mode": "sequential_triage" and "triage_activation_mode": "object" are passed:

  1. YOLO evaluates the full-resolution frame continuously at high speed.
  2. If triage_class_id (e.g. 0 for Person) is detected above the triage_threshold, the exact Region of Interest (ROI) is cropped.
  3. Only the cropped sub-image is fed into Grounding DINO, cutting compute and memory by 90%+, after which the detection coordinates are rescaled back into full-frame space on the final encoded result.mp4.

You can modify these mechanics inside config.json:

{
    "stream": {
        "input": "/dev/video0",   // FFmpeg V4L2 path or srt://
        "output": "out_dir/result.mp4"
    },
    "inference": {
        "mode": "sequential_triage",   // or "parallel"
        "confidence_threshold": 0.35,
        "triage_activation_mode": "object",
        "triage_class_id": 0,
        "triage_threshold": 0.45,
        "ui": {
            "draw_boxes": true,
            "box_color": [0, 255, 0]
        }
    }
}

7. Python Baseline Benchmark Scripts (pybenchcompare)

To easily measure and contrast the high-throughput performance of the ONNX multi-threaded C++ backend against standard Python environments, we provide isolated benchmarking scripts located inside pybenchcompare/.

These Python scripts replicate what the C++ pipeline computes:

  • bench_yolo.py: YOLO segmentation throughput, including simulated mask-overlay region filling.
  • bench_dino.py: transformers-based Grounding DINO latency on the same zero-shot evaluation inputs.
  • bench_whisper.py: openai-whisper throughput on 30-second continuous audio tensors.

Execute the suite directly in a standard Python environment (e.g. python pybenchcompare/bench_yolo.py). The resulting ms/frame output can be compared directly against the metrics.json generated by this C++ processor.


8. Metrics & Output Files

Configuration

The entire pipeline is driven by an external config.json rather than hardcoded C++ parameters or sprawling CLI arguments.

{
  "pipeline": {
    "enable_yolo": true,
    "enable_dino": true,
    "enable_audio_asr": true,
    "use_optimization": true,
    "check_frames_limit": -1
  },
  "models": {
    "yolo_path": "model/yolov8n-seg.onnx",
    "dino_path": "model/groundingdino_int8.onnx",
    "asr_path": "model/ggml-base.bin",
    "dino_prompt": "person . bag .",
    "asr_language": "ru",
    "confidence_threshold": 0.35
  },
  "inference": {
    "mode": "sequential_triage",
    "triage_activation_mode": "objectness",
    "triage_class_id": 0,
    "triage_threshold": 0.45,
    "triage_min_area_percent": 0.05,
    "draw_boxes": true,
    "draw_masks": true,
    "box_color": [0, 255, 0],
    "mask_color": [0, 0, 255]
  }
}

Zero-Shot Triage Explanation

If inference.mode is set to sequential_triage, DINO is bypassed completely unless YOLO detects a high-value signal. You can trigger DINO using objectness mode (requiring a custom exported YOLO ONNX graph to read raw confidence) or object mode (triggering on specific YOLO classes like 0 for Person). See docs/yolo_objectness_export.md for details on modifying Ultralytics to export the objectness tensor.

Execution

The processor primarily expects a config.json and a .env file detailing system network configurations and hardware parameters.

./build/video_processor config.json

Legacy CLI Overrides: If needed, you can bypass the stream configuration in config.json using the legacy positional arguments:

./build/video_processor <input_uri> <output_uri>

SRT Streaming Example:

./build/video_processor "srt://127.0.0.1:8888?mode=listener" "srt://127.0.0.1:9999?mode=caller"

Webcam Object Detection:

./build/video_processor "/dev/video0" "output.mp4"

The FogAI Ecosystem

This repository isn't a standalone toy; it is a dedicated testbed. It is actively used to rigorously stress-test specific computer vision models, engine builds, and sequential triage patterns before they are promoted to the FogAI core. If a strategy (like zero-copy hardware mapping or SEI multiplexing) can't survive here at scale, it has no business being inside an industrial autonomous nervous system.

About

Real-time DASH video processor with YOLO segmentation, serving as a testbed for the unified-media FogAI node. Supports high-performance stream multiplexing (SEI injection) and inference telemetry (OpenAI-compatible SEI) carrying per-detection results: class ID, object type, bounding box X/Y/W/H, and confidence score.
