Add buffered inference support for SALM models #15364
Conversation
Pull request overview
Adds experimental buffered/streaming inference support for SALM ASR models by introducing an incremental audio buffering mechanism and token-merging utilities to stitch outputs across overlapping buffers.
Changes:
- Added `BufferedSALMPipeline` plus a minimal `SALMStreamingState` and factory wiring for `ASRDecodingType.SALM`.
- Implemented incremental (per-stream + batched) audio buffering and LCS-based token merging (LCSubstring/LCSubsequence), sketched below.
- Added an example config for buffered SALM streaming and unit tests for `longest_common_substring()`.
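For reference, the merging idea can be sketched as follows. This is a minimal illustration, not the PR's implementation: `longest_common_substring()` mirrors the name of the tested utility, while `merge_tokens` is a hypothetical helper showing how the shared run stitches two overlapping buffer outputs.

```python
def longest_common_substring(a: list[int], b: list[int]) -> tuple[int, int, int]:
    """Classic DP: return (start_in_a, start_in_b, length) of the longest
    contiguous run of tokens shared by a and b."""
    best = (0, 0, 0)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best[2]:
                    best = (i - dp[i][j], j - dp[i][j], dp[i][j])
    return best


def merge_tokens(prev: list[int], new: list[int]) -> list[int]:
    """Hypothetical merge: stitch two overlapping buffer outputs at their
    longest shared token run."""
    i, j, n = longest_common_substring(prev, new)
    if n == 0:
        return prev + new  # no overlap found; fall back to concatenation
    return prev[: i + n] + new[j + n :]


# e.g. merge_tokens([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]) -> [1, 2, 3, 4, 5, 6, 7]
```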
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `nemo/collections/asr/inference/utils/lcs_merge.py` | New LCS merge utilities for stitching token sequences across buffers. |
| `tests/collections/asr/inference/test_lcs_merge.py` | Unit tests for `longest_common_substring()`. |
| `nemo/collections/asr/inference/utils/enums.py` | Adds SALM decoding type and `MergingStrategy` enum. |
| `nemo/collections/asr/inference/streaming/state/salm_state.py` | Adds SALM-specific streaming state type (inherits generic `StreamingState`). |
| `nemo/collections/asr/inference/streaming/buffering/incremental_audio_bufferer.py` | New incremental audio bufferer (single + batched) for buffered SALM inference. |
| `nemo/collections/asr/inference/pipelines/buffered_salm_pipeline.py` | New buffered SALM pipeline using incremental buffering + LCS-based token merge. |
| `nemo/collections/asr/inference/model_wrappers/salm_asr_inference_wrapper.py` | New wrapper around SpeechLM2 SALM for inference/generation. |
| `nemo/collections/asr/inference/factory/buffered_pipeline_builder.py` | Wires buffered SALM into the buffered pipeline builder. |
| `nemo/collections/asr/inference/factory/base_builder.py` | Extends the ASR model factory to construct the SALM wrapper for buffered pipelines. |
| `examples/asr/conf/asr_streaming_inference/buffered_salm.yaml` | New example config for running buffered SALM streaming inference. |
```python
            asr_class = RNNTInferenceWrapper
        case (ASRDecodingType.SALM, PipelineType.BUFFERED):
            asr_class = SALMASRInferenceWrapper
            # remove decoding_cfg, SALM AED does not use decoding_cfg yet
```
Technically it can, but the structure will be different: it accepts `transformers.GenerationConfig` into `model.generate`. Up to you if you think it makes sense to support that; we can also add this later if needed.
https://huggingface.co/docs/transformers/en/main_classes/text_generation

Let's skip this for now.
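For reference, the reviewer's suggestion would look roughly like the sketch below. This is hypothetical only: the SALM wrapper's `generate()` signature here is assumed, not taken from the PR; only the `transformers.GenerationConfig` class itself is a real API.

```python
# Hypothetical sketch: forwarding a transformers.GenerationConfig to
# model.generate(), as the reviewer suggests. Not part of this PR.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=128, do_sample=False)
# The SALM wrapper would need to accept and forward this, e.g.:
# tokens = salm_model.generate(prompts=inputs, generation_config=gen_cfg)
```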
We don't have docs yet. They need to be created.
Important

The `Update branch` button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
This PR adds support for buffered inference of SALM models such as `nvidia/canary-qwen-2.5b`. It uses an incremental audio buffer (defined by `buffer_size`) to accumulate audio chunks (defined by `chunk_size`). When the buffer becomes full, a portion is dropped from the beginning (defined by `overlap_size`). The buffer size must be divisible by both the chunk size and the overlap size. The tokens extracted from subsequent buffers are then merged using longest-common-subsequence or longest-common-substring strategies, as sketched below.

This pipeline is experimental and not yet ready for production use.
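A minimal sketch of the buffering rule described above (the class and method names here are hypothetical; the PR's actual class is the incremental audio bufferer and may differ in detail):

```python
import numpy as np


class SimpleIncrementalBuffer:
    """Accumulates audio chunks into a rolling buffer. When the buffer
    would overflow, overlap_size seconds are dropped from the front,
    per the PR description. All sizes are in seconds."""

    def __init__(self, buffer_size: float, chunk_size: float,
                 overlap_size: float, sample_rate: int = 16000):
        # The buffer must be divisible by both the chunk and the overlap.
        assert buffer_size % chunk_size == 0 and buffer_size % overlap_size == 0
        self.max_samples = int(buffer_size * sample_rate)
        self.drop_samples = int(overlap_size * sample_rate)
        self.buffer = np.zeros(0, dtype=np.float32)

    def push(self, chunk: np.ndarray) -> np.ndarray:
        """Append one chunk and return the buffer to run inference on."""
        if self.buffer.size + chunk.size > self.max_samples:
            # Buffer full: drop overlap_size seconds from the beginning.
            self.buffer = self.buffer[self.drop_samples:]
        self.buffer = np.concatenate([self.buffer, chunk])
        return self.buffer
```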
Collection: [ASR]
Usage

```bash
python examples/asr/asr_streaming_inference/asr_streaming_infer.py \
    --config-path="../conf/asr_streaming_inference/" \
    --config-name=buffered_salm.yaml \
    audio_file=<path to audio file, directory, or manifest.jsonl> \
    output_filename="result.jsonl" \
    asr_output_granularity=segment \
    asr.model_name="nvidia/canary-qwen-2.5b" \
    streaming.batch_size=64 \
    streaming.buffer_size=8.0 \
    streaming.chunk_size=2.0 \
    streaming.overlap_size=4.0 \
    streaming.merging_strategy=lcsubstr
```
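With the settings above, audio is consumed in 2.0 s chunks into an 8.0 s buffer; each time the buffer fills, 4.0 s is dropped from its start, and the outputs of successive buffers are stitched with the longest-common-substring (`lcsubstr`) strategy.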
Experiments

Evaluations are conducted on the ASR HF Leaderboard datasets; the first line shows the performance of the offline model. The model used is `nvidia/canary-qwen-2.5b`.

Key takeaways: