
Add buffered inference support for SALM models#15364

Open
naymaraq wants to merge 22 commits into main from dkaramyan/streaming-canary-qwen

Conversation


@naymaraq naymaraq commented Feb 6, 2026

Important

The Update branch button must only be pressed in very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?


This PR adds support for buffered inference of SALM models such as nvidia/canary-qwen-2.5b. It uses an incremental audio buffer (defined by buffer_size) to accumulate audio chunks (defined by chunk_size). When the buffer becomes full, a portion is dropped from its beginning (defined by overlap_size); the buffer size must be divisible by both the chunk size and the overlap size. The tokens extracted from successive buffers are then merged using a longest-common-subsequence or longest-common-substring strategy, as sketched below.
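
For intuition, here is a minimal sketch of the buffering scheme described above. It is a simplified illustration assuming 16 kHz mono float32 audio; the class and method names are hypothetical and do not reflect the PR's actual incremental_audio_bufferer API.

```python
# Hypothetical sketch of the buffering scheme described above; names and
# details are illustrative, not the PR's actual bufferer API.
import numpy as np


class ToyIncrementalBufferer:
    def __init__(self, buffer_size: float, chunk_size: float, overlap_size: float, sample_rate: int = 16000):
        # Constraint from the description: buffer size must be divisible
        # by both the chunk size and the overlap size.
        assert buffer_size % chunk_size == 0 and buffer_size % overlap_size == 0
        self.max_samples = int(buffer_size * sample_rate)
        self.drop_samples = int(overlap_size * sample_rate)
        self.buffer = np.zeros(0, dtype=np.float32)

    def push_chunk(self, chunk: np.ndarray) -> np.ndarray:
        """Append one chunk; when the buffer would overflow, first drop
        overlap_size seconds from its beginning."""
        if len(self.buffer) + len(chunk) > self.max_samples:
            self.buffer = self.buffer[self.drop_samples:]
        self.buffer = np.concatenate([self.buffer, chunk])
        return self.buffer  # the current buffer is what gets transcribed
```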

This pipeline is experimental and not yet ready for production use.

Collection: [ASR]

Changelog

  • Add BufferedSALMPipeline, a minimal SALMStreamingState, and factory wiring for ASRDecodingType.SALM.
  • Implement incremental (per-stream and batched) audio buffering and LCS-based token merging (LCSubstring/LCSubsequence).
  • Add an example config for buffered SALM streaming and unit tests for longest_common_substring().

Usage

python examples/asr/asr_streaming_inference/asr_streaming_infer.py \
    --config-path="../conf/asr_streaming_inference/" \
    --config-name=buffered_salm.yaml \
    audio_file=<path to audio file, directory, or manifest.jsonl> \
    output_filename="result.jsonl" \
    asr_output_granularity=segment \
    asr.model_name="nvidia/canary-qwen-2.5b" \
    streaming.batch_size=64 \
    streaming.buffer_size=8.0 \
    streaming.chunk_size=2.0 \
    streaming.overlap_size=4.0 \
    streaming.merging_strategy=lcsubstr
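
For intuition on the lcsubstr strategy, here is a rough sketch of merging token sequences from successive buffers at their longest common substring. The function names echo, but are not guaranteed to match, the utilities in lcs_merge.py.

```python
# Hedged sketch of longest-common-substring token merging; the PR's
# lcs_merge.py utilities may differ in signature and tie-breaking.
from typing import List, Tuple


def longest_common_substring(a: List[int], b: List[int]) -> Tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of the longest run of tokens
    shared by `a` and `b`, via O(len(a) * len(b)) dynamic programming."""
    best_len, best_i, best_j = 0, 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_i, best_j = dp[i][j], i - dp[i][j], j - dp[i][j]
    return best_i, best_j, best_len


def merge_buffers(prev_tokens: List[int], new_tokens: List[int]) -> List[int]:
    """Keep `prev_tokens` through the shared region, then continue with `new_tokens`."""
    i, j, n = longest_common_substring(prev_tokens, new_tokens)
    if n == 0:  # no overlap found: fall back to plain concatenation
        return prev_tokens + new_tokens
    return prev_tokens[:i + n] + new_tokens[j + n:]
```

For example, merge_buffers([1, 2, 3, 4], [3, 4, 5, 6]) returns [1, 2, 3, 4, 5, 6]: the shared run [3, 4] is detected and the two sequences are stitched there.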

Experiments

Evaluations are conducted on the ASR HF Leaderboard datasets. The first row shows the performance of the offline model.
The model used is nvidia/canary-qwen-2.5b.

| Mode | Merging Algo | Buffer Size | Overlap Size | Avg. WER | AMI | Earnings22 | Giga | LS Clean | LS Other | SPGI | Tedlium | VoxPopuli |
|------|--------------|-------------|--------------|----------|-----|------------|------|----------|----------|------|---------|-----------|
| offline | - | - | - | 5.62% | 10.18% | 10.42% | 9.41% | 1.60% | 3.10% | 1.90% | 2.72% | 5.66% |
| buffered | LCSubstring | 4 | 2 | 6.70% | 10.91% | 11.64% | 9.92% | 2.87% | 4.78% | 3.49% | 3.34% | 6.61% |
| buffered | LCSubsequence | 4 | 2 | 10.14% | 13.06% | 15.43% | 12.96% | 6.92% | 9.08% | 6.72% | 6.38% | 10.59% |
| buffered | LCSubstring | 8 | 4 | 5.81% | 10.27% | 10.50% | 9.45% | 1.86% | 3.45% | 2.20% | 2.91% | 5.86% |
| buffered | LCSubsequence | 8 | 4 | 9.38% | 11.51% | 12.96% | 13.71% | 5.54% | 7.23% | 6.64% | 7.70% | 9.74% |
| buffered | LCSubstring | 8 | 1 | 7.73% | 10.93% | 12.25% | 11.35% | 3.50% | 5.30% | 4.22% | 5.03% | 9.29% |
| buffered | LCSubsequence | 8 | 1 | 8.89% | 11.29% | 13.31% | 12.43% | 4.62% | 6.24% | 5.60% | 6.43% | 11.21% |

Key takeaways:

  • For the same buffer and overlap settings, LCSubstring yields substantially lower WER than LCSubsequence
  • Overlap size and buffer size have a meaningful impact on performance
  • The configuration using LCSubstring with buffer size 8 and overlap size 4 provides the best trade-off between streaming capability and accuracy

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

naymaraq added 13 commits February 3, 2026 07:42
@github-actions github-actions bot added the ASR label Feb 6, 2026
@naymaraq naymaraq changed the title Add support for buffered SALM models Add buffered inference support for SALM models Feb 6, 2026
naymaraq and others added 4 commits February 7, 2026 00:25
@naymaraq naymaraq marked this pull request as ready for review February 6, 2026 20:27

Copilot AI left a comment


Pull request overview

Adds experimental buffered/streaming inference support for SALM ASR models by introducing an incremental audio buffering mechanism and token-merging utilities to stitch outputs across overlapping buffers.

Changes:

  • Added BufferedSALMPipeline plus minimal SALMStreamingState and factory wiring for ASRDecodingType.SALM.
  • Implemented incremental (per-stream + batched) audio buffering and LCS-based token merging (LCSubstring/LCSubsequence).
  • Added example config for buffered SALM streaming and unit tests for longest_common_substring().

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| nemo/collections/asr/inference/utils/lcs_merge.py | New LCS merge utilities for stitching token sequences across buffers. |
| tests/collections/asr/inference/test_lcs_merge.py | Unit tests for longest_common_substring(). |
| nemo/collections/asr/inference/utils/enums.py | Adds SALM decoding type and MergingStrategy enum. |
| nemo/collections/asr/inference/streaming/state/salm_state.py | Adds SALM-specific streaming state type (inherits generic StreamingState). |
| nemo/collections/asr/inference/streaming/buffering/incremental_audio_bufferer.py | New incremental audio bufferer (single + batched) for buffered SALM inference. |
| nemo/collections/asr/inference/pipelines/buffered_salm_pipeline.py | New buffered SALM pipeline using incremental buffering + LCS-based token merge. |
| nemo/collections/asr/inference/model_wrappers/salm_asr_inference_wrapper.py | New wrapper around SpeechLM2 SALM for inference/generation. |
| nemo/collections/asr/inference/factory/buffered_pipeline_builder.py | Wires buffered SALM into the buffered pipeline builder. |
| nemo/collections/asr/inference/factory/base_builder.py | Extends ASR model factory to construct SALM wrapper for buffered pipelines. |
| examples/asr/conf/asr_streaming_inference/buffered_salm.yaml | New example config for running buffered SALM streaming inference. |


@github-actions github-actions bot removed the Run CICD label Feb 6, 2026
        asr_class = RNNTInferenceWrapper
    case (ASRDecodingType.SALM, PipelineType.BUFFERED):
        asr_class = SALMASRInferenceWrapper
        # remove decoding_cfg, SALM AED does not use decoding_cfg yet

@pzelasko pzelasko Feb 7, 2026


Technically it can, but the structure will be different: it accepts a transformers.GenerationConfig passed into model.generate. Up to you if you think it makes sense to support that; we can also add this later if needed.

https://huggingface.co/docs/transformers/en/main_classes/text_generation
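
For reference, the Hugging Face pattern being described looks roughly like the sketch below; whether SALM's generate() forwards a generation_config this way is an assumption, so the actual call is left commented.

```python
# Sketch only: the HF GenerationConfig pattern referenced above. Whether the
# SALM wrapper forwards generation_config to model.generate is an assumption.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=128, num_beams=1, do_sample=False)
# tokens = salm_model.generate(prompts=prompts, generation_config=gen_cfg)  # hypothetical call
```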

Collaborator Author


Let's skip this for now


@pzelasko pzelasko left a comment


This is great work @naymaraq!
Does asr/inference have some documentation that we should extend with a description and an example of how to run this? Or do the docs still need to be created?

@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from Copilot AI Feb 8, 2026
naymaraq and others added 3 commits February 8, 2026 08:18

naymaraq commented Feb 8, 2026

> This is great work @naymaraq! Does asr/inference have some documentation that we should extend with a description and an example of how to run this? Or do the docs still need to be created?

We don't have docs yet; they still need to be created.


github-actions bot commented Feb 8, 2026

[🤖]: Hi @naymaraq 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc
