
Add buffered inference support for SALM models#15364

Open
naymaraq wants to merge 22 commits into main from dkaramyan/streaming-canary-qwen

Conversation


@naymaraq naymaraq commented Feb 6, 2026

Important

The Update branch button must only be pressed in very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?


This PR adds support for buffered inference of SALM models such as nvidia/canary-qwen-2.5b. It uses an incremental audio buffer (defined by buffer_size) to accumulate audio chunks (defined by chunk_size). When the buffer becomes full, a portion is dropped from its beginning (defined by overlap_size); the buffer size must be divisible by both the chunk size and the overlap size. The tokens extracted from successive buffers are then merged using a longest-common-subsequence or longest-common-substring strategy, as sketched below.
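
For intuition, here is a minimal sketch of the buffering scheme described above. It is a simplified illustration assuming 16 kHz mono float32 audio; the class and method names are hypothetical and do not reflect the PR's actual incremental_audio_bufferer API.

```python
# Hypothetical sketch of the buffering scheme described above; names and
# details are illustrative, not the PR's actual bufferer API.
import numpy as np


class ToyIncrementalBufferer:
    def __init__(self, buffer_size: float, chunk_size: float, overlap_size: float, sample_rate: int = 16000):
        # Constraint from the description: buffer size must be divisible
        # by both the chunk size and the overlap size.
        assert buffer_size % chunk_size == 0 and buffer_size % overlap_size == 0
        self.max_samples = int(buffer_size * sample_rate)
        self.drop_samples = int(overlap_size * sample_rate)
        self.buffer = np.zeros(0, dtype=np.float32)

    def push_chunk(self, chunk: np.ndarray) -> np.ndarray:
        """Append one chunk; when the buffer would overflow, first drop
        overlap_size seconds from its beginning."""
        if len(self.buffer) + len(chunk) > self.max_samples:
            self.buffer = self.buffer[self.drop_samples:]
        self.buffer = np.concatenate([self.buffer, chunk])
        return self.buffer  # the current buffer is what gets transcribed
```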

This pipeline is experimental and not yet ready for production use.

Collection: [ASR]

Changelog

  • Add BufferedSALMPipeline, a minimal SALMStreamingState, and factory wiring for ASRDecodingType.SALM.
  • Implement incremental (per-stream and batched) audio buffering and LCS-based token merging (LCSubstring/LCSubsequence).
  • Add an example config for buffered SALM streaming and unit tests for longest_common_substring().

Usage

python examples/asr/asr_streaming_inference/asr_streaming_infer.py \
    --config-path="../conf/asr_streaming_inference/" \
    --config-name=buffered_salm.yaml \
    audio_file=<path to audio file, directory, or manifest.jsonl> \
    output_filename="result.jsonl" \
    asr_output_granularity=segment \
    asr.model_name="nvidia/canary-qwen-2.5b" \
    streaming.batch_size=64 \
    streaming.buffer_size=8.0 \
    streaming.chunk_size=2.0 \
    streaming.overlap_size=4.0 \
    streaming.merging_strategy=lcsubstr
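
For intuition on the lcsubstr strategy, here is a rough sketch of merging token sequences from successive buffers at their longest common substring. The function names echo, but are not guaranteed to match, the utilities in lcs_merge.py.

```python
# Hedged sketch of longest-common-substring token merging; the PR's
# lcs_merge.py utilities may differ in signature and tie-breaking.
from typing import List, Tuple


def longest_common_substring(a: List[int], b: List[int]) -> Tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of the longest run of tokens
    shared by `a` and `b`, via O(len(a) * len(b)) dynamic programming."""
    best_len, best_i, best_j = 0, 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_i, best_j = dp[i][j], i - dp[i][j], j - dp[i][j]
    return best_i, best_j, best_len


def merge_buffers(prev_tokens: List[int], new_tokens: List[int]) -> List[int]:
    """Keep `prev_tokens` through the shared region, then continue with `new_tokens`."""
    i, j, n = longest_common_substring(prev_tokens, new_tokens)
    if n == 0:  # no overlap found: fall back to plain concatenation
        return prev_tokens + new_tokens
    return prev_tokens[:i + n] + new_tokens[j + n:]
```

For example, merge_buffers([1, 2, 3, 4], [3, 4, 5, 6]) returns [1, 2, 3, 4, 5, 6]: the shared run [3, 4] is detected and the two sequences are stitched there.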

Experiments

Evaluations are conducted on the ASR HF Leaderboard datasets. The first row shows the performance of the offline model.
The model used is nvidia/canary-qwen-2.5b.

| Mode | Merging Algo | Buffer Size | Overlap Size | Avg. WER | AMI | Earnings22 | Giga | LS Clean | LS Other | SPGI | Tedlium | VoxPopuli |
|------|--------------|-------------|--------------|----------|-----|------------|------|----------|----------|------|---------|-----------|
| offline | - | - | - | 5.62% | 10.18% | 10.42% | 9.41% | 1.60% | 3.10% | 1.90% | 2.72% | 5.66% |
| buffered | LCSubstring | 4 | 2 | 6.70% | 10.91% | 11.64% | 9.92% | 2.87% | 4.78% | 3.49% | 3.34% | 6.61% |
| buffered | LCSubsequence | 4 | 2 | 10.14% | 13.06% | 15.43% | 12.96% | 6.92% | 9.08% | 6.72% | 6.38% | 10.59% |
| buffered | LCSubstring | 8 | 4 | 5.81% | 10.27% | 10.50% | 9.45% | 1.86% | 3.45% | 2.20% | 2.91% | 5.86% |
| buffered | LCSubsequence | 8 | 4 | 9.38% | 11.51% | 12.96% | 13.71% | 5.54% | 7.23% | 6.64% | 7.70% | 9.74% |
| buffered | LCSubstring | 8 | 1 | 7.73% | 10.93% | 12.25% | 11.35% | 3.50% | 5.30% | 4.22% | 5.03% | 9.29% |
| buffered | LCSubsequence | 8 | 1 | 8.89% | 11.29% | 13.31% | 12.43% | 4.62% | 6.24% | 5.60% | 6.43% | 11.21% |

Key takeaways:

  • For the same buffer and overlap settings, LCSubstring yields substantially lower WER than LCSubsequence
  • Overlap size and buffer size have a meaningful impact on performance
  • The configuration using LCSubstring with buffer size 8 and overlap size 4 provides the best trade-off between streaming capability and accuracy

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

naymaraq added 13 commits February 3, 2026 07:42
@github-actions github-actions bot added the ASR label Feb 6, 2026
@naymaraq naymaraq changed the title Add support for buffered SALM models Add buffered inference support for SALM models Feb 6, 2026
naymaraq and others added 4 commits February 7, 2026 00:25
@naymaraq naymaraq marked this pull request as ready for review February 6, 2026 20:27

Copilot AI left a comment


Pull request overview

Adds experimental buffered/streaming inference support for SALM ASR models by introducing an incremental audio buffering mechanism and token-merging utilities to stitch outputs across overlapping buffers.

Changes:

  • Added BufferedSALMPipeline plus minimal SALMStreamingState and factory wiring for ASRDecodingType.SALM.
  • Implemented incremental (per-stream + batched) audio buffering and LCS-based token merging (LCSubstring/LCSubsequence).
  • Added example config for buffered SALM streaming and unit tests for longest_common_substring().

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| nemo/collections/asr/inference/utils/lcs_merge.py | New LCS merge utilities for stitching token sequences across buffers. |
| tests/collections/asr/inference/test_lcs_merge.py | Unit tests for longest_common_substring(). |
| nemo/collections/asr/inference/utils/enums.py | Adds SALM decoding type and MergingStrategy enum. |
| nemo/collections/asr/inference/streaming/state/salm_state.py | Adds SALM-specific streaming state type (inherits generic StreamingState). |
| nemo/collections/asr/inference/streaming/buffering/incremental_audio_bufferer.py | New incremental audio bufferer (single + batched) for buffered SALM inference. |
| nemo/collections/asr/inference/pipelines/buffered_salm_pipeline.py | New buffered SALM pipeline using incremental buffering + LCS-based token merge. |
| nemo/collections/asr/inference/model_wrappers/salm_asr_inference_wrapper.py | New wrapper around SpeechLM2 SALM for inference/generation. |
| nemo/collections/asr/inference/factory/buffered_pipeline_builder.py | Wires buffered SALM into the buffered pipeline builder. |
| nemo/collections/asr/inference/factory/base_builder.py | Extends ASR model factory to construct SALM wrapper for buffered pipelines. |
| examples/asr/conf/asr_streaming_inference/buffered_salm.yaml | New example config for running buffered SALM streaming inference. |


@github-actions github-actions bot removed the Run CICD label Feb 6, 2026
        asr_class = RNNTInferenceWrapper
    case (ASRDecodingType.SALM, PipelineType.BUFFERED):
        asr_class = SALMASRInferenceWrapper
        # remove decoding_cfg, SALM AED does not use decoding_cfg yet

@pzelasko pzelasko Feb 7, 2026


Technically it can, but the structure will be different: it accepts a transformers.GenerationConfig passed into model.generate. Up to you if you think it makes sense to support that; we can also add this later if needed.

https://huggingface.co/docs/transformers/en/main_classes/text_generation
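
For reference, the Hugging Face pattern being described looks roughly like the sketch below; whether SALM's generate() forwards a generation_config this way is an assumption, so the actual call is left commented.

```python
# Sketch only: the HF GenerationConfig pattern referenced above. Whether the
# SALM wrapper forwards generation_config to model.generate is an assumption.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=128, num_beams=1, do_sample=False)
# tokens = salm_model.generate(prompts=prompts, generation_config=gen_cfg)  # hypothetical call
```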

Collaborator Author


Let's skip this for now


@pzelasko pzelasko left a comment


This is great work @naymaraq!
Does asr/inference have some documentation that we should extend with a description and an example of how to run this? Or do the docs still need to be created?

@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from Copilot AI Feb 8, 2026
naymaraq and others added 3 commits February 8, 2026 08:18

naymaraq commented Feb 8, 2026

> This is great work @naymaraq! Does asr/inference have some documentation that we should extend with a description and an example of how to run this? Or do the docs still need to be created?

We don't have docs yet; they still need to be created.


github-actions bot commented Feb 8, 2026

[🤖]: Hi @naymaraq 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc
