
Parallel chunking feature for RNNT and TDT models#15186

Open
nune-tadevosyan wants to merge 63 commits into NVIDIA-NeMo:main from nune-tadevosyan:parakeet_chunking

Conversation

Collaborator

@nune-tadevosyan nune-tadevosyan commented Dec 14, 2025

What does this PR do ?

Adds support for parallel chunking for all types of ASR models

Collection: ASR

Changelog

  • Added token_ids to the timestamps returned by RNNT and TDT models.
  • Provided token_sequence in CTC models so that timestamp tokens match the text tokens.
  • Added chunking functionality to LhotseSpeechToTextBpeDataset.
  • Changed TranscriptionMixin to provide general support for chunking.
  • Added tensor / NumPy array support for chunking.

Usage

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
transcript = asr_model.transcribe(["path/to/audio_file.wav"], enable_chunking=True)[0].text

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Comment thread nemo/collections/asr/data/audio_to_text_lhotse_prompted.py Fixed
Comment thread nemo/collections/asr/models/rnnt_models.py Fixed
Signed-off-by: Nune <ntadevosyan@nvidia.com>
@nithinraok
Member

/claude review

Comment thread nemo/collections/asr/parts/submodules/ctc_decoding.py Outdated
Comment on lines +1086 to +1090
    lang_id = 'en' if isinstance(self.tokenizer, tokenizers.AggregateTokenizer) else None
else:
    source_id = f'audio_{uuid.uuid4().int}'
    chunk_start = 0
    lang_id = 'en' if isinstance(self.tokenizer, tokenizers.AggregateTokenizer) else None
Contributor

Bug: lang_id is hardcoded to 'en' for AggregateTokenizer in both the pre-chunked tensor path and the fallback path. If the user is transcribing non-English audio (e.g. via source_lang='de'), this will cause the merge logic in merge_chunked_hypotheses to call tokenizer.text_to_ids(text, lang_id='en'), producing incorrect token IDs and potentially garbled merge results.

Consider propagating the actual language from the prompt/config (e.g. trcfg.prompt.get('source_lang', 'en')) instead of hardcoding 'en'.
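A minimal sketch of the suggested fix, assuming a `trcfg` object carrying an optional `prompt` dict (the names follow the review suggestion; the actual config structure in the PR may differ):

```python
# Hypothetical sketch: derive lang_id from the transcription config instead of
# hardcoding 'en'. `trcfg.prompt` follows the reviewer's suggestion and is an
# assumption about the config shape, not the PR's actual API.
def resolve_lang_id(trcfg, is_aggregate_tokenizer: bool):
    if not is_aggregate_tokenizer:
        return None  # lang_id only matters for aggregate tokenizers
    prompt = getattr(trcfg, "prompt", None) or {}
    # Fall back to 'en' only when the user supplied no source language.
    return prompt.get("source_lang", "en")

class Cfg:
    prompt = {"source_lang": "de"}

print(resolve_lang_id(Cfg(), True))  # de
```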

Member

@nithinraok nithinraok left a comment

There are two paths here:
using a filepath as input and using a tensor as input. Please make sure to test both flows robustly.

Some test cases need to be removed, reflecting the functions removed from chunking_utils.py.

return best_chunk_size


def chunk_waveform(
Member

With lhotse cut windows, I think we are no longer using this?

Collaborator Author

chunk_waveform is used in transcription.py when the audio is provided as a tensor.
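An illustrative sketch of what a chunk_waveform-style helper does, matching the `return chunks, chunk_lens, chunk_starts` shape quoted below; the function name and signature here are assumptions, not NeMo's actual API:

```python
# Illustrative sketch (not NeMo's chunk_waveform): split a 1-D signal into
# fixed-size chunks with a given overlap, recording each chunk's start sample.
def chunk_signal(signal, chunk_len, overlap):
    step = chunk_len - overlap
    chunks, starts = [], []
    start = 0
    while start < len(signal):
        chunks.append(signal[start:start + chunk_len])
        starts.append(start)
        if start + chunk_len >= len(signal):
            break  # last chunk reaches the end of the signal
        start += step
    return chunks, starts

chunks, starts = chunk_signal(list(range(10)), chunk_len=4, overlap=2)
print(starts)  # [0, 2, 4, 6]
```

The recorded start offsets are what later lets the merge step map chunk-local frame indices back to global timestamps.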

return chunks, chunk_lens, chunk_starts


def chunk_audio_sample(
Member

same, are we using this function now?

return char_timestamps


def merge_flat_chunk_hypotheses(
Member

same, are we using this?

Collaborator Author

Used in transcription.py

Comment on lines +661 to +670
merged_hypotheses: Target hypothesis to update with merged timestamps
hypotheses: List of hypotheses from different chunks
chunk_offsets: Frame offsets for each chunk
subsampling_factor: Subsampling factor of the encoder
window_stride: Time stride per frame in seconds
tokenizer: Tokenizer for text operations
merged_tokens: Token sequence after LCS merge
timestamps_type: Types of timestamps to include ('word', 'segment', 'all')
lang_id: Language ID for multilingual models
similarity_threshold: Threshold for word similarity matching (0.0-1.0)
Member

The docstring documents the parameters tokenizer, merged_tokens, and lang_id, which don't exist in the actual signature. The real parameters are merged_text, timestamps_type, and similarity_threshold.

Collaborator Author

Cleaned
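The docstring above describes a similarity_threshold (0.0-1.0) for word similarity matching. A hedged sketch of what such a check could look like, using difflib as a stand-in; the PR may compute similarity differently:

```python
import difflib

# Illustrative only: compare two words and accept the match when their
# similarity ratio reaches the threshold. difflib's ratio is one possible
# metric, not necessarily the one the PR uses.
def words_match(a: str, b: str, similarity_threshold: float = 0.8) -> bool:
    return difflib.SequenceMatcher(None, a, b).ratio() >= similarity_threshold

print(words_match("hello", "hello"))  # True
print(words_match("hello", "help"))   # False (ratio ~0.67 < 0.8)
```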

merged_tokens = lcs_alignment_merge_buffer(
    buffer=merged_tokens,
    data=data[: int(delay * 0.6)],  # only approximately 60% of the frames have corresponding tokens
Member

why this 60%?

Collaborator Author

delay here is the number of frames in the overlapping part; we want to check the tokens that were emitted in the overlapping segment. Approximately 60% of the frames output tokens, so the slice keeps roughly that many.
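The idea of merging token buffers across an overlap can be sketched with a simple suffix/prefix match; this is an illustration only, not NeMo's lcs_alignment_merge_buffer implementation:

```python
# Illustrative sketch: merge two token sequences decoded from overlapping
# chunks by finding the longest suffix of `buffer` that equals a prefix of
# the new chunk's tokens, then appending only the unmatched remainder.
def merge_overlapping_tokens(buffer, new_tokens):
    max_k = min(len(buffer), len(new_tokens))
    for k in range(max_k, 0, -1):
        if buffer[-k:] == new_tokens[:k]:
            return buffer + new_tokens[k:]  # drop the duplicated overlap
    return buffer + new_tokens  # no overlap found; plain concatenation

# Two chunks whose decodings share a 2-token overlap:
print(merge_overlapping_tokens([5, 9, 3, 7], [3, 7, 8, 1]))  # [5, 9, 3, 7, 8, 1]
```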

chunk_word_idx += 1
continue
break
else:
Member

Does this silently produce misleading timestamps?

Collaborator Author

This can only happen in rare cases, and only for a few words.

return hypotheses


def join_alignments(
Member

For these alignments, are we considering the overlap frames, or simply concatenating?

Collaborator Author

Simply concatenating.
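"Simply concatenating" per-chunk alignments can be sketched as shifting each chunk's frame indices by its start offset before joining; the function name and (frame, token) tuple shape here are hypothetical, not the actual join_alignments signature:

```python
# Hypothetical sketch of concatenating per-chunk alignments: each alignment is
# a list of (frame_index, token) pairs local to its chunk; shift frames by the
# chunk's global start offset, then concatenate in order.
def join_alignments_simple(chunk_alignments, chunk_offsets):
    joined = []
    for frames, offset in zip(chunk_alignments, chunk_offsets):
        joined.extend((frame + offset, token) for frame, token in frames)
    return joined

print(join_alignments_simple(
    [[(0, "he"), (3, "llo")], [(1, "wor"), (4, "ld")]],
    [0, 10],
))  # [(0, 'he'), (3, 'llo'), (11, 'wor'), (14, 'ld')]
```

Note that with this scheme, overlap frames are not deduplicated: tokens decoded twice in the overlapping region appear twice in the joined alignment.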

nune-tadevosyan and others added 3 commits March 23, 2026 10:50
Signed-off-by: Nune <ntadevosyan@nvidia.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: nune-tadevosyan <152167970+nune-tadevosyan@users.noreply.github.com>