Parallel chunking feature for RNNT and TDT models #15186
nune-tadevosyan wants to merge 63 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Nune <ntadevosyan@nvidia.com>
/claude review
lang_id = 'en' if isinstance(self.tokenizer, tokenizers.AggregateTokenizer) else None
else:
    source_id = f'audio_{uuid.uuid4().int}'
    chunk_start = 0
    lang_id = 'en' if isinstance(self.tokenizer, tokenizers.AggregateTokenizer) else None
Bug: lang_id is hardcoded to 'en' for AggregateTokenizer in both the pre-chunked tensor path and the fallback path. If the user is transcribing non-English audio (e.g. via source_lang='de'), this will cause the merge logic in merge_chunked_hypotheses to call tokenizer.text_to_ids(text, lang_id='en'), producing incorrect token IDs and potentially garbled merge results.
Consider propagating the actual language from the prompt/config (e.g. trcfg.prompt.get('source_lang', 'en')) instead of hardcoding 'en'.
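As a hedged sketch of the suggested fix, the language could be resolved from the user's prompt config rather than hardcoded. The helper name `resolve_lang_id` and reading the language from a plain dict via `'source_lang'` are illustrative assumptions; the real fix would read from the PR's `trcfg.prompt`.

```python
from typing import Optional

# Hypothetical helper: pick the language ID for an aggregate tokenizer from
# the user's prompt config, falling back to 'en' only when none was given.
# The function name and dict-based prompt are assumptions for illustration.
def resolve_lang_id(prompt: Optional[dict], is_aggregate_tokenizer: bool) -> Optional[str]:
    """Return a language ID for aggregate tokenizers, else None."""
    if not is_aggregate_tokenizer:
        return None
    if prompt and 'source_lang' in prompt:
        return prompt['source_lang']
    return 'en'  # fallback only when the user supplied no source_lang
```

With this shape, `source_lang='de'` would flow through to `merge_chunked_hypotheses` instead of being overwritten by `'en'`.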
nithinraok left a comment
There are two flows here: using a filepath and using a tensor as input. Please make sure to robustly test both flows.
Some test cases need to be removed, based on the functions removed from chunking_utils.py.
return best_chunk_size

def chunk_waveform(
With lhotse cut windows, I think we are no longer using this?
chunk_waveform is used in transcription.py when audio is provided as a tensor.
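As a minimal sketch of what a tensor-input chunking helper does, assuming a fixed chunk length with a fixed overlap (the actual PR operates on torch tensors; NumPy is used here only to keep the sketch self-contained):

```python
import numpy as np

# Hedged sketch, not the PR's actual implementation: split a 1-D waveform
# into fixed-size chunks that overlap by `overlap` samples, returning the
# same triple the diff above suggests (chunks, chunk_lens, chunk_starts).
def chunk_waveform(wave: np.ndarray, chunk_len: int, overlap: int):
    """Split `wave` into overlapping chunks of up to `chunk_len` samples."""
    step = chunk_len - overlap
    chunks, chunk_lens, chunk_starts = [], [], []
    for start in range(0, max(len(wave) - overlap, 1), step):
        piece = wave[start : start + chunk_len]
        chunks.append(piece)
        chunk_lens.append(len(piece))
        chunk_starts.append(start)
    return chunks, chunk_lens, chunk_starts
```

The overlap region is what the downstream LCS merge uses to deduplicate tokens at chunk boundaries.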
return chunks, chunk_lens, chunk_starts

def chunk_audio_sample(
Same question: are we still using this function?
return char_timestamps

def merge_flat_chunk_hypotheses(
Used in transcription.py
merged_hypotheses: Target hypothesis to update with merged timestamps
hypotheses: List of hypotheses from different chunks
chunk_offsets: Frame offsets for each chunk
subsampling_factor: Subsampling factor of the encoder
window_stride: Time stride per frame in seconds
tokenizer: Tokenizer for text operations
merged_tokens: Token sequence after LCS merge
timestamps_type: Types of timestamps to include ('word', 'segment', 'all')
lang_id: Language ID for multilingual models
similarity_threshold: Threshold for word similarity matching (0.0-1.0)
The docstring documents params tokenizer, merged_tokens, and lang_id that don't exist in the actual signature. The real params are merged_text, timestamps_type, and similarity_threshold.
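A hedged sketch of a docstring aligned with the signature the review describes; the parameter order, defaults, and stub name are assumptions, not the PR's actual code.

```python
# Illustrative stub only: shows a docstring whose Args match the reviewed
# signature (merged_text, timestamps_type, similarity_threshold) instead of
# the nonexistent tokenizer/merged_tokens/lang_id params.
def merge_flat_chunk_hypotheses_stub(
    merged_hypotheses,
    hypotheses,
    chunk_offsets,
    subsampling_factor,
    window_stride,
    merged_text,
    timestamps_type='all',
    similarity_threshold=0.8,
):
    """Merge per-chunk timestamps into `merged_hypotheses`.

    Args:
        merged_hypotheses: Target hypothesis to update with merged timestamps
        hypotheses: List of hypotheses from different chunks
        chunk_offsets: Frame offsets for each chunk
        subsampling_factor: Subsampling factor of the encoder
        window_stride: Time stride per frame in seconds
        merged_text: Text after the LCS merge
        timestamps_type: Types of timestamps to include ('word', 'segment', 'all')
        similarity_threshold: Threshold for word similarity matching (0.0-1.0)
    """
```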
merged_tokens = lcs_alignment_merge_buffer(
    buffer=merged_tokens,
-   data=data[: int(delay * 0.6)],  # only approximately 60% of the tokens are non blank
+   data=data[: int(delay * 0.6)],  # only approximately 60% of the frames have corresponding tokens
delay is the number of frames in the overlapping part here; we want to check the tokens that fall within the overlapping segment. Approximately 60% of the frames output tokens.
| chunk_word_idx += 1 | ||
| continue | ||
| break | ||
| else: |
There was a problem hiding this comment.
Does this silently produce misleading timestamps?
There was a problem hiding this comment.
This can happen in rare cases and for few words.
| return hypotheses | ||
|
|
||
|
|
||
| def join_alignments( |
There was a problem hiding this comment.
In this alignments are we considering overlap frames or simply concatenated?
There was a problem hiding this comment.
Simply concatenating
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: nune-tadevosyan <152167970+nune-tadevosyan@users.noreply.github.com>
What does this PR do ?
Adds support for parallel chunking for all types of ASR models
Collection: [Note which collection this PR will affect]
Changelog
LhotseSpeechToTextBpeDatasetUsage
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information