Add ALM Data Pipeline tutorial and stages #1419

sarahyurick merged 28 commits into NVIDIA-NeMo:main from mohammadaaftabv:alm_data_build
Conversation
Greptile Summary

This PR introduces the ALM (Audio Language Model) data curation pipeline for NeMo Curator, adding four new stages (ALMManifestReaderStage, ALMDataBuilderStage, ALMDataOverlapStage, ALMManifestWriterStage).

Key observations:
Confidence Score: 2/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant Driver
    participant FPS as FilePartitioningStage
    participant AMRS as ALMManifestReaderStage
    participant ALMB as ALMDataBuilderStage
    participant ALMO as ALMDataOverlapStage
    participant ALMW as ALMManifestWriterStage
    Driver->>FPS: EmptyTask (manifest_path)
    FPS-->>AMRS: FileGroupTask (list of .jsonl paths)
    AMRS-->>ALMB: AudioBatch per entry (1 entry each, fan-out)
    Note over AMRS,ALMB: One AudioBatch per JSONL line
    ALMB-->>ALMO: AudioBatch with windows[] + stats{}
    Note over ALMB,ALMO: Windows filtered by SR/BW/speakers/duration
    ALMO-->>ALMW: AudioBatch with filtered_windows[] + filtered_dur
    Note over ALMO,ALMW: Overlapping windows removed, closest to target kept
    ALMW-->>Driver: FileGroupTask (output .jsonl path)
```
ayushdg left a comment:

A few comments:
- Can you add a benchmarking script to benchmarks and share a representative dataset that can be used to run an ALM pipeline?
- You are already logging many statistics in the stages here; is it possible to also use `_log_metrics`, like done in some of the text stages, to log some of these timing metrics so that they can be tracked better to catch regressions?
https://github.com/mohammadaaftabv/Curator/tree/alm_data_build/tests/fixtures/audio/alm is the representative dataset. I am assuming that by benchmarks you mean the result of running both processors on the representative data; in that case, ALM data build should produce 181 windows based on the config in the test file, and ALM data overlap applied to those 181 windows, allowing at most 50% overlap, gives 3035.5 seconds of total output. All of this is covered in the test cases here.
Added `_log_metrics` calls to both stages, following the pattern in the text stages, to track these timing metrics.
```python
# Calculate statistics
# Stage 1 output: total_dur_list_window contains the original window count
stage1_windows = sum(len(e.get("total_dur_list_window", e.get("windows", []))) for e in output_entries)
```
I guess these make sense, but also take a look at `Task._metadata` and `Task._stage_perf_stats` to see if there is anything relevant there.
```python
self._drop_fields_set = {f.strip() for f in self.drop_fields.split(",") if f.strip()}
self._drop_fields_top_level_set = {f.strip() for f in self.drop_fields_top_level.split(",") if f.strip()}


def process_dataset_entry(self, data_entry: dict[str, Any]) -> list[AudioBatch]:
```
Is it intentional that we operate on a single manifest entry at a time? Can any of this be vectorized? Same question for the other stages.
Yes, this is intentional: it follows the LegacySpeechStage pattern used by all other audio stages (GetAudioDurationStage, PreserveByValueStage, etc.), where process() iterates over task.data and calls process_dataset_entry() per entry.
Parallelism is handled at the executor level instead. In benchmark testing (10,000 entries, XennaExecutor on 8-core i9-9900KF), the autoscaler allocated 4 workers to the Builder stage, achieving ~1,460 entries/sec aggregate throughput (365 entries/sec/worker) with 86% CPU utilization. The Overlap stage ran 3 workers at ~5,650 entries/sec. Full pipeline completed in 90s.
If we want batch-level optimization in the future, it would need to happen at the LegacySpeechStage base class level, which would affect all audio stages.
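As a rough illustration of the pattern described above, here is a minimal self-contained sketch (the class and field names other than process/process_dataset_entry are invented for illustration; this is not the real NeMo Curator base class):

```python
class LegacySpeechStageSketch:
    """Illustrative stand-in for the LegacySpeechStage base class."""

    def process_dataset_entry(self, entry: dict) -> list[dict]:
        # Subclasses implement only the per-entry transform.
        raise NotImplementedError

    def process(self, task_data: list[dict]) -> list[dict]:
        # The base class loops over the task's entries; parallelism comes
        # from the executor running many stage workers, not from batching here.
        out: list[dict] = []
        for entry in task_data:
            out.extend(self.process_dataset_entry(entry))
        return out


class DoubleDurationStage(LegacySpeechStageSketch):
    """Toy subclass: doubles each entry's duration field."""

    def process_dataset_entry(self, entry: dict) -> list[dict]:
        return [{**entry, "duration": entry["duration"] * 2}]
```

Batch-level vectorization would then live in `process()` of the base class, which is why it would affect every audio stage at once.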
```python
    }
)

return [AudioBatch(data=[result])]
```
Each time we return a Task, you must pass along its parent task's metadata and stage_perf_stats. In such a fan-out implementation this becomes hard to reason about.
Yeah, the `_stage_perf` stats are supposedly propagated via the base LegacySpeechStage. I would be curious to look at the benchmark results for this PR, though, to get an even better understanding of how the existing audio curation code can be refactored.
- Add benchmarking script (alm_pipeline_benchmark.py) with repeat-factor for scale testing, verified end-to-end in Docker (10K entries, 90s)
- Add alm-benchmark.yaml config for the benchmarking framework
- Reuse create_pipeline_from_yaml from nemo_curator.config.run (rename processors -> stages in YAML and tutorial)
- Add TaskPerfUtils per-stage stats to main.py output
- Remove unnecessary `from __future__ import annotations`
- Propagate parent Task._metadata and _stage_perf in both stages
- Refactor alm_data_builder: extract BuilderStats dataclass and 4 helper functions (_get_bandwidth, _compute_speaker_durations, _truncate_segment, _record_window_loss); remove noqa suppressions
- Refactor alm_data_overlap: extract 9 methods as module-level functions; the class now only contains process/entry/filter logic
- Restructure tests into class-based pattern (TestX + TestXIntegration)
- Move shared fixtures to tests/stages/audio/alm/conftest.py
- Update README with a benchmarking section including results and machine specs

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Move manifest I/O from the driver to a worker by adding ALMManifestReaderStage (ProcessingStage[_EmptyTask, AudioBatch]). This follows the Developers Guide recommendation to keep the driver lightweight, matching the pattern used by FilePartitioningStage and CreateInitialManifestFleursStage.

- New stage reads JSONL on the worker with fsspec (cloud-path compatible)
- Sets is_fanout_stage=True and num_workers_per_node=1 per the guide
- Pipeline YAML now has 3 stages: reader -> builder -> overlap
- main.py no longer loads the manifest on the driver or passes initial_tasks
- Removed load_manifest() and the AudioBatch import from main.py

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
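A simplified sketch of the per-line fan-out read described in that commit, using the stdlib instead of fsspec (the function name is illustrative; the real stage handles cloud paths like s3:// via fsspec):

```python
import json


def read_manifest_fanout(manifest_path: str) -> list[list[dict]]:
    """Read a JSONL manifest and emit one single-entry batch per line,
    mirroring the one-AudioBatch-per-JSONL-line fan-out in the diagram."""
    batches: list[list[dict]] = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the manifest
            batches.append([json.loads(line)])
    return batches
```

Because the read happens inside a worker stage rather than on the driver, the driver only submits an empty seed task and stays lightweight.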
…ord_window_loss

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Fix README: use stages.0.manifest_path instead of input_manifest, correct the stage indices (0=reader, 1=builder, 2=overlap), and update the expected output and benchmark results to match actual values
- Fix overlap stage: stop overwriting the windows field with filtered results so the builder window count is preserved (181 vs 25)
- Fix main.py: remove duplicate manual stats; use only per-stage metrics from TaskPerfUtils as the single source of truth
- Fix alm-benchmark.yaml: add the required model_weights_path field
- Fix builder: remove a redundant bandwidth check in the truncation block

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Refactor LegacySpeechStage.process() to propagate _metadata and _stage_perf, removing the process() overrides from the ALM stages
- Add ALMManifestWriterStage (single-worker, append-safe JSONL writer)
- Support list[str] manifest_path in ALMManifestReaderStage
- Promote manifest_path to a top-level pipeline.yaml argument
- Move the ALM benchmark into nightly-benchmark.yaml; delete alm-benchmark.yaml
- Move the soundfile dep to the pyproject.toml audio_cpu group; delete requirements.txt
- Empty tests/__init__.py, inline FIXTURE_PATH, remove test_default_values
- Update README install instructions and pipeline docs

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
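A minimal sketch of what an append-safe, single-writer JSONL stage might do (illustrative only, not the actual ALMManifestWriterStage implementation):

```python
import json


def append_manifest(entries: list[dict], output_path: str) -> str:
    """Append entries as JSON lines. Because the writer stage runs on a
    single worker, appends from successive tasks cannot interleave."""
    with open(output_path, "a", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    return output_path
```

Restricting the stage to one worker is what makes plain append mode safe here; with multiple writers, lines from concurrent tasks could be interleaved or lost.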
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…rk updates

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

Signed-off-by: V Mohammad Aaftab <aaftabv@nvidia.com>
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…tage

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…uppressions, /tmp usage

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Introduces _RepeatEntriesStage, which fans out entries in-memory after the manifest reader, avoiding redundant file I/O. Updates benchmark results accordingly.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Fix RUF003: replace the Unicode multiplication sign with ASCII x in test comments
- Fix I001: reorder imports in the benchmark script
- Auto-format 6 files with ruff format
- Merge upstream/main and regenerate uv.lock
- alm_data_overlap: use entry.copy() in the empty-windows path; add 4 missing fields
- alm_manifest_reader: fix the cumulative log to show the per-manifest count
- Update benchmark results

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
soundfile is only used by GetAudioDurationStage, not by LegacySpeechStage. Moving the import inside the method prevents a ModuleNotFoundError when Ray workers install the base package without the audio extras.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
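This is the standard deferred-import pattern; a sketch using the stdlib `wave` module as a stand-in for soundfile (class and field names are illustrative):

```python
class DurationStageSketch:
    """Importing this module never touches the audio dependency;
    only calling process_dataset_entry on a worker does."""

    def process_dataset_entry(self, entry: dict) -> dict:
        # Deferred import: resolved at call time on the worker, so an
        # environment without the audio extras can still import this module.
        import wave  # stand-in for soundfile

        with wave.open(entry["audio_filepath"], "rb") as f:
            entry["duration"] = f.getnframes() / f.getframerate()
        return entry
```

The trade-off is a tiny per-call import lookup (cached after the first call) in exchange for a much smaller mandatory dependency surface at module load.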
Pre-initialize Ray via a shared_ray_client fixture so that XennaExecutor's ray.init(ignore_reinit_error=True) becomes a no-op; workers then use the cluster default env, which has all extras installed. Also restores the soundfile import to module level in common.py.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…st_common.py

Ray auto-packages the working directory when connecting to a cluster, creating a fresh worker venv with only the base deps (no soundfile). Move `import soundfile` inside GetAudioDurationStage.process_dataset_entry so it is only imported at runtime, not at module load. Fix the mock path in test_common.py to patch soundfile.SoundFileError directly. Remove the ineffective shared_ray_client fixture from the integration test.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
/ok to test 64a0281
The multimodal_mint1t dataset and both multimodal_mint1t_xenna/materialize benchmark entries were accidentally removed during a rebase conflict. This restores them and keeps the new alm_pipeline_xenna entry appended at the end.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
/ok to test 111827a
_RepeatEntriesStage and ALMManifestReaderStage were passing _stage_perf by direct reference when creating multiple output tasks, causing all siblings to share the same list. Downstream stage perf entries would accumulate across all tasks, producing misleading metrics and bloated tasks.pkl files. Changed to list(task._stage_perf) to give each child an independent copy, matching LegacySpeechStage behavior.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
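The fix amounts to giving each fan-out child its own shallow copy of the perf list. A self-contained sketch with a minimal stand-in task class (not the real Curator Task):

```python
from dataclasses import dataclass, field


@dataclass
class TaskSketch:
    """Minimal stand-in for a Curator task carrying metadata and perf stats."""
    data: object
    _metadata: dict = field(default_factory=dict)
    _stage_perf: list = field(default_factory=list)


def fan_out(parent: TaskSketch, entries: list) -> list[TaskSketch]:
    # dict(...) / list(...) make shallow copies, so appending a perf entry
    # to one child no longer mutates its siblings (the bug described above).
    return [
        TaskSketch(
            data=e,
            _metadata=dict(parent._metadata),
            _stage_perf=list(parent._stage_perf),
        )
        for e in entries
    ]
```

With the buggy version (`_stage_perf=parent._stage_perf`), every downstream append would land in the single shared list carried by all siblings.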
/ok to test 8f6b28c
Add new NeMo Curator stages for ALM (Audio Language Model) data curation:

- ALMManifestReaderStage
- ALMDataBuilderStage
- ALMDataOverlapStage
- ALMManifestWriterStage
Add complete tutorial with:
Tested with sample data:
Description

Usage

# Add snippet demonstrating usage

Checklist