
read manifest from s3 #15330

Open
karpnv wants to merge 8 commits into main from karpnv/sde_s3

Conversation

@karpnv (Collaborator)

@karpnv karpnv commented Jan 28, 2026

S3 support

Collection: ASR

Changelog

  • Use an input manifest stored in S3 object storage (e.g., s3://abc/sharded_manifests/manifest_0.jsonl, or a sharded pattern like s3://abc/sharded_manifests/manifest__OP_0..2047_CL_.jsonl)
  • Add --s3cfg, the path to the S3 credentials file plus a section name (e.g., ~/.s3cfg[default]). Set it to "" to disable S3 support; the default is "".
  • Add --tar-base-path, the S3 path to tarred audio files (e.g., s3://ASR/tarred/audio_0.tar or s3://ASR/tarred/audio__OP_0..2047_CL_.tar).
    When specified, audio_filepath values in the manifest are treated as member names within the tar archive.
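The S3 paths above split into a bucket and an object key. A minimal sketch of the kind of helper the PR's parse_s3_path implies (illustrative, not the actual implementation):

```python
# Illustrative sketch of splitting an s3:// URI into (bucket, key),
# as a helper like the PR's parse_s3_path would need to do.
def parse_s3_path(s3_path: str):
    if not s3_path.startswith("s3://"):
        raise ValueError(f"not an S3 path: {s3_path!r}")
    # Everything up to the first "/" after the scheme is the bucket;
    # the remainder is the object key.
    bucket, _, key = s3_path[len("s3://"):].partition("/")
    return bucket, key
```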

Usage

  • Example:
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/manifest_0.json --tar-base-path s3://abc/tarred/audio_0.tar --s3cfg ~/.s3cfg[default]
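For the sharded patterns, NeMo escapes brace expansion as _OP_ / _CL_ (i.e., manifest__OP_0..2047_CL_.jsonl stands for manifest_{0..2047}.jsonl). A hedged sketch of how such a pattern could be expanded into individual shard paths (helper name and exact behavior are illustrative):

```python
import re

# Illustrative: expand NeMo's escaped brace range _OP_a..b_CL_ into
# one path per shard index; non-sharded paths pass through unchanged.
def expand_sharded_path(path: str):
    m = re.search(r"_OP_(\d+)\.\.(\d+)_CL_", path)
    if m is None:
        return [path]
    lo, hi = int(m.group(1)), int(m.group(2))
    return [path[:m.start()] + str(i) + path[m.end():] for i in range(lo, hi + 1)]
```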

GitHub Actions CI

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

karpnv and others added 2 commits January 27, 2026 18:13
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
@karpnv karpnv marked this pull request as ready for review January 31, 2026 00:23
@karpnv karpnv requested a review from Jorjeous January 31, 2026 00:23
@karpnv karpnv requested a review from vsl9 January 31, 2026 00:31
@github-actions github-actions bot removed the Run CICD label Jan 31, 2026
@github-actions (Contributor)

[🤖]: Hi @karpnv 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

vsl9 (Collaborator) previously approved these changes Feb 2, 2026

@vsl9 vsl9 left a comment

Thanks, LGTM

global _tar_cache

# Check if we have this tar file cached
if tar_s3_path not in _tar_cache:
Member

Maybe there is a way to read just the audio without downloading the whole tar?
Downloading it all may cause huge delays.

Member

Yes, we can for webdataset; will do.
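The suggestion above (reading a single member without fetching the whole tar) can be sketched with an HTTP ranged GET, which S3's GetObject supports via the Range parameter. The (offset, size) pair would come from a tar index such as the DALI index files this tool parses; the client wiring and names here are illustrative:

```python
# Sketch: fetch one tar member by byte range instead of the whole archive.
def byte_range_header(offset: int, size: int) -> str:
    # HTTP Range is inclusive on both ends.
    return f"bytes={offset}-{offset + size - 1}"

def read_tar_member(s3_client, bucket: str, key: str, offset: int, size: int) -> bytes:
    # boto3's get_object accepts a Range argument for partial object reads,
    # so only `size` bytes travel over the network.
    resp = s3_client.get_object(Bucket=bucket, Key=key,
                                Range=byte_range_header(offset, size))
    return resp["Body"].read()
```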

global _s3_client
path, section = s3cfg.rsplit('[', 1)
s3_config = parse_s3cfg(path, section.rstrip(']'))
print("s3_config", s3_config)
Member

Move to logger?

Member

will be updated
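Both review points here (parsing the ~/.s3cfg[default] argument and logging instead of print) can be sketched together. The key names below (access_key, secret_key, host_base) are the standard s3cmd .s3cfg fields, but the function body is an illustrative reconstruction, not the PR's code:

```python
import configparser
import logging

logger = logging.getLogger(__name__)

def parse_s3cfg(path: str, section: str) -> dict:
    # Read an s3cmd-style config file and pull credentials from one section.
    cfg = configparser.ConfigParser()
    cfg.read(path)
    s3_config = {
        "aws_access_key_id": cfg.get(section, "access_key", fallback=None),
        "aws_secret_access_key": cfg.get(section, "secret_key", fallback=None),
        "endpoint_url": cfg.get(section, "host_base", fallback=None),
    }
    # Per the review: log at debug level instead of print(); avoid echoing
    # secret values into the log.
    logger.debug("loaded s3 config from %s [%s]", path, section)
    return s3_config
```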

"""
global _s3_client
bucket, key = parse_s3_path(s3_path)
try:
Member

This will fail if a local manifest contains an S3 path and nothing is set (e.g., no S3 config). Also, the actual error (missing S3 config) is masked by an AttributeError.

Member

will fix
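The fix this review asks for can be sketched as an explicit guard, so a run that references S3 without --s3cfg fails with a clear message rather than an AttributeError (function name is illustrative):

```python
# Sketch: fail loudly when a manifest references S3 but no client was
# configured, instead of letting the unset client raise AttributeError.
def ensure_s3_client(s3_client, s3_path: str):
    if s3_client is None:
        raise RuntimeError(
            f"manifest references {s3_path!r} but S3 support is disabled; "
            "pass --s3cfg to configure credentials"
        )
    return s3_client
```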

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
karpnv and others added 2 commits February 4, 2026 02:04
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updated logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Jorjeous previously approved these changes Feb 6, 2026
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
@Jorjeous (Member)

Jorjeous commented Feb 6, 2026

@karpnv please check again, now all LGTM

import operator
import os
import pickle
import tarfile

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'tarfile' is not used.

Copilot Autofix

AI 2 days ago

In general, an unused import should be removed to reduce unnecessary dependencies, speed up module loading slightly, and improve readability. The best fix here is to delete the import tarfile line from tools/speech_data_explorer/data_explorer.py.

Concretely, within tools/speech_data_explorer/data_explorer.py, at the top of the file where other standard-library modules are imported, remove the single line import tarfile (currently line 27). No additional methods, imports, or definitions are needed because we are only eliminating an unused symbol; there is no functional code depending on it.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -24,7 +24,6 @@
 import operator
 import os
 import pickle
-import tarfile
 from collections import defaultdict
 from os.path import expanduser
 from pathlib import Path
EOF
for line in lines[1:]: # Skip header line
parts = line.split()
if len(parts) >= 4:
file_type = parts[0]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable file_type is not used.

Copilot Autofix

AI 2 days ago

To fix an unused local variable, either (a) delete the variable assignment if it is unnecessary, or (b) if the field is read for documentation or future-proofing, rename it to follow an "unused" naming convention (for example _, unused_file_type, or _unused_file_type) so that both humans and tools understand that it is intentionally unused.

In this function, the file_type field is part of the documented DALI index format and may be useful for understanding the structure, even though it is not currently used in logic. The safest fix that does not change behavior is to keep reading the first field but rename the variable to something like _unused_file_type. This preserves any (current or future) expectation that the line has four fields, keeps the code self-documenting, and silences the static analysis warning.

Concretely, in tools/speech_data_explorer/data_explorer.py, within parse_dali_index, change line 303 from file_type = parts[0] to _unused_file_type = parts[0]. No additional imports or method changes are needed.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -300,7 +300,7 @@
     for line in lines[1:]:  # Skip header line
         parts = line.split()
         if len(parts) >= 4:
-            file_type = parts[0]
+            _unused_file_type = parts[0]
             offset = int(parts[1])
             size = int(parts[2])
             filename = parts[3]
EOF
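The index format the snippet above parses (a header line, then "type offset size filename" rows) can be sketched as a pure function, with the unused-variable fix applied. This is an illustrative reconstruction, not the file's actual parse_dali_index:

```python
# Illustrative: parse a DALI-style tar index mapping each member filename
# to its (offset, size) within the archive.
def parse_dali_index(text: str) -> dict:
    entries = {}
    for line in text.splitlines()[1:]:  # skip header line
        parts = line.split()
        if len(parts) >= 4:
            # parts[0] is the member type; intentionally unused here
            entries[parts[3]] = (int(parts[1]), int(parts[2]))
    return entries
```

The (offset, size) pairs produced here are exactly what a ranged S3 read would need to fetch one audio member without downloading the whole tar.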