
read manifest from s3 #15330

Open
karpnv wants to merge 8 commits into main from karpnv/sde_s3

Conversation

@karpnv (Collaborator)

@karpnv karpnv commented Jan 28, 2026

S3 support

Collection: ASR

Changelog

  • Use an input manifest stored in S3 object storage (e.g., s3://abc/sharded_manifests/manifest_0.jsonl, or a sharded pattern like s3://abc/sharded_manifests/manifest__OP_0..2047_CL_.jsonl)
  • Add --s3cfg, the path to the S3 credentials file plus a section name (e.g., ~/.s3cfg[default]). Set it to "" to disable S3 support; the default is "".
  • Add --tar-base-path, the S3 path to tarred audio files (e.g., s3://ASR/tarred/audio_0.tar or s3://ASR/tarred/audio__OP_0..2047_CL_.tar).
    When specified, audio_filepath values in the manifest are treated as member names within the tar archive.
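The S3 paths above split into a bucket and an object key. A minimal sketch of the kind of helper the PR's parse_s3_path implies (illustrative, not the actual implementation):

```python
# Illustrative sketch of splitting an s3:// URI into (bucket, key),
# as a helper like the PR's parse_s3_path would need to do.
def parse_s3_path(s3_path: str):
    if not s3_path.startswith("s3://"):
        raise ValueError(f"not an S3 path: {s3_path!r}")
    # Everything up to the first "/" after the scheme is the bucket;
    # the remainder is the object key.
    bucket, _, key = s3_path[len("s3://"):].partition("/")
    return bucket, key
```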

Usage

  • Example:
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/manifest_0.json --tar-base-path s3://abc/tarred/audio_0.tar --s3cfg ~/.s3cfg[default]
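For the sharded patterns, NeMo escapes brace expansion as _OP_ / _CL_ (i.e., manifest__OP_0..2047_CL_.jsonl stands for manifest_{0..2047}.jsonl). A hedged sketch of how such a pattern could be expanded into individual shard paths (helper name and exact behavior are illustrative):

```python
import re

# Illustrative: expand NeMo's escaped brace range _OP_a..b_CL_ into
# one path per shard index; non-sharded paths pass through unchanged.
def expand_sharded_path(path: str):
    m = re.search(r"_OP_(\d+)\.\.(\d+)_CL_", path)
    if m is None:
        return [path]
    lo, hi = int(m.group(1)), int(m.group(2))
    return [path[:m.start()] + str(i) + path[m.end():] for i in range(lo, hi + 1)]
```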

GitHub Actions CI

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

karpnv and others added 2 commits January 27, 2026 18:13
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
@karpnv karpnv marked this pull request as ready for review January 31, 2026 00:23
@karpnv karpnv requested a review from Jorjeous January 31, 2026 00:23
@karpnv karpnv requested a review from vsl9 January 31, 2026 00:31
@github-actions github-actions bot removed the Run CICD label Jan 31, 2026
@github-actions (Contributor)

[🤖]: Hi @karpnv 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

vsl9 (Collaborator) previously approved these changes Feb 2, 2026

@vsl9 vsl9 left a comment

Thanks, LGTM

global _tar_cache

# Check if we have this tar file cached
if tar_s3_path not in _tar_cache:
Member

Maybe there is a way to read just the audio without downloading the whole tar?
Downloading it all may cause huge delays.

Member

Yes, we can for webdataset; will do.
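The suggestion above (reading a single member without fetching the whole tar) can be sketched with an HTTP ranged GET, which S3's GetObject supports via the Range parameter. The (offset, size) pair would come from a tar index such as the DALI index files this tool parses; the client wiring and names here are illustrative:

```python
# Sketch: fetch one tar member by byte range instead of the whole archive.
def byte_range_header(offset: int, size: int) -> str:
    # HTTP Range is inclusive on both ends.
    return f"bytes={offset}-{offset + size - 1}"

def read_tar_member(s3_client, bucket: str, key: str, offset: int, size: int) -> bytes:
    # boto3's get_object accepts a Range argument for partial object reads,
    # so only `size` bytes travel over the network.
    resp = s3_client.get_object(Bucket=bucket, Key=key,
                                Range=byte_range_header(offset, size))
    return resp["Body"].read()
```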

global _s3_client
path, section = s3cfg.rsplit('[', 1)
s3_config = parse_s3cfg(path, section.rstrip(']'))
print("s3_config", s3_config)
Member

Move to logger?

Member

will be updated
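Both review points here (parsing the ~/.s3cfg[default] argument and logging instead of print) can be sketched together. The key names below (access_key, secret_key, host_base) are the standard s3cmd .s3cfg fields, but the function body is an illustrative reconstruction, not the PR's code:

```python
import configparser
import logging

logger = logging.getLogger(__name__)

def parse_s3cfg(path: str, section: str) -> dict:
    # Read an s3cmd-style config file and pull credentials from one section.
    cfg = configparser.ConfigParser()
    cfg.read(path)
    s3_config = {
        "aws_access_key_id": cfg.get(section, "access_key", fallback=None),
        "aws_secret_access_key": cfg.get(section, "secret_key", fallback=None),
        "endpoint_url": cfg.get(section, "host_base", fallback=None),
    }
    # Per the review: log at debug level instead of print(); avoid echoing
    # secret values into the log.
    logger.debug("loaded s3 config from %s [%s]", path, section)
    return s3_config
```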

"""
global _s3_client
bucket, key = parse_s3_path(s3_path)
try:
Member

This will fail if a local manifest contains an S3 path and nothing is set (e.g., no S3 config). Also, the actual error (missing S3 config) is masked by an AttributeError.

Member

will fix
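The fix this review asks for can be sketched as an explicit guard, so a run that references S3 without --s3cfg fails with a clear message rather than an AttributeError (function name is illustrative):

```python
# Sketch: fail loudly when a manifest references S3 but no client was
# configured, instead of letting the unset client raise AttributeError.
def ensure_s3_client(s3_client, s3_path: str):
    if s3_client is None:
        raise RuntimeError(
            f"manifest references {s3_path!r} but S3 support is disabled; "
            "pass --s3cfg to configure credentials"
        )
    return s3_client
```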

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
karpnv and others added 2 commits February 4, 2026 02:04
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updated logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Jorjeous previously approved these changes Feb 6, 2026
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
@Jorjeous (Member)

Jorjeous commented Feb 6, 2026

@karpnv please check again, now all LGTM

import operator
import os
import pickle
import tarfile

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'tarfile' is not used.

Copilot Autofix

AI 2 days ago

In general, an unused import should be removed to reduce unnecessary dependencies, speed up module loading slightly, and improve readability. The best fix here is to delete the import tarfile line from tools/speech_data_explorer/data_explorer.py.

Concretely, within tools/speech_data_explorer/data_explorer.py, at the top of the file where other standard-library modules are imported, remove the single line import tarfile (currently line 27). No additional methods, imports, or definitions are needed because we are only eliminating an unused symbol; there is no functional code depending on it.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -24,7 +24,6 @@
 import operator
 import os
 import pickle
-import tarfile
 from collections import defaultdict
 from os.path import expanduser
 from pathlib import Path
EOF
for line in lines[1:]: # Skip header line
parts = line.split()
if len(parts) >= 4:
file_type = parts[0]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable file_type is not used.

Copilot Autofix

AI 2 days ago

To fix an unused local variable, either (a) delete the variable assignment if it is unnecessary, or (b) if the field is read for documentation or future-proofing, rename it to follow an "unused" naming convention (for example _, unused_file_type, or _unused_file_type) so that both humans and tools understand that it is intentionally unused.

In this function, the file_type field is part of the documented DALI index format and may be useful for understanding the structure, even though it is not currently used in logic. The safest fix that does not change behavior is to keep reading the first field but rename the variable to something like _unused_file_type. This preserves any (current or future) expectation that the line has four fields, keeps the code self-documenting, and silences the static analysis warning.

Concretely, in tools/speech_data_explorer/data_explorer.py, within parse_dali_index, change line 303 from file_type = parts[0] to _unused_file_type = parts[0]. No additional imports or method changes are needed.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -300,7 +300,7 @@
     for line in lines[1:]:  # Skip header line
         parts = line.split()
         if len(parts) >= 4:
-            file_type = parts[0]
+            _unused_file_type = parts[0]
             offset = int(parts[1])
             size = int(parts[2])
             filename = parts[3]
EOF
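The index format the snippet above parses (a header line, then "type offset size filename" rows) can be sketched as a pure function, with the unused-variable fix applied. This is an illustrative reconstruction, not the file's actual parse_dali_index:

```python
# Illustrative: parse a DALI-style tar index mapping each member filename
# to its (offset, size) within the archive.
def parse_dali_index(text: str) -> dict:
    entries = {}
    for line in text.splitlines()[1:]:  # skip header line
        parts = line.split()
        if len(parts) >= 4:
            # parts[0] is the member type; intentionally unused here
            entries[parts[3]] = (int(parts[1]), int(parts[2]))
    return entries
```

The (offset, size) pairs produced here are exactly what a ranged S3 read would need to fetch one audio member without downloading the whole tar.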