Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
[🤖]: Hi @karpnv 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully, so it might be time to merge this PR or get some approvals.
    global _tar_cache

    # Check if we have this tar file cached
    if tar_s3_path not in _tar_cache:
Maybe there is a way to read just the audio we need, without downloading the whole tar?
As written, this may cause huge delays.
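One possible direction for the comment above, as a minimal sketch rather than the PR's implementation: if the byte offset and size of each audio file inside the tar are known (for example from an index file), only that byte range needs to be fetched. The helper names below (`build_member_index`, `read_range`, `http_range_header`, `make_demo_tar`) are hypothetical, and an in-memory tar stands in for the S3 object so the example runs without network access.

```python
import io
import tarfile


def make_demo_tar() -> bytes:
    """Build a small uncompressed tar in memory (stand-in for an S3 object)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("a.wav", b"AAAA"), ("b.wav", b"BBBBBB")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def build_member_index(tar_bytes: bytes) -> dict:
    """Map member name -> (data offset, size) for regular files in the tar."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is where the file's bytes start, after its header.
                index[member.name] = (member.offset_data, member.size)
    return index


def http_range_header(offset: int, size: int) -> str:
    """Format an inclusive HTTP byte range covering `size` bytes at `offset`."""
    return f"bytes={offset}-{offset + size - 1}"


def read_range(blob: bytes, offset: int, size: int) -> bytes:
    """Local stand-in for a ranged GET against remote storage."""
    return blob[offset:offset + size]


blob = make_demo_tar()
index = build_member_index(blob)
offset, size = index["b.wav"]
audio_bytes = read_range(blob, offset, size)
```

Against S3, `read_range` would be replaced by a ranged request using the header string from `http_range_header`. That only works for uncompressed tars, and it assumes the index offsets point at the file data rather than at the tar header.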
    global _s3_client
    path, section = s3cfg.rsplit('[', 1)
    s3_config = parse_s3cfg(path, section.rstrip(']'))
    print("s3_config", s3_config)
    """
    global _s3_client
    bucket, key = parse_s3_path(s3_path)
    try:
This will fail if a local manifest contains an S3 path and nothing is set up (S3 config, etc.). Also, the actual error (missing S3 config) is masked by an AttributeError.
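A minimal sketch of the kind of guard this comment asks for, assuming the module keeps a global `_s3_client` that stays `None` until `--s3cfg` is processed. `S3NotConfiguredError` and `require_s3_client` are hypothetical names and are not taken from the PR.

```python
class S3NotConfiguredError(RuntimeError):
    """Raised when an s3:// path is requested but no S3 client was set up."""


# Assumption: set to a real client object elsewhere when --s3cfg is provided.
_s3_client = None


def is_s3_path(path: str) -> bool:
    """Return True for paths that use the s3:// scheme."""
    return path.startswith("s3://")


def require_s3_client(s3_path: str):
    """Return the configured client, or fail with an explicit message."""
    if _s3_client is None:
        raise S3NotConfiguredError(
            f"Manifest references {s3_path!r}, but S3 support is not configured. "
            "Pass --s3cfg (for example ~/.s3cfg[default]) to enable it."
        )
    return _s3_client
```

Checking for the missing client up front means the user sees the configuration problem directly, instead of an AttributeError raised later from a `None` client.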
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updated logging system
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
@karpnv please check again, now all LGTM
import operator
import os
import pickle
import tarfile
Check notice
Code scanning / CodeQL
Unused import (Note)
Copilot Autofix (AI, 2 days ago)
In general, an unused import should be removed to reduce unnecessary dependencies, speed up module loading slightly, and improve readability. The best fix here is to delete the `import tarfile` line from `tools/speech_data_explorer/data_explorer.py`.
Concretely, within `tools/speech_data_explorer/data_explorer.py`, at the top of the file where other standard-library modules are imported, remove the single line `import tarfile` (currently line 27). No additional methods, imports, or definitions are needed because we are only eliminating an unused symbol; there is no functional code depending on it.
@@ -24,7 +24,6 @@
 import operator
 import os
 import pickle
-import tarfile
 from collections import defaultdict
 from os.path import expanduser
 from pathlib import Path
    for line in lines[1:]:  # Skip header line
        parts = line.split()
        if len(parts) >= 4:
            file_type = parts[0]
Check notice
Code scanning / CodeQL
Unused local variable (Note)
Copilot Autofix (AI, 2 days ago)
To fix an unused local variable, either (a) delete the variable assignment if it is unnecessary, or (b) if the field is read for documentation or future-proofing, rename it to follow an "unused" naming convention (for example `_`, `unused_file_type`, or `_unused_file_type`) so that both humans and tools understand that it is intentionally unused.
In this function, the `file_type` field is part of the documented DALI index format and may be useful for understanding the structure, even though it is not currently used in logic. The safest fix that does not change behavior is to keep reading the first field but rename the variable to something like `_unused_file_type`. This preserves any (current or future) expectation that the line has four fields, keeps the code self-documenting, and silences the static analysis warning.
Concretely, in `tools/speech_data_explorer/data_explorer.py`, within `parse_dali_index`, change line 303 from `file_type = parts[0]` to `_unused_file_type = parts[0]`. No additional imports or method changes are needed.
@@ -300,7 +300,7 @@
     for line in lines[1:]:  # Skip header line
         parts = line.split()
         if len(parts) >= 4:
-            file_type = parts[0]
+            _unused_file_type = parts[0]
             offset = int(parts[1])
             size = int(parts[2])
             filename = parts[3]
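For context, the loop in the hunk above can be read as a small standalone parser. The sketch below is an illustration under assumptions, not the PR's `parse_dali_index`: the function name `parse_index_text`, the returned mapping, and the sample index text (including its header line) are made up for the example.

```python
from typing import Dict, Tuple


def parse_index_text(text: str) -> Dict[str, Tuple[int, int]]:
    """Map filename -> (offset, size) from whitespace-separated index lines."""
    lines = text.splitlines()
    entries: Dict[str, Tuple[int, int]] = {}
    for line in lines[1:]:  # Skip header line
        parts = line.split()
        if len(parts) >= 4:
            # parts[0] is the file type; it is read past but not used here.
            offset = int(parts[1])
            size = int(parts[2])
            filename = parts[3]
            entries[filename] = (offset, size)
    return entries


# Hypothetical index content; real index files may differ in header and fields.
sample = "v1.2 2\nwav 1536 32000 utt1.wav\ntxt 34048 12 utt1.txt\nbroken line\n"
index = parse_index_text(sample)
```

Lines with fewer than four fields are skipped silently, which mirrors the `len(parts) >= 4` check in the diff.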
S3 support
Collection: ASR
Changelog
- `--s3cfg`: Example: `~/.s3cfg[default]`. Set to "" to disable S3 support. Default is "".
- `--tar-base-path`: (e.g., `s3://ASR/tarred/audio_0.tar` or `s3://ASR/tarred/audio__OP_0..2047_CL_.tar`). When specified, `audio_filepath` values in the manifest are treated as filenames within this tar archive.
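To illustrate the sharded form of `--tar-base-path`, here is a minimal sketch of expanding a pattern such as `audio__OP_0..2047_CL_.tar` into individual tar paths. It assumes `_OP_` and `_CL_` stand in for `{` and `}` of a brace range and that the range is inclusive. The helper name `expand_tar_paths` is hypothetical, and zero-padded ranges are not handled.

```python
import re
from typing import List

# Matches either "{start..end}" or its "_OP_start..end_CL_" spelling.
_RANGE = re.compile(r"(?:_OP_|\{)(\d+)\.\.(\d+)(?:_CL_|\})")


def expand_tar_paths(pattern: str) -> List[str]:
    """Expand the first numeric range in `pattern` into one path per shard."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # Plain path, nothing to expand.
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


paths = expand_tar_paths("s3://ASR/tarred/audio__OP_0..2047_CL_.tar")
```

A path without a range, such as `s3://ASR/tarred/audio_0.tar`, is returned unchanged as a single-element list.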
Usage
python tools/speech_data_explorer/data_explorer.py s3://abc/sharded_manifests/manifest_0.json --tar-base-path s3://abc/tarred/audio_0.tar --s3cfg ~/.s3cfg[default]
GitHub Actions CI
PR Type: