Add the cstr-vctk dataset#30
Merged
Merged
Conversation
Contributor
Reviewer's GuideAdds a new cstr-vctk dataset converter package that downloads, extracts, converts, and publishes the CSTR VCTK Corpus 0.92 as an audb database version 1.0.0, including detailed speaker/file metadata and project scaffolding. Sequence diagram for publishing_cstr_vctk_dataset_via_helper_scriptssequenceDiagram
actor User
participant DownloadScript as download_from_web_sh
participant ExtractScript as extract_from_web_sh
participant ConvertScript as convert_py
participant PublishScript as publish_py
participant Datashare as Edinburgh_datashare
participant LocalFS as Local_file_system
participant AudbRepo as audb_public_S3_repository
User ->> DownloadScript: run_script
DownloadScript ->> Datashare: HTTP_GET VCTK_Corpus_0_92_zip
Datashare -->> DownloadScript: ZIP_payload
DownloadScript ->> LocalFS: save_zip_and_license_to_data_raw
User ->> ExtractScript: run_script
ExtractScript ->> LocalFS: read_zip_from_data_raw
ExtractScript ->> LocalFS: unzip_to_data_extracted
User ->> ConvertScript: python_convert_py
ConvertScript ->> LocalFS: read_speaker_info_and_transcripts
ConvertScript ->> LocalFS: flatten_audio_files_to_build_flac
ConvertScript ->> LocalFS: write_audformat_db_to_build
User ->> PublishScript: python_publish_py
PublishScript ->> LocalFS: load_database_from_build
PublishScript ->> AudbRepo: audb_publish_database_version_1_0_0
AudbRepo -->> PublishScript: publish_success
PublishScript -->> User: report_published_version
Flow diagram for convert_py_cstr_vctk_database_constructionflowchart TD
Start([Start_main])
SetPaths["Set_input_output_paths_and_directories"]
LoadSpeakerInfo["load_speaker_info_from_speaker_info_txt"]
NormalizeGender["Map_gender_codes_and_set_language_column"]
DefineDb["Create_audformat_Database_and_schemes"]
FlattenFiles["flatten_files_copy_or_validate_audio_into_build_flac"]
CheckOutDir{Is_build_flac_already_populated}
MakeRelPaths["Strip_db_root_prefix_to_get_relative_file_paths"]
CreateFileTable["Create_filewise_Table_with_file_index"]
BuildFilesDF["build_files_dataframe_join_speaker_metadata_and_transcripts"]
AssignColumns["Assign_column_schemes_and_set_values_from_df_files"]
SaveDb["db_save_to_build_directory"]
End([End_main])
Start --> SetPaths --> LoadSpeakerInfo --> NormalizeGender --> DefineDb --> FlattenFiles
FlattenFiles --> CheckOutDir
CheckOutDir -->|yes| MakeRelPaths
CheckOutDir -->|no| MakeRelPaths
MakeRelPaths --> CreateFileTable --> BuildFilesDF --> AssignColumns --> SaveDb --> End
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- Consider wrapping the
main()invocation in aif __name__ == "__main__":guard so the conversion logic doesn’t run on import (e.g., when used as a library or during tooling/inspection). - The various paths in
main()(e.g.,data/,build,VCTK-Corpus-0.92) are currently hard-coded; exposing them as arguments or environment-configurable values would make the converter more reusable and easier to run in different environments. - In
flatten_files(), usingfilecmp.cmp(..., shallow=True)may miss content differences if timestamps and sizes match; if the goal is strict validation of existing flattened files, considershallow=Falseor a content-based hash check.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider wrapping the `main()` invocation in a `if __name__ == "__main__":` guard so the conversion logic doesn’t run on import (e.g., when used as a library or during tooling/inspection).
- The various paths in `main()` (e.g., `data/`, `build`, `VCTK-Corpus-0.92`) are currently hard-coded; exposing them as arguments or environment-configurable values would make the converter more reusable and easier to run in different environments.
- In `flatten_files()`, using `filecmp.cmp(..., shallow=True)` may miss content differences if timestamps and sizes match; if the goal is strict validation of existing flattened files, consider `shallow=False` or a content-based hash check.
## Individual Comments
### Comment 1
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="38-39" />
<code_context>
+ line = raw.rstrip()
+ if not line:
+ continue
+ # Detach optional parenthetical comment at line end
+ m = re.search(r"\(.*\)\s*$", line)
+ comment = m.group(0).strip() if m else ""
+ line_clean = line[: m.start()].rstrip() if m else line
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Make the trailing comment regex non-greedy to avoid swallowing earlier parentheses in the line.
`r"\(.*\)\s*$"` is greedy and will match from the first `(` to the last `)` on the line, so any earlier parentheses (e.g., in region/accent fields) will be treated as part of the comment. Use a pattern constrained to the final parenthetical block, e.g. `r"\([^()]*\)\s*$"` or a non-greedy `r"\(.*?\)\s*$"`, so only the trailing comment is matched.
```suggestion
# Detach optional parenthetical comment at line end
# Use a non-greedy / constrained pattern so only the final parenthetical block is treated as comment
m = re.search(r"\([^()]*\)\s*$", line)
```
</issue_to_address>
### Comment 2
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="400-5" />
<code_context>
+ db[table_id]["microphone"] = audformat.Column(scheme_id="microphone")
+ db[table_id]["microphone"].set(df_files["microphone"])
+
+ db.save(db_root)
+
+
+main()
</code_context>
<issue_to_address>
**issue (bug_risk):** Avoid executing the conversion pipeline on import by guarding the main entry point.
Calling `main()` at module level means simply importing this file will immediately read data, copy files, and write to the database, which is unsafe and makes the module hard to reuse. Please wrap the `main()` call in an `if __name__ == "__main__":` guard so these side effects only occur on direct CLI execution, not on import.
</issue_to_address>
### Comment 3
<location path="datasets/cstr-vctk/README.md" line_range="1" />
<code_context>
+Use [`download-from_web.sh`](1.0.0/download-from_web.sh) and [`extract-from_web.sh`](1.0.0/extract-from_web.sh) to download and extract the dataset from the web.
+
+> Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
+> "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
+> The Centre for Speech Technology Research (CSTR),
+> University of Edinburgh
</code_context>
<issue_to_address>
**question (typo):** Dataset title wording differs between the link text and the citation.
The first sentence uses `for Centre for Speech Technology Voice Cloning Toolkit`, while the citation uses `for CSTR Voice Cloning Toolkit`. Please align these so the dataset title is consistent with the official wording.
```suggestion
Publish the [`CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit`](https://datashare.ed.ac.uk/handle/10283/3443) dataset.
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Contributor
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
pyproject.tomltherequires-python = ">=3.13"constraint is very restrictive compared to typical deployment environments; consider lowering it to the minimum version actually required by your dependencies and code (e.g., matching the rest of the project). - In
flatten_files()you usefilecmp.cmp(..., shallow=True)for validation; switching toshallow=Falsewould ensure content-level equality instead of relying on metadata, which is safer when verifying copied audio files. - When normalizing paths for
idx_files, usingos.path.relpath(path, db_root)instead ofstr.removeprefix(db_root + os.sep)would be more robust across platforms and avoid assumptions about how the prefix appears in the string.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `pyproject.toml` the `requires-python = ">=3.13"` constraint is very restrictive compared to typical deployment environments; consider lowering it to the minimum version actually required by your dependencies and code (e.g., matching the rest of the project).
- In `flatten_files()` you use `filecmp.cmp(..., shallow=True)` for validation; switching to `shallow=False` would ensure content-level equality instead of relying on metadata, which is safer when verifying copied audio files.
- When normalizing paths for `idx_files`, using `os.path.relpath(path, db_root)` instead of `str.removeprefix(db_root + os.sep)` would be more robust across platforms and avoid assumptions about how the prefix appears in the string.
## Individual Comments
### Comment 1
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="38-39" />
<code_context>
+ line = raw.rstrip()
+ if not line:
+ continue
+ # Detach optional parenthetical comment at line end
+ m = re.search(r"\(.*\)\s*$", line)
+ comment = m.group(0).strip() if m else ""
+ line_clean = line[: m.start()].rstrip() if m else line
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The greedy comment regex may accidentally strip content if earlier parentheses appear in the line
`r"\(.*\)\s*$"` is greedy and will match from the first `(` in the line to the end. If any non-comment content ever includes parentheses, everything from that first `(` onward will be discarded. To avoid this, restrict the match to a single parenthetical group at the end, e.g. `r"\([^()]*\)\s*$"`, so only a trailing comment block is removed.
```suggestion
# Detach optional parenthetical comment at line end
# Use a non-greedy, single-group match so only a trailing parenthetical block is removed
m = re.search(r"\([^()]*\)\s*$", line)
```
</issue_to_address>
### Comment 2
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="140-147" />
<code_context>
+ return df
+
+
+def flatten_files(dir_in: str, dir_out: str) -> pd.Index:
+ """
+ Extract all files from the speaker-based subdirectories and put them all in
+ a plain directory. Check if there are colliding file names.
+ If the output directory already exists: validate that the therein
+ contained, flattened-out files are matching the ones in the input directory.
+
+ Returns a Pandas Index with the relative file paths.
+ """
+ files = []
</code_context>
<issue_to_address>
**nitpick:** Docstring claims to return relative paths, but the function returns absolute paths
The function currently appends absolute `target_file` paths to `files` and returns them as a `pd.Index`, while they are only made relative later in `main()` via `str.removeprefix`. Please either update the docstring to state that absolute paths are returned, or adjust the function to strip `dir_out` before returning so the behavior matches the current docstring.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The initial release of the CSTR's VCTK Corpus (Centre for Speech Technology Voice Cloning Toolkit).
In the dataset itself, no splits are defined, so this initial release just takes the dataset as is and publishes that initial version.
Summary by Sourcery
Add infrastructure to ingest and publish the CSTR VCTK Corpus as an audb dataset.
New Features:
Enhancements:
Documentation:
Chores: