Skip to content

Add the cstr-vctk dataset#30

Merged
Pascal-H merged 3 commits into
mainfrom
publish-cstr-vctk
Mar 12, 2026
Merged

Add the cstr-vctk dataset#30
Pascal-H merged 3 commits into
mainfrom
publish-cstr-vctk

Conversation

@Pascal-H
Copy link
Copy Markdown
Member

@Pascal-H Pascal-H commented Mar 12, 2026

The initial release of the CSTR's VCTK Corpus (Centre for Speech Technology Voice Cloning Toolkit).
In the dataset itself, no splits are defined, so this initial release just takes the dataset as is and publishes that initial version.

Summary by Sourcery

Add infrastructure to ingest and publish the CSTR VCTK Corpus as an audb dataset.

New Features:

  • Introduce a converter script that builds an audformat database for the CSTR VCTK Corpus with speaker and file-level metadata.
  • Add download and extraction scripts to retrieve and unpack the VCTK-Corpus-0.92 release from the web.
  • Provide a publish script to upload the prepared CSTR VCTK dataset to the audb-public repository.

Enhancements:

  • Define project metadata and dependencies for the cstr-vctk dataset converter in a dedicated pyproject file.

Documentation:

  • Document the CSTR VCTK Corpus dataset integration and usage of helper scripts in a dataset-specific README.
  • Record the initial dataset release in a changelog entry.

Chores:

  • Add boilerplate files such as license, lockfile, and gitignore for the cstr-vctk dataset package.

@Pascal-H Pascal-H self-assigned this Mar 12, 2026
@Pascal-H Pascal-H added the enhancement New feature or request label Mar 12, 2026
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented Mar 12, 2026

Reviewer's Guide

Adds a new cstr-vctk dataset converter package that downloads, extracts, converts, and publishes the CSTR VCTK Corpus 0.92 as an audb database version 1.0.0, including detailed speaker/file metadata and project scaffolding.

Sequence diagram for publishing_cstr_vctk_dataset_via_helper_scripts

sequenceDiagram
  actor User
  participant DownloadScript as download_from_web_sh
  participant ExtractScript as extract_from_web_sh
  participant ConvertScript as convert_py
  participant PublishScript as publish_py
  participant Datashare as Edinburgh_datashare
  participant LocalFS as Local_file_system
  participant AudbRepo as audb_public_S3_repository

  User ->> DownloadScript: run_script
  DownloadScript ->> Datashare: HTTP_GET VCTK_Corpus_0_92_zip
  Datashare -->> DownloadScript: ZIP_payload
  DownloadScript ->> LocalFS: save_zip_and_license_to_data_raw

  User ->> ExtractScript: run_script
  ExtractScript ->> LocalFS: read_zip_from_data_raw
  ExtractScript ->> LocalFS: unzip_to_data_extracted

  User ->> ConvertScript: python_convert_py
  ConvertScript ->> LocalFS: read_speaker_info_and_transcripts
  ConvertScript ->> LocalFS: flatten_audio_files_to_build_flac
  ConvertScript ->> LocalFS: write_audformat_db_to_build

  User ->> PublishScript: python_publish_py
  PublishScript ->> LocalFS: load_database_from_build
  PublishScript ->> AudbRepo: audb_publish_database_version_1_0_0
  AudbRepo -->> PublishScript: publish_success
  PublishScript -->> User: report_published_version
Loading

Flow diagram for convert_py_cstr_vctk_database_construction

flowchart TD
  Start([Start_main])
  SetPaths["Set_input_output_paths_and_directories"]
  LoadSpeakerInfo["load_speaker_info_from_speaker_info_txt"]
  NormalizeGender["Map_gender_codes_and_set_language_column"]
  DefineDb["Create_audformat_Database_and_schemes"]
  FlattenFiles["flatten_files_copy_or_validate_audio_into_build_flac"]
  CheckOutDir{Is_build_flac_already_populated}
  MakeRelPaths["Strip_db_root_prefix_to_get_relative_file_paths"]
  CreateFileTable["Create_filewise_Table_with_file_index"]
  BuildFilesDF["build_files_dataframe_join_speaker_metadata_and_transcripts"]
  AssignColumns["Assign_column_schemes_and_set_values_from_df_files"]
  SaveDb["db_save_to_build_directory"]
  End([End_main])

  Start --> SetPaths --> LoadSpeakerInfo --> NormalizeGender --> DefineDb --> FlattenFiles
  FlattenFiles --> CheckOutDir
  CheckOutDir -->|yes| MakeRelPaths
  CheckOutDir -->|no| MakeRelPaths

  MakeRelPaths --> CreateFileTable --> BuildFilesDF --> AssignColumns --> SaveDb --> End
Loading

File-Level Changes

Change Details Files
Implement converter that flattens VCTK audio, parses speaker and transcript metadata, and builds an audformat/audb database.
  • Add parsing utilities for speaker-info.txt, including handling of two-word accent labels and optional comments.
  • Load per-speaker transcript .txt files into a recording_id-to-text mapping, skipping empty files.
  • Flatten per-speaker audio directories into a single flac/ directory with collision detection and optional validation of pre-existing flattened audio.
  • Define audformat Database with media, schemes, a speaker-level MiscTable, and a file-wise table populated from derived DataFrames, then save it to a build directory.
datasets/cstr-vctk/1.0.0/convert.py
Add scripts to download and extract the upstream VCTK 0.92 release and to publish the built audb dataset.
  • Provide a download-from_web.sh helper that fetches the VCTK 0.92 ZIP, README, and license text into a raw data directory.
  • Provide an extract-from_web.sh helper that unzips the corpus into the expected extracted directory layout, with basic existence checks.
  • Add a Python publish.py script that reads the built database from the build directory and publishes version 1.0.0 to the configured audb S3 repository.
datasets/cstr-vctk/1.0.0/download-from_web.sh
datasets/cstr-vctk/1.0.0/extract-from_web.sh
datasets/cstr-vctk/1.0.0/publish.py
Introduce project configuration and documentation for the cstr-vctk converter package.
  • Define pyproject.toml with package metadata, Python version requirement, runtime dependencies (audb, audeer, audformat, pandas, tqdm), and a dev dependency group (black).
  • Add README describing the CSTR VCTK Corpus and how to use the helper scripts for download and extraction.
  • Add a changelog documenting the initial 1.0.0 release and include ancillary project files like LICENSE, .gitignore, and uv.lock.
datasets/cstr-vctk/pyproject.toml
datasets/cstr-vctk/README.md
datasets/cstr-vctk/CHANGELOG.md
datasets/cstr-vctk/LICENSE
datasets/cstr-vctk/.gitignore
datasets/cstr-vctk/uv.lock

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • Consider wrapping the main() invocation in a if __name__ == "__main__": guard so the conversion logic doesn’t run on import (e.g., when used as a library or during tooling/inspection).
  • The various paths in main() (e.g., data/, build, VCTK-Corpus-0.92) are currently hard-coded; exposing them as arguments or environment-configurable values would make the converter more reusable and easier to run in different environments.
  • In flatten_files(), using filecmp.cmp(..., shallow=True) may miss content differences if timestamps and sizes match; if the goal is strict validation of existing flattened files, consider shallow=False or a content-based hash check.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider wrapping the `main()` invocation in a `if __name__ == "__main__":` guard so the conversion logic doesn’t run on import (e.g., when used as a library or during tooling/inspection).
- The various paths in `main()` (e.g., `data/`, `build`, `VCTK-Corpus-0.92`) are currently hard-coded; exposing them as arguments or environment-configurable values would make the converter more reusable and easier to run in different environments.
- In `flatten_files()`, using `filecmp.cmp(..., shallow=True)` may miss content differences if timestamps and sizes match; if the goal is strict validation of existing flattened files, consider `shallow=False` or a content-based hash check.

## Individual Comments

### Comment 1
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="38-39" />
<code_context>
+            line = raw.rstrip()
+            if not line:
+                continue
+            # Detach optional parenthetical comment at line end
+            m = re.search(r"\(.*\)\s*$", line)
+            comment = m.group(0).strip() if m else ""
+            line_clean = line[: m.start()].rstrip() if m else line
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Make the trailing comment regex non-greedy to avoid swallowing earlier parentheses in the line.

`r"\(.*\)\s*$"` is greedy and will match from the first `(` to the last `)` on the line, so any earlier parentheses (e.g., in region/accent fields) will be treated as part of the comment. Use a pattern constrained to the final parenthetical block, e.g. `r"\([^()]*\)\s*$"` or a non-greedy `r"\(.*?\)\s*$"`, so only the trailing comment is matched.

```suggestion
            # Detach optional parenthetical comment at line end
            # Use a non-greedy / constrained pattern so only the final parenthetical block is treated as comment
            m = re.search(r"\([^()]*\)\s*$", line)
```
</issue_to_address>

### Comment 2
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="400-5" />
<code_context>
+    db[table_id]["microphone"] = audformat.Column(scheme_id="microphone")
+    db[table_id]["microphone"].set(df_files["microphone"])
+
+    db.save(db_root)
+
+
+main()
</code_context>
<issue_to_address>
**issue (bug_risk):** Avoid executing the conversion pipeline on import by guarding the main entry point.

Calling `main()` at module level means simply importing this file will immediately read data, copy files, and write to the database, which is unsafe and makes the module hard to reuse. Please wrap the `main()` call in an `if __name__ == "__main__":` guard so these side effects only occur on direct CLI execution, not on import.
</issue_to_address>

### Comment 3
<location path="datasets/cstr-vctk/README.md" line_range="1" />
<code_context>
+Use [`download-from_web.sh`](1.0.0/download-from_web.sh) and [`extract-from_web.sh`](1.0.0/extract-from_web.sh) to download and extract the dataset from the web.  
+
+> Christophe Veaux,  Junichi Yamagishi, Kirsten MacDonald,  
+> "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",  
+> The Centre for Speech Technology Research (CSTR),  
+> University of Edinburgh 
</code_context>
<issue_to_address>
**question (typo):** Dataset title wording differs between the link text and the citation.

The first sentence uses `for Centre for Speech Technology Voice Cloning Toolkit`, while the citation uses `for CSTR Voice Cloning Toolkit`. Please align these so the dataset title is consistent with the official wording.

```suggestion
Publish the [`CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit`](https://datashare.ed.ac.uk/handle/10283/3443) dataset.  
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread datasets/cstr-vctk/1.0.0/convert.py
Comment thread datasets/cstr-vctk/1.0.0/convert.py
Comment thread datasets/cstr-vctk/README.md
@Pascal-H Pascal-H marked this pull request as draft March 12, 2026 16:28
@Pascal-H Pascal-H marked this pull request as ready for review March 12, 2026 16:36
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • In pyproject.toml the requires-python = ">=3.13" constraint is very restrictive compared to typical deployment environments; consider lowering it to the minimum version actually required by your dependencies and code (e.g., matching the rest of the project).
  • In flatten_files() you use filecmp.cmp(..., shallow=True) for validation; switching to shallow=False would ensure content-level equality instead of relying on metadata, which is safer when verifying copied audio files.
  • When normalizing paths for idx_files, using os.path.relpath(path, db_root) instead of str.removeprefix(db_root + os.sep) would be more robust across platforms and avoid assumptions about how the prefix appears in the string.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `pyproject.toml` the `requires-python = ">=3.13"` constraint is very restrictive compared to typical deployment environments; consider lowering it to the minimum version actually required by your dependencies and code (e.g., matching the rest of the project).
- In `flatten_files()` you use `filecmp.cmp(..., shallow=True)` for validation; switching to `shallow=False` would ensure content-level equality instead of relying on metadata, which is safer when verifying copied audio files.
- When normalizing paths for `idx_files`, using `os.path.relpath(path, db_root)` instead of `str.removeprefix(db_root + os.sep)` would be more robust across platforms and avoid assumptions about how the prefix appears in the string.

## Individual Comments

### Comment 1
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="38-39" />
<code_context>
+            line = raw.rstrip()
+            if not line:
+                continue
+            # Detach optional parenthetical comment at line end
+            m = re.search(r"\(.*\)\s*$", line)
+            comment = m.group(0).strip() if m else ""
+            line_clean = line[: m.start()].rstrip() if m else line
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The greedy comment regex may accidentally strip content if earlier parentheses appear in the line

`r"\(.*\)\s*$"` is greedy and will match from the first `(` in the line to the end. If any non-comment content ever includes parentheses, everything from that first `(` onward will be discarded. To avoid this, restrict the match to a single parenthetical group at the end, e.g. `r"\([^()]*\)\s*$"`, so only a trailing comment block is removed.

```suggestion
            # Detach optional parenthetical comment at line end
            # Use a non-greedy, single-group match so only a trailing parenthetical block is removed
            m = re.search(r"\([^()]*\)\s*$", line)
```
</issue_to_address>

### Comment 2
<location path="datasets/cstr-vctk/1.0.0/convert.py" line_range="140-147" />
<code_context>
+    return df
+
+
+def flatten_files(dir_in: str, dir_out: str) -> pd.Index:
+    """
+    Extract all files from the speaker-based subdirectories and put them all in
+    a plain directory. Check if there are colliding file names.
+    If the output directory already exists: validate that the therein
+    contained, flattened-out files are matching the ones in the input directory.
+
+    Returns a Pandas Index with the relative file paths.
+    """
+    files = []
</code_context>
<issue_to_address>
**nitpick:** Docstring claims to return relative paths, but the function returns absolute paths

The function currently appends absolute `target_file` paths to `files` and returns them as a `pd.Index`, while they are only made relative later in `main()` via `str.removeprefix`. Please either update the docstring to state that absolute paths are returned, or adjust the function to strip `dir_out` before returning so the behavior matches the current docstring.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread datasets/cstr-vctk/1.0.0/convert.py
Comment thread datasets/cstr-vctk/1.0.0/convert.py Outdated
@Pascal-H Pascal-H merged commit 14362cf into main Mar 12, 2026
3 checks passed
@Pascal-H Pascal-H deleted the publish-cstr-vctk branch March 12, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant