Add audbcards.Dataset.example_json by hagenw · Pull Request #142 · audeering/audbcards

hagenw · 2026-01-06T08:24:52Z

As we are interested in not only showing an example audio/video file on a datacard, but also the content (or an excerpt of that content) of a json file, this pull request adds audbcards.Dataset.example_json to randomly select a json file from a dataset.

We do not have yet a public dataset that contains a json file (this will change with audeering/datasets#29), but it can be tested with our internal datasets:

>>> import audbcards
>>> ds = audbcards.Dataset("toolace", "1.0.0")
>>> ds.example_json
'json/sample-1077.json'

Summary by Sourcery

Add support for selecting example JSON files from datasets and reuse archive file counting logic across example selections.

New Features:

Introduce Dataset.example_json cached property to return a representative JSON file from a dataset when available.

Enhancements:

Refactor archive file counting into a shared helper used by example selection methods and simplify example media selection return logic.
Seed the random module to make example selection deterministic.

Tests:

Add tests validating example_json selection behavior for datasets with and without acceptable JSON files.

sourcery-ai · 2026-01-06T08:24:58Z

Reviewer's Guide

Adds a new Dataset.example_json cached property to select a representative JSON file from a dataset, refactors archive-size checking into a shared helper used by both example_json and example_media, seeds the random module for deterministic selection, and introduces tests for the new behavior.

Class diagram for updated _Dataset example selection API

classDiagram
    class _Dataset {
        <<class>>
        +segment_durations
        +duration() pd.Timedelta
        +example_media() str|None
        +example_json() str|None
        +files() int
        +_cached_properties() list
        +_files_in_archive(index int) int
        +_load_backend() audbackend.interface.Base
        +deps() DependencyTable
        +repository_object RepositoryObject
    }

    class DependencyTable {
        <<class>>
        +format list~str~
        +files list~str~
        +media list~str~
        +archive list~str~
    }

    class RepositoryObject {
        <<class>>
        +create_backend_interface() audbackend.interface.Base
    }

    _Dataset --> DependencyTable : uses_deps
    _Dataset --> RepositoryObject : has_repository_object

Flow diagram for example_json selection logic

flowchart TD
    A["Start example_json"] --> B["Collect indices with format == json"]
    B --> C{"Any json indices?"}
    C -- "No" --> D["Return None"]
    C -- "Yes" --> E["Pick random index from json indices"]
    E --> F["Call _files_in_archive(index)"]
    F --> G{"Files in archive < 100?"}
    G -- "No" --> D
    G -- "Yes" --> H["Return deps.files[index]"]

Flow diagram for refactored example_media archive check

flowchart TD
    A["Start example_media"] --> B["Compute durations from segment_durations"]
    B --> C["Select duration closest to median"]
    C --> D["Find index of selected duration"]
    D --> E["Call _files_in_archive(index)"]
    E --> F{"Files in archive < 100?"}
    F -- "No" --> G["Return None"]
    F -- "Yes" --> H["Return deps.media[index]"]

File-Level Changes

Change	Details	Files
Introduce Dataset.example_json cached property to return a representative JSON file when available and within archive-size constraints.	Add example_json cached property that collects indices of JSON dependencies and randomly selects one Return None when no JSON files exist or when the JSON file resides in an archive with 100 or more files Use shared helper to enforce the max-archive-files condition	`audbcards/core/dataset.py` `tests/test_dataset.py`
Refactor archive-file-count logic into a reusable helper and apply it to example_media.	Add _files_in_archive helper that counts how many files share the same archive as a given dependency index Simplify example_media to return the selected media file only if its archive has fewer than 100 files by using _files_in_archive	`audbcards/core/dataset.py`
Make example selection deterministic and cover new JSON example logic with tests.	Seed Python's random module with a fixed value (123) to ensure deterministic example selection Add parametrized test_dataset_example_json to verify JSON example selection and None behavior for large archives	`audbcards/core/dataset.py` `tests/test_dataset.py`

Possibly linked issues

#Show example JSON file content: PR adds Dataset.example_json, enabling selection of example JSON files needed to show their content as in issue.

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 3 issues, and left some high level feedback:

Seeding the global RNG with random.seed(123) at module import will affect all consumers of this library; instead, consider using a local random.Random instance inside the relevant methods or controlling randomness only within tests.
The _files_in_archive helper currently counts matches via a list comprehension; you could simplify and speed this up by using a vectorized operation on the archives Series (e.g., (archives == selected_archive).sum()).

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Seeding the global RNG with `random.seed(123)` at module import will affect all consumers of this library; instead, consider using a local `random.Random` instance inside the relevant methods or controlling randomness only within tests.
- The `_files_in_archive` helper currently counts matches via a list comprehension; you could simplify and speed this up by using a vectorized operation on the `archives` Series (e.g., `(archives == selected_archive).sum()`).

## Individual Comments

### Comment 1
<location> `audbcards/core/dataset.py:22` </location>
<code_context>
 from audbcards.core.config import config


+random.seed(123)
+
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Avoid setting the global random seed at import time to prevent side effects for library users.

This can unexpectedly affect any code using the global `random` (including tests and other libraries). For deterministic behavior, use a dedicated `random.Random(123)` instance (as a private module-level object or created locally) instead of seeding the global RNG.
</issue_to_address>

### Comment 2
<location> `audbcards/core/dataset.py:672-674` </location>
<code_context>
+        """
+        archives = self.deps().archive
+        selected_archive = archives.iloc[index]
+        return len([archive for archive in archives if archive == selected_archive])
+
     def _load_backend(self) -> type[audbackend.interface.Base]:
</code_context>

<issue_to_address>
**suggestion (performance):** Use vectorized counting instead of building a list to count matching archives.

Since `archives` is a pandas Series, you can avoid creating an intermediate list and use a vectorized comparison instead:

```python
return (archives == selected_archive).sum()
```
This is both clearer and more efficient than the current list comprehension plus `len`.

```suggestion
        archives = self.deps().archive
        selected_archive = archives.iloc[index]
        return (archives == selected_archive).sum()
```
</issue_to_address>

### Comment 3
<location> `tests/test_dataset.py:366-373` </location>
<code_context>
     assert dataset.example_media == expected_example


+@pytest.mark.parametrize(
+    "db, expected",
+    [
+        ("mixed_db", "c0.json"),
+        ("medium_db", None),
+    ],
+)
+def test_dataset_example_json(db, expected, cache, request):
+    r"""Test Dataset.example_json.
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test case where a JSON file exists but is filtered out because its archive contains ≥100 files.

The current parametrization only checks a case with a JSON file (`mixed_db`) and a case that likely has none (`medium_db`), but doesn’t verify the archive-size exclusion rule described in the `example_json` docstring. Please add a dedicated fixture (e.g. `large_json_db`) where a JSON file is inside an archive with ≥100 files and assert that `example_json` returns `None`, so the `_files_in_archive` size condition is explicitly covered and protected from regressions.

Suggested implementation:

```python
@pytest.mark.parametrize(
    "db, expected",
    [
        ("mixed_db", "c0.json"),
        ("medium_db", None),
        ("large_json_db", None),
    ],
)
def test_dataset_example_json(db, expected, cache, request):
    r"""Test Dataset.example_json.

```

To fully implement the requested behavior you will also need to:

1. Define a `large_json_db` fixture in `tests/test_dataset.py` (or a shared conftest) alongside the existing `mixed_db` / `medium_db` fixtures. For example:
   - Create an archive (e.g. `large_archive.zip`) with ≥100 files (99 dummy files + 1 JSON file such as `filtered.json`).
   - Register this archive in the test database metadata so that:
     * The JSON file is discoverable by whatever logic `Dataset.example_json` uses.
     * The archive’s `_files_in_archive` (or equivalent) is set to ≥100.
2. Make sure `large_json_db` returns the same type/shape as `mixed_db` and `medium_db` (likely a path or a dataset object), so the existing `test_dataset_example_json` body continues to work.
3. Optionally, update any factory/helper used to create test DBs (if `mixed_db`/`medium_db` are built via a helper) to add the `large_json_db` configuration in a consistent way rather than hand-rolling it in the fixture.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

hagenw · 2026-01-06T08:37:49Z

@ChristianGeng if you experience an error when trying to build the HTML pages, e.g.

$ uv run python -m sphinx docs/ build/html -b html
Running Sphinx v5.3.0                                                                                                                                                         
loading pickled environment... done                                                                                                                                           
loading intersphinx inventory from https://audeering.github.io/audb/objects.inv...     
[autosummary] generating autosummary for: api/audbcards.rst, changelog.rst, contributing.rst, index.rst, install.rst, sphinx-extension.rst                                    
[autosummary] generating autosummary for: /home/hwierstorf/git/audeering/audbcards/docs/api/audbcards.Datacard.rst, /home/hwierstorf/git/audeering/audbcards/docs/api/audbcard
s.Dataset.rst, /home/hwierstorf/git/audeering/audbcards/docs/api/audbcards.config.rst  
Get list of available datasets... done                                                                                                                                        
Parse air-1.4.2... wrote docs/audb-public/air.rst                                                                                                                             
Parse clac-1.1.0... wrote docs/audb-public/clac.rst                                                                                                                           
Parse cmu-mosei-1.2.4... wrote docs/audb-public/cmu-mosei.rst                                                                                                                 
Parse cmu-mosi-1.1.1... wrote docs/audb-public/cmu-mosi.rst                                                                                                                   
Parse cochlscene-1.0.0... wrote docs/audb-public/cochlscene.rst                                                                                                               
Parse cough-speech-sneeze-2.0.1... wrote docs/audb-public/cough-speech-sneeze.rst      
Parse crema-d-1.3.0... wrote docs/audb-public/crema-d.rst                                                                                                                     
Parse css10-1.0.0... wrote docs/audb-public/css10.rst                                                                                                                         
Parse eesc-1.0.1... wrote docs/audb-public/eesc.rst                                    
Parse emodb-2.0.0...                                                                                                                                                          
Extension error (audbcards.sphinx):                                                                                                                                           
Handler <function builder_inited at 0x7685bfe07130> for event 'builder-inited' threw an exception (exception: issubclass() arg 1 must be a class)

it is related to audeering/audformat#483

As a workaround, the affected cache can be deleted, e.g. here it would be

$ rm -rf ~/audb/emodb

ChristianGeng · 2026-01-06T09:27:09Z

@ChristianGeng if you experience an error when trying to build the HTML pages, e.g.

it is related to audeering/audformat#483

As a workaround, the affected cache can be deleted, e.g. here it would be
$ rm -rf ~/audb/emodb

I do not encounter this right now. I have tried it on two separate machines,
For one I am having cached versions in ~/audb that date back as far as Feb 2025.
For the other one, the cached data even dates back to 2023.

hagenw · 2026-01-06T09:40:52Z

If you don't get the error, it's also fine :)

I think it depends on the used pandas versions and will be solved anyway with the next audformat release.

ChristianGeng · 2026-01-09T13:06:11Z

Code review

Found 1 issue:

Docstring incorrectly states "Content" when property returns path

The docstring at line 297 says "Content of example json file from dataset" but the property actually returns a file path (string), not the file content. This will confuse users who expect to receive file content instead of a path. The docstring should say "Path to example json file" or "Example json file from dataset" to match the actual behavior.

audbcards/audbcards/core/dataset.py

Lines 295 to 299 in 3b6a649

    
                   r"""Example json file. 
        
                   Content of example json file from dataset. 
        
                   The json file needs to be stored in an archive 
        
                   with less than 100 files.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

hagenw · 2026-01-09T13:52:22Z

Good catch. First I had the idea to include the json file content instead of the file, but then I realized that we have the file for example_media and it makes anyway more sense to not read the file already at this stage. I changed the docstring to Path to example json file from dataset..

ChristianGeng · 2026-01-09T14:41:52Z

Good catch. First I had the idea to include the json file content instead of the file, but then I realized that we have the file for example_media and it makes anyway more sense to not read the file already at this stage. I changed the docstring to Path to example json file from dataset..

This was an experiment of getting the claude code review to work, so credits to claude.

I would proceed with approval, but I am having some doctest failues right now in audbcards/core/dataset.py belonging to loading emodb version 1.4.1.

(Pdb++) Dataset
<class 'audbcards.core.dataset.Dataset'>
(Pdb++) Dataset("emodb", "2.0.0")
<audbcards.core.dataset._Dataset object at 0x78b7164e2110>
(Pdb++) Dataset("emodb", "1.4.1")
*** RuntimeError: Cannot find database 'emodb'.
Traceback (most recent call last):
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 873, in __new__
    instance = _Dataset.create(
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 94, in create
    obj = cls(name, version, cache_root, load_tables)
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 138, in __init__
    self._repository_object = self._load_repository_object()  # load before backend
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 660, in _load_repository_object
    return audb.repository(self.name, self.version)
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audb/core/api.py", line 576, in repository
    raise RuntimeError(f"Cannot find database " f"'{name}'.")

In other words, 2.0.0 loads but 1.4.1 refuses to. It is related to the merge request that you link to above - but the error is different. Still it will probably be better treated in the audeering/audformat#483 I take it?

hagenw · 2026-01-09T14:59:31Z

Yes, this is very likely the error described in audeering/audformat#483 and indepednent of audbcards.

ChristianGeng

Review

To me this looks sound.
There were a few cosmetic changes but nothing earthshaking.
So I can proceed with approval.

Add audbcards.Dataset.example_json

c551d44

sourcery-ai Bot reviewed Jan 6, 2026

View reviewed changes

Comment thread audbcards/core/dataset.py Outdated

Comment thread audbcards/core/dataset.py Outdated

hagenw and others added 2 commits January 6, 2026 09:28

Update audbcards/core/dataset.py

3653f61

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

Use dedicated random object

3b6a649

hagenw self-assigned this Jan 6, 2026

hagenw requested a review from ChristianGeng January 6, 2026 08:35

hagenw mentioned this pull request Jan 7, 2026

Show json example on datacard #143

Merged

Fix docstring to json file

0233016

ChristianGeng approved these changes Jan 9, 2026

View reviewed changes

hagenw merged commit 2ef1168 into main Jan 9, 2026
13 checks passed

hagenw deleted the example_json branch January 9, 2026 15:55

Conversation

hagenw commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Class diagram for updated _Dataset example selection API

Flow diagram for example_json selection logic

Flow diagram for refactored example_media archive check

File-Level Changes

Possibly linked issues

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hagenw commented Jan 6, 2026

Uh oh!

ChristianGeng commented Jan 6, 2026

Uh oh!

hagenw commented Jan 6, 2026

Uh oh!

ChristianGeng commented Jan 9, 2026

Code review

Uh oh!

hagenw commented Jan 9, 2026

Uh oh!

ChristianGeng commented Jan 9, 2026

Uh oh!

hagenw commented Jan 9, 2026

Uh oh!

ChristianGeng left a comment

Choose a reason for hiding this comment

Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hagenw commented Jan 6, 2026 •

edited

Loading

sourcery-ai Bot commented Jan 6, 2026 •

edited

Loading