Skip to content

Add audbcards.Dataset.example_json#142

Merged
hagenw merged 4 commits into
mainfrom
example_json
Jan 9, 2026
Merged

Add audbcards.Dataset.example_json#142
hagenw merged 4 commits into
mainfrom
example_json

Conversation

@hagenw
Copy link
Copy Markdown
Member

@hagenw hagenw commented Jan 6, 2026

As we are interested in not only showing an example audio/video file on a datacard, but also the content (or an excerpt of that content) of a json file, this pull request adds audbcards.Dataset.example_json to randomly select a json file from a dataset.

We do not have yet a public dataset that contains a json file (this will change with audeering/datasets#29), but it can be tested with our internal datasets:

>>> import audbcards
>>> ds = audbcards.Dataset("toolace", "1.0.0")
>>> ds.example_json
'json/sample-1077.json'
image

Summary by Sourcery

Add support for selecting example JSON files from datasets and reuse archive file counting logic across example selections.

New Features:

  • Introduce Dataset.example_json cached property to return a representative JSON file from a dataset when available.

Enhancements:

  • Refactor archive file counting into a shared helper used by example selection methods and simplify example media selection return logic.
  • Seed the random module to make example selection deterministic.

Tests:

  • Add tests validating example_json selection behavior for datasets with and without acceptable JSON files.

@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented Jan 6, 2026

Reviewer's Guide

Adds a new Dataset.example_json cached property to select a representative JSON file from a dataset, refactors archive-size checking into a shared helper used by both example_json and example_media, seeds the random module for deterministic selection, and introduces tests for the new behavior.

Class diagram for updated _Dataset example selection API

classDiagram
    class _Dataset {
        <<class>>
        +segment_durations
        +duration() pd.Timedelta
        +example_media() str|None
        +example_json() str|None
        +files() int
        +_cached_properties() list
        +_files_in_archive(index int) int
        +_load_backend() audbackend.interface.Base
        +deps() DependencyTable
        +repository_object RepositoryObject
    }

    class DependencyTable {
        <<class>>
        +format list~str~
        +files list~str~
        +media list~str~
        +archive list~str~
    }

    class RepositoryObject {
        <<class>>
        +create_backend_interface() audbackend.interface.Base
    }

    _Dataset --> DependencyTable : uses_deps
    _Dataset --> RepositoryObject : has_repository_object
Loading

Flow diagram for example_json selection logic

flowchart TD
    A["Start example_json"] --> B["Collect indices with format == json"]
    B --> C{"Any json indices?"}
    C -- "No" --> D["Return None"]
    C -- "Yes" --> E["Pick random index from json indices"]
    E --> F["Call _files_in_archive(index)"]
    F --> G{"Files in archive < 100?"}
    G -- "No" --> D
    G -- "Yes" --> H["Return deps.files[index]"]
Loading

Flow diagram for refactored example_media archive check

flowchart TD
    A["Start example_media"] --> B["Compute durations from segment_durations"]
    B --> C["Select duration closest to median"]
    C --> D["Find index of selected duration"]
    D --> E["Call _files_in_archive(index)"]
    E --> F{"Files in archive < 100?"}
    F -- "No" --> G["Return None"]
    F -- "Yes" --> H["Return deps.media[index]"]
Loading

File-Level Changes

Change Details Files
Introduce Dataset.example_json cached property to return a representative JSON file when available and within archive-size constraints.
  • Add example_json cached property that collects indices of JSON dependencies and randomly selects one
  • Return None when no JSON files exist or when the JSON file resides in an archive with 100 or more files
  • Use shared helper to enforce the max-archive-files condition
audbcards/core/dataset.py
tests/test_dataset.py
Refactor archive-file-count logic into a reusable helper and apply it to example_media.
  • Add _files_in_archive helper that counts how many files share the same archive as a given dependency index
  • Simplify example_media to return the selected media file only if its archive has fewer than 100 files by using _files_in_archive
audbcards/core/dataset.py
Make example selection deterministic and cover new JSON example logic with tests.
  • Seed Python's random module with a fixed value (123) to ensure deterministic example selection
  • Add parametrized test_dataset_example_json to verify JSON example selection and None behavior for large archives
audbcards/core/dataset.py
tests/test_dataset.py

Possibly linked issues

  • #Show example JSON file content: PR adds Dataset.example_json, enabling selection of example JSON files needed to show their content as in issue.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • Seeding the global RNG with random.seed(123) at module import will affect all consumers of this library; instead, consider using a local random.Random instance inside the relevant methods or controlling randomness only within tests.
  • The _files_in_archive helper currently counts matches via a list comprehension; you could simplify and speed this up by using a vectorized operation on the archives Series (e.g., (archives == selected_archive).sum()).
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Seeding the global RNG with `random.seed(123)` at module import will affect all consumers of this library; instead, consider using a local `random.Random` instance inside the relevant methods or controlling randomness only within tests.
- The `_files_in_archive` helper currently counts matches via a list comprehension; you could simplify and speed this up by using a vectorized operation on the `archives` Series (e.g., `(archives == selected_archive).sum()`).

## Individual Comments

### Comment 1
<location> `audbcards/core/dataset.py:22` </location>
<code_context>
 from audbcards.core.config import config


+random.seed(123)
+
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Avoid setting the global random seed at import time to prevent side effects for library users.

This can unexpectedly affect any code using the global `random` (including tests and other libraries). For deterministic behavior, use a dedicated `random.Random(123)` instance (as a private module-level object or created locally) instead of seeding the global RNG.
</issue_to_address>

### Comment 2
<location> `audbcards/core/dataset.py:672-674` </location>
<code_context>
+        """
+        archives = self.deps().archive
+        selected_archive = archives.iloc[index]
+        return len([archive for archive in archives if archive == selected_archive])
+
     def _load_backend(self) -> type[audbackend.interface.Base]:
</code_context>

<issue_to_address>
**suggestion (performance):** Use vectorized counting instead of building a list to count matching archives.

Since `archives` is a pandas Series, you can avoid creating an intermediate list and use a vectorized comparison instead:

```python
return (archives == selected_archive).sum()
```
This is both clearer and more efficient than the current list comprehension plus `len`.

```suggestion
        archives = self.deps().archive
        selected_archive = archives.iloc[index]
        return (archives == selected_archive).sum()
```
</issue_to_address>

### Comment 3
<location> `tests/test_dataset.py:366-373` </location>
<code_context>
     assert dataset.example_media == expected_example


+@pytest.mark.parametrize(
+    "db, expected",
+    [
+        ("mixed_db", "c0.json"),
+        ("medium_db", None),
+    ],
+)
+def test_dataset_example_json(db, expected, cache, request):
+    r"""Test Dataset.example_json.
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test case where a JSON file exists but is filtered out because its archive contains ≥100 files.

The current parametrization only checks a case with a JSON file (`mixed_db`) and a case that likely has none (`medium_db`), but doesn’t verify the archive-size exclusion rule described in the `example_json` docstring. Please add a dedicated fixture (e.g. `large_json_db`) where a JSON file is inside an archive with ≥100 files and assert that `example_json` returns `None`, so the `_files_in_archive` size condition is explicitly covered and protected from regressions.

Suggested implementation:

```python
@pytest.mark.parametrize(
    "db, expected",
    [
        ("mixed_db", "c0.json"),
        ("medium_db", None),
        ("large_json_db", None),
    ],
)
def test_dataset_example_json(db, expected, cache, request):
    r"""Test Dataset.example_json.

```

To fully implement the requested behavior you will also need to:

1. Define a `large_json_db` fixture in `tests/test_dataset.py` (or a shared conftest) alongside the existing `mixed_db` / `medium_db` fixtures. For example:
   - Create an archive (e.g. `large_archive.zip`) with ≥100 files (99 dummy files + 1 JSON file such as `filtered.json`).
   - Register this archive in the test database metadata so that:
     * The JSON file is discoverable by whatever logic `Dataset.example_json` uses.
     * The archive’s `_files_in_archive` (or equivalent) is set to ≥100.
2. Make sure `large_json_db` returns the same type/shape as `mixed_db` and `medium_db` (likely a path or a dataset object), so the existing `test_dataset_example_json` body continues to work.
3. Optionally, update any factory/helper used to create test DBs (if `mixed_db`/`medium_db` are built via a helper) to add the `large_json_db` configuration in a consistent way rather than hand-rolling it in the fixture.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread audbcards/core/dataset.py Outdated
Comment thread audbcards/core/dataset.py Outdated
hagenw and others added 2 commits January 6, 2026 09:28
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@hagenw hagenw self-assigned this Jan 6, 2026
@hagenw hagenw requested a review from ChristianGeng January 6, 2026 08:35
@hagenw
Copy link
Copy Markdown
Member Author

hagenw commented Jan 6, 2026

@ChristianGeng if you experience an error when trying to build the HTML pages, e.g.

$ uv run python -m sphinx docs/ build/html -b html
Running Sphinx v5.3.0                                                                                                                                                         
loading pickled environment... done                                                                                                                                           
loading intersphinx inventory from https://audeering.github.io/audb/objects.inv...     
[autosummary] generating autosummary for: api/audbcards.rst, changelog.rst, contributing.rst, index.rst, install.rst, sphinx-extension.rst                                    
[autosummary] generating autosummary for: /home/hwierstorf/git/audeering/audbcards/docs/api/audbcards.Datacard.rst, /home/hwierstorf/git/audeering/audbcards/docs/api/audbcard
s.Dataset.rst, /home/hwierstorf/git/audeering/audbcards/docs/api/audbcards.config.rst  
Get list of available datasets... done                                                                                                                                        
Parse air-1.4.2... wrote docs/audb-public/air.rst                                                                                                                             
Parse clac-1.1.0... wrote docs/audb-public/clac.rst                                                                                                                           
Parse cmu-mosei-1.2.4... wrote docs/audb-public/cmu-mosei.rst                                                                                                                 
Parse cmu-mosi-1.1.1... wrote docs/audb-public/cmu-mosi.rst                                                                                                                   
Parse cochlscene-1.0.0... wrote docs/audb-public/cochlscene.rst                                                                                                               
Parse cough-speech-sneeze-2.0.1... wrote docs/audb-public/cough-speech-sneeze.rst      
Parse crema-d-1.3.0... wrote docs/audb-public/crema-d.rst                                                                                                                     
Parse css10-1.0.0... wrote docs/audb-public/css10.rst                                                                                                                         
Parse eesc-1.0.1... wrote docs/audb-public/eesc.rst                                    
Parse emodb-2.0.0...                                                                                                                                                          
Extension error (audbcards.sphinx):                                                                                                                                           
Handler <function builder_inited at 0x7685bfe07130> for event 'builder-inited' threw an exception (exception: issubclass() arg 1 must be a class)

it is related to audeering/audformat#483

As a workaround, the affected cache can be deleted, e.g. here it would be

$ rm -rf ~/audb/emodb

@ChristianGeng
Copy link
Copy Markdown
Member

@ChristianGeng if you experience an error when trying to build the HTML pages, e.g.

it is related to audeering/audformat#483

As a workaround, the affected cache can be deleted, e.g. here it would be

$ rm -rf ~/audb/emodb

I do not encounter this right now. I have tried it on two separate machines,
For one I am having cached versions in ~/audb that date back as far as Feb 2025.
For the other one, the cached data even dates back to 2023.

@hagenw
Copy link
Copy Markdown
Member Author

hagenw commented Jan 6, 2026

If you don't get the error, it's also fine :)

I think it depends on the used pandas versions and will be solved anyway with the next audformat release.

@ChristianGeng
Copy link
Copy Markdown
Member

Code review

Found 1 issue:

  1. Docstring incorrectly states "Content" when property returns path

The docstring at line 297 says "Content of example json file from dataset" but the property actually returns a file path (string), not the file content. This will confuse users who expect to receive file content instead of a path. The docstring should say "Path to example json file" or "Example json file from dataset" to match the actual behavior.

r"""Example json file.
Content of example json file from dataset.
The json file needs to be stored in an archive
with less than 100 files.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@hagenw
Copy link
Copy Markdown
Member Author

hagenw commented Jan 9, 2026

Good catch. First I had the idea to include the json file content instead of the file, but then I realized that we have the file for example_media and it makes anyway more sense to not read the file already at this stage. I changed the docstring to Path to example json file from dataset..

@ChristianGeng
Copy link
Copy Markdown
Member

Good catch. First I had the idea to include the json file content instead of the file, but then I realized that we have the file for example_media and it makes anyway more sense to not read the file already at this stage. I changed the docstring to Path to example json file from dataset..

This was an experiment of getting the claude code review to work, so credits to claude.

I would proceed with approval, but I am having some doctest failues right now in audbcards/core/dataset.py belonging to loading emodb version 1.4.1.

(Pdb++) Dataset
<class 'audbcards.core.dataset.Dataset'>
(Pdb++) Dataset("emodb", "2.0.0")
<audbcards.core.dataset._Dataset object at 0x78b7164e2110>
(Pdb++) Dataset("emodb", "1.4.1")
*** RuntimeError: Cannot find database 'emodb'.
Traceback (most recent call last):
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 873, in __new__
    instance = _Dataset.create(
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 94, in create
    obj = cls(name, version, cache_root, load_tables)
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 138, in __init__
    self._repository_object = self._load_repository_object()  # load before backend
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audbcards/core/dataset.py", line 660, in _load_repository_object
    return audb.repository(self.name, self.version)
  File "/home/cgeng/work/.envs/daitools/lib/python3.10/site-packages/audb/core/api.py", line 576, in repository
    raise RuntimeError(f"Cannot find database " f"'{name}'.")

In other words, 2.0.0 loads but 1.4.1 refuses to. It is related to the merge request that you link to above - but the error is different. Still it will probably be better treated in the audeering/audformat#483 I take it?

@hagenw
Copy link
Copy Markdown
Member Author

hagenw commented Jan 9, 2026

Yes, this is very likely the error described in audeering/audformat#483 and indepednent of audbcards.

Copy link
Copy Markdown
Member

@ChristianGeng ChristianGeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

To me this looks sound.
There were a few cosmetic changes but nothing earthshaking.
So I can proceed with approval.

@hagenw hagenw merged commit 2ef1168 into main Jan 9, 2026
13 checks passed
@hagenw hagenw deleted the example_json branch January 9, 2026 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants