Add audbcards.Dataset.example_json#142
Conversation
Reviewer's GuideAdds a new Dataset.example_json cached property to select a representative JSON file from a dataset, refactors archive-size checking into a shared helper used by both example_json and example_media, seeds the random module for deterministic selection, and introduces tests for the new behavior. Class diagram for updated _Dataset example selection APIclassDiagram
class _Dataset {
<<class>>
+segment_durations
+duration() pd.Timedelta
+example_media() str|None
+example_json() str|None
+files() int
+_cached_properties() list
+_files_in_archive(index int) int
+_load_backend() audbackend.interface.Base
+deps() DependencyTable
+repository_object RepositoryObject
}
class DependencyTable {
<<class>>
+format list~str~
+files list~str~
+media list~str~
+archive list~str~
}
class RepositoryObject {
<<class>>
+create_backend_interface() audbackend.interface.Base
}
_Dataset --> DependencyTable : uses_deps
_Dataset --> RepositoryObject : has_repository_object
Flow diagram for example_json selection logicflowchart TD
A["Start example_json"] --> B["Collect indices with format == json"]
B --> C{"Any json indices?"}
C -- "No" --> D["Return None"]
C -- "Yes" --> E["Pick random index from json indices"]
E --> F["Call _files_in_archive(index)"]
F --> G{"Files in archive < 100?"}
G -- "No" --> D
G -- "Yes" --> H["Return deps.files[index]"]
Flow diagram for refactored example_media archive checkflowchart TD
A["Start example_media"] --> B["Compute durations from segment_durations"]
B --> C["Select duration closest to median"]
C --> D["Find index of selected duration"]
D --> E["Call _files_in_archive(index)"]
E --> F{"Files in archive < 100?"}
F -- "No" --> G["Return None"]
F -- "Yes" --> H["Return deps.media[index]"]
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- Seeding the global RNG with
random.seed(123)at module import will affect all consumers of this library; instead, consider using a localrandom.Randominstance inside the relevant methods or controlling randomness only within tests. - The
_files_in_archivehelper currently counts matches via a list comprehension; you could simplify and speed this up by using a vectorized operation on thearchivesSeries (e.g.,(archives == selected_archive).sum()).
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Seeding the global RNG with `random.seed(123)` at module import will affect all consumers of this library; instead, consider using a local `random.Random` instance inside the relevant methods or controlling randomness only within tests.
- The `_files_in_archive` helper currently counts matches via a list comprehension; you could simplify and speed this up by using a vectorized operation on the `archives` Series (e.g., `(archives == selected_archive).sum()`).
## Individual Comments
### Comment 1
<location> `audbcards/core/dataset.py:22` </location>
<code_context>
from audbcards.core.config import config
+random.seed(123)
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Avoid setting the global random seed at import time to prevent side effects for library users.
This can unexpectedly affect any code using the global `random` (including tests and other libraries). For deterministic behavior, use a dedicated `random.Random(123)` instance (as a private module-level object or created locally) instead of seeding the global RNG.
</issue_to_address>
### Comment 2
<location> `audbcards/core/dataset.py:672-674` </location>
<code_context>
+ """
+ archives = self.deps().archive
+ selected_archive = archives.iloc[index]
+ return len([archive for archive in archives if archive == selected_archive])
+
def _load_backend(self) -> type[audbackend.interface.Base]:
</code_context>
<issue_to_address>
**suggestion (performance):** Use vectorized counting instead of building a list to count matching archives.
Since `archives` is a pandas Series, you can avoid creating an intermediate list and use a vectorized comparison instead:
```python
return (archives == selected_archive).sum()
```
This is both clearer and more efficient than the current list comprehension plus `len`.
```suggestion
archives = self.deps().archive
selected_archive = archives.iloc[index]
return (archives == selected_archive).sum()
```
</issue_to_address>
### Comment 3
<location> `tests/test_dataset.py:366-373` </location>
<code_context>
assert dataset.example_media == expected_example
+@pytest.mark.parametrize(
+ "db, expected",
+ [
+ ("mixed_db", "c0.json"),
+ ("medium_db", None),
+ ],
+)
+def test_dataset_example_json(db, expected, cache, request):
+ r"""Test Dataset.example_json.
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test case where a JSON file exists but is filtered out because its archive contains ≥100 files.
The current parametrization only checks a case with a JSON file (`mixed_db`) and a case that likely has none (`medium_db`), but doesn’t verify the archive-size exclusion rule described in the `example_json` docstring. Please add a dedicated fixture (e.g. `large_json_db`) where a JSON file is inside an archive with ≥100 files and assert that `example_json` returns `None`, so the `_files_in_archive` size condition is explicitly covered and protected from regressions.
Suggested implementation:
```python
@pytest.mark.parametrize(
"db, expected",
[
("mixed_db", "c0.json"),
("medium_db", None),
("large_json_db", None),
],
)
def test_dataset_example_json(db, expected, cache, request):
r"""Test Dataset.example_json.
```
To fully implement the requested behavior you will also need to:
1. Define a `large_json_db` fixture in `tests/test_dataset.py` (or a shared conftest) alongside the existing `mixed_db` / `medium_db` fixtures. For example:
- Create an archive (e.g. `large_archive.zip`) with ≥100 files (99 dummy files + 1 JSON file such as `filtered.json`).
- Register this archive in the test database metadata so that:
* The JSON file is discoverable by whatever logic `Dataset.example_json` uses.
* The archive’s `_files_in_archive` (or equivalent) is set to ≥100.
2. Make sure `large_json_db` returns the same type/shape as `mixed_db` and `medium_db` (likely a path or a dataset object), so the existing `test_dataset_example_json` body continues to work.
3. Optionally, update any factory/helper used to create test DBs (if `mixed_db`/`medium_db` are built via a helper) to add the `large_json_db` configuration in a consistent way rather than hand-rolling it in the fixture.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
|
@ChristianGeng if you experience an error when trying to build the HTML pages, e.g. $ uv run python -m sphinx docs/ build/html -b html
Running Sphinx v5.3.0
loading pickled environment... done
loading intersphinx inventory from https://audeering.github.io/audb/objects.inv...
[autosummary] generating autosummary for: api/audbcards.rst, changelog.rst, contributing.rst, index.rst, install.rst, sphinx-extension.rst
[autosummary] generating autosummary for: /home/hwierstorf/git/audeering/audbcards/docs/api/audbcards.Datacard.rst, /home/hwierstorf/git/audeering/audbcards/docs/api/audbcard
s.Dataset.rst, /home/hwierstorf/git/audeering/audbcards/docs/api/audbcards.config.rst
Get list of available datasets... done
Parse air-1.4.2... wrote docs/audb-public/air.rst
Parse clac-1.1.0... wrote docs/audb-public/clac.rst
Parse cmu-mosei-1.2.4... wrote docs/audb-public/cmu-mosei.rst
Parse cmu-mosi-1.1.1... wrote docs/audb-public/cmu-mosi.rst
Parse cochlscene-1.0.0... wrote docs/audb-public/cochlscene.rst
Parse cough-speech-sneeze-2.0.1... wrote docs/audb-public/cough-speech-sneeze.rst
Parse crema-d-1.3.0... wrote docs/audb-public/crema-d.rst
Parse css10-1.0.0... wrote docs/audb-public/css10.rst
Parse eesc-1.0.1... wrote docs/audb-public/eesc.rst
Parse emodb-2.0.0...
Extension error (audbcards.sphinx):
Handler <function builder_inited at 0x7685bfe07130> for event 'builder-inited' threw an exception (exception: issubclass() arg 1 must be a class)it is related to audeering/audformat#483 As a workaround, the affected cache can be deleted, e.g. here it would be $ rm -rf ~/audb/emodb |
I do not encounter this right now. I have tried it on two separate machines, |
|
If you don't get the error, it's also fine :) I think it depends on the used pandas versions and will be solved anyway with the next |
Code reviewFound 1 issue:
The docstring at line 297 says "Content of example json file from dataset" but the property actually returns a file path (string), not the file content. This will confuse users who expect to receive file content instead of a path. The docstring should say "Path to example json file" or "Example json file from dataset" to match the actual behavior. audbcards/audbcards/core/dataset.py Lines 295 to 299 in 3b6a649 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
|
Good catch. First I had the idea to include the json file content instead of the file, but then I realized that we have the file for |
This was an experiment of getting the claude code review to work, so credits to claude. I would proceed with approval, but I am having some doctest failues right now in In other words, 2.0.0 loads but 1.4.1 refuses to. It is related to the merge request that you link to above - but the error is different. Still it will probably be better treated in the audeering/audformat#483 I take it? |
|
Yes, this is very likely the error described in audeering/audformat#483 and indepednent of |
ChristianGeng
left a comment
There was a problem hiding this comment.
Review
To me this looks sound.
There were a few cosmetic changes but nothing earthshaking.
So I can proceed with approval.
As we are interested in not only showing an example audio/video file on a datacard, but also the content (or an excerpt of that content) of a json file, this pull request adds
audbcards.Dataset.example_jsonto randomly select a json file from a dataset.We do not have yet a public dataset that contains a json file (this will change with audeering/datasets#29), but it can be tested with our internal datasets:
Summary by Sourcery
Add support for selecting example JSON files from datasets and reuse archive file counting logic across example selections.
New Features:
Enhancements:
Tests: