remove duplicate datasets from datasets.csv during aplose export #246

Gautzilla · 2025-02-13T14:28:42Z

The changes in this PR make sure all previous entries in /home/datawork-osmose/dataset/datasets.csv that share the same path/name than the dataset from which spectrograms are generated are removed before updating the dataframe.

However, I'm not sure to understand something:

The datasets.csv has spectro_duration and dataset_sr columns: Is this supposed to keep track of the original audio only? Let's say I have 14 days of original audio with 2h-long files at 144 kHz, and I want to plot some spectrograms of the 1st hour at 500 Hz with 1min-long audio files: is this a new dataset? Or an update of the previous one? Or something that shouldn't even be added in the datasets.csv file?
Since the entries are added to this file when generate_spectro is called (not when Dataset.build() is called), I'm a bit lost.

The current PR only watch for the dataset path, so in the previous scenario, the dataset entry in datasets.csv would be updated to a 1min-500Hz dataset.

Gautzilla · 2025-02-13T14:48:35Z

If different spectro_duration / dataset_sr means different datasets:

info.path = info.path.map(str)
meta = pd.concat(
    [meta[meta.path != str(info.iloc[0].path)], info], ignore_index=True
)

Should be changed to:

info.spectro_duration = info.spectro_duration.map(int)
        info.dataset_sr = info.dataset_sr.map(int)
        info.path = info.path.map(str)
        meta = pd.concat(
            (
                meta[
                    (meta.path != str(info.iloc[0].path))
                    | (meta.spectro_duration != info.iloc[0].spectro_duration)
                    | (meta.dataset_sr != info.iloc[0].dataset_sr)
                ],
                info,
            ),
            ignore_index=True,
        )

For some reason (❔), spectro_duration and dataset_sr are cast into strings in the dataset_info DataFrame, hence the map lines. Since the code is split between OSEkit and the datarmor-toolkit, it avoids having to make coordinate changes to the two repos..

mathieudpnt

@Gautzilla to me spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios. It's true though that we should at least add a begin_datetime / end_datetime and maybe orignal_sr/original_duration to differentiate any dataset ? @ElodieENSTA would that be an issue for you if we were to add new columns ?

ElodieENSTA · 2025-03-03T08:42:39Z

@Gautzilla to me spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios. It's true though that we should at least add a begin_datetime / end_datetime and maybe orignal_sr/original_duration to differentiate any dataset ? @ElodieENSTA would that be an issue for you if we were to add new columns ?

No problem on APLOSE side if any CSV has more columns, it will just read the required ones. Just let me know if we need these columns to be loaded and saved by APLOSE

Gautzilla · 2025-03-10T11:00:59Z

@Gautzilla to me spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios.

Ok, so I pushed ef5b5d8 to include the spectro_duration/dataset_sr in the filtering process.

It's true though that we should at least add a begin_datetime / end_datetime and maybe orignal_sr/original_duration to differentiate any dataset ?

The thing is that as of now, initializing a dataset with identical parameters but on a different time range will overwrite the previous files (I don't know if it initially was an intended behaviour ?), so adding them to the datasets.csv file would refer to non-existing files, right?

ElodieENSTA · 2025-04-15T09:54:52Z

src/OSmOSE/utils/core_utils.py

+        meta = pd.concat(
+            (
+                meta[
+                    (meta.path != str(info.iloc[0].path))


I'm wondering if the purpose here is not rather to handle multiple spectroduration X dataset_sr configuration for a same dataset (name AND path).

Yup, that's the point, but that's already what the changes do (unless I misunderstood your statement! 🥸):

pd.concat( ( meta[ # Keeps any of the datasets that differs either in: (meta.path != str(info.iloc[0].path)) # Path (includes name) | (meta.spectro_duration != info.iloc[0].spectro_duration) # OR duration | (meta.dataset_sr != info.iloc[0].dataset_sr) # OR sample rate ], info, # Adds the current dataset ) )

This way, if one creates a new dataset that only differs in sample rate, it will be added in addition to the previous dataset since (meta.dataset_sr != info.iloc[0].dataset_sr) will return True.

remove duplicate datasets from datasets.csv during aplose export

df6dc38

Gautzilla added the APLOSE related The changes are impacted APLOSE behavior label Feb 13, 2025

Gautzilla requested review from ElodieENSTA, cazaudo and mathieudpnt February 13, 2025 14:28

Gautzilla self-assigned this Feb 13, 2025

add spectro_duration and dataset_sr in dataset filtering process

ef5b5d8

mathieudpnt requested changes Mar 3, 2025

View reviewed changes

Merge branch 'main' into remove-duplicates-from-datasets-csv

5402327

ElodieENSTA reviewed Apr 15, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

remove duplicate datasets from datasets.csv during aplose export #246

remove duplicate datasets from datasets.csv during aplose export #246

Uh oh!

Gautzilla commented Feb 13, 2025

Uh oh!

Gautzilla commented Feb 13, 2025 •

edited

Loading

Uh oh!

mathieudpnt left a comment •

edited

Loading

Uh oh!

ElodieENSTA commented Mar 3, 2025

Uh oh!

Gautzilla commented Mar 10, 2025

Uh oh!

ElodieENSTA Apr 15, 2025

Uh oh!

Gautzilla Apr 15, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

remove duplicate datasets from datasets.csv during aplose export #246

Are you sure you want to change the base?

remove duplicate datasets from datasets.csv during aplose export #246

Uh oh!

Conversation

Gautzilla commented Feb 13, 2025

Uh oh!

Gautzilla commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mathieudpnt left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ElodieENSTA commented Mar 3, 2025

Uh oh!

Gautzilla commented Mar 10, 2025

Uh oh!

ElodieENSTA Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

Gautzilla Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Gautzilla commented Feb 13, 2025 •

edited

Loading

mathieudpnt left a comment •

edited

Loading

Gautzilla Apr 15, 2025 •

edited

Loading