Skip to content

Conversation

@Gautzilla
Copy link
Contributor

The changes in this PR make sure all previous entries in /home/datawork-osmose/dataset/datasets.csv that share the same path/name than the dataset from which spectrograms are generated are removed before updating the dataframe.

However, I'm not sure to understand something:

The datasets.csv has spectro_duration and dataset_sr columns: Is this supposed to keep track of the original audio only? Let's say I have 14 days of original audio with 2h-long files at 144 kHz, and I want to plot some spectrograms of the 1st hour at 500 Hz with 1min-long audio files: is this a new dataset? Or an update of the previous one? Or something that shouldn't even be added in the datasets.csv file?
Since the entries are added to this file when generate_spectro is called (not when Dataset.build() is called), I'm a bit lost.

The current PR only watch for the dataset path, so in the previous scenario, the dataset entry in datasets.csv would be updated to a 1min-500Hz dataset.

@Gautzilla Gautzilla added the APLOSE related The changes are impacted APLOSE behavior label Feb 13, 2025
@Gautzilla Gautzilla self-assigned this Feb 13, 2025
@Gautzilla
Copy link
Contributor Author

Gautzilla commented Feb 13, 2025

If different spectro_duration / dataset_sr means different datasets:

info.path = info.path.map(str)
meta = pd.concat(
    [meta[meta.path != str(info.iloc[0].path)], info], ignore_index=True
)

Should be changed to:

info.spectro_duration = info.spectro_duration.map(int)
        info.dataset_sr = info.dataset_sr.map(int)
        info.path = info.path.map(str)
        meta = pd.concat(
            (
                meta[
                    (meta.path != str(info.iloc[0].path))
                    | (meta.spectro_duration != info.iloc[0].spectro_duration)
                    | (meta.dataset_sr != info.iloc[0].dataset_sr)
                ],
                info,
            ),
            ignore_index=True,
        )

For some reason (❔), spectro_duration and dataset_sr are cast into strings in the dataset_info DataFrame, hence the map lines. Since the code is split between OSEkit and the datarmor-toolkit, it avoids having to make coordinate changes to the two repos..

Copy link
Contributor

@mathieudpnt mathieudpnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gautzilla to me spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios. It's true though that we should at least add a begin_datetime / end_datetime and maybe orignal_sr/original_duration to differentiate any dataset ? @ElodieENSTA would that be an issue for you if we were to add new columns ?

@ElodieENSTA
Copy link
Member

@Gautzilla to me spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios. It's true though that we should at least add a begin_datetime / end_datetime and maybe orignal_sr/original_duration to differentiate any dataset ? @ElodieENSTA would that be an issue for you if we were to add new columns ?

No problem on APLOSE side if any CSV has more columns, it will just read the required ones. Just let me know if we need these columns to be loaded and saved by APLOSE

@Gautzilla
Copy link
Contributor Author

@Gautzilla to me spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios.

Ok, so I pushed ef5b5d8 to include the spectro_duration/dataset_sr in the filtering process.

It's true though that we should at least add a begin_datetime / end_datetime and maybe orignal_sr/original_duration to differentiate any dataset ?

The thing is that as of now, initializing a dataset with identical parameters but on a different time range will overwrite the previous files (I don't know if it initially was an intended behaviour ?), so adding them to the datasets.csv file would refer to non-existing files, right?

meta = pd.concat(
(
meta[
(meta.path != str(info.iloc[0].path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the purpose here is not rather to handle multiple spectroduration X dataset_sr configuration for a same dataset (name AND path).

Copy link
Contributor Author

@Gautzilla Gautzilla Apr 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's the point, but that's already what the changes do (unless I misunderstood your statement! 🥸):

pd.concat(
            (
                meta[  # Keeps any of the datasets that differs either in:
                    (meta.path != str(info.iloc[0].path)) # Path (includes name)
                    | (meta.spectro_duration != info.iloc[0].spectro_duration) # OR duration
                    | (meta.dataset_sr != info.iloc[0].dataset_sr) # OR sample rate
                ],
                info, # Adds the current dataset
            )
)

This way, if one creates a new dataset that only differs in sample rate, it will be added in addition to the previous dataset since (meta.dataset_sr != info.iloc[0].dataset_sr) will return True.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

APLOSE related The changes are impacted APLOSE behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants