remove duplicate datasets from datasets.csv during aplose export #246
base: main
Conversation
If different …

```python
info.path = info.path.map(str)
meta = pd.concat(
    [meta[meta.path != str(info.iloc[0].path)], info], ignore_index=True
)
```

should be changed to:

```python
info.spectro_duration = info.spectro_duration.map(int)
info.dataset_sr = info.dataset_sr.map(int)
info.path = info.path.map(str)
meta = pd.concat(
    (
        meta[
            (meta.path != str(info.iloc[0].path))
            | (meta.spectro_duration != info.iloc[0].spectro_duration)
            | (meta.dataset_sr != info.iloc[0].dataset_sr)
        ],
        info,
    ),
    ignore_index=True,
)
```

For some reason (❔), …
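The effect of the extended filter can be checked on a toy frame. This is a minimal sketch with illustrative values, not data from a real `datasets.csv`: two existing entries share a path but differ in sample rate, and a new `info` row re-exports the 500 Hz configuration.

```python
import pandas as pd

# Hypothetical existing datasets.csv content (values are illustrative).
meta = pd.DataFrame({
    "path": ["ds_a", "ds_a"],
    "spectro_duration": [60, 60],
    "dataset_sr": [500, 144000],
})
# New export: same path and duration, same sample rate as the first row.
info = pd.DataFrame({
    "path": ["ds_a"],
    "spectro_duration": [60],
    "dataset_sr": [500],
})

info.spectro_duration = info.spectro_duration.map(int)
info.dataset_sr = info.dataset_sr.map(int)
info.path = info.path.map(str)

# Only the row matching path AND duration AND sample rate is dropped and
# replaced by `info`; the 144 kHz entry survives.
meta = pd.concat(
    (
        meta[
            (meta.path != str(info.iloc[0].path))
            | (meta.spectro_duration != info.iloc[0].spectro_duration)
            | (meta.dataset_sr != info.iloc[0].dataset_sr)
        ],
        info,
    ),
    ignore_index=True,
)
print(len(meta))  # 2: the 144 kHz row plus the refreshed 500 Hz row
```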
@Gautzilla To me, spectro_duration / dataset_sr are those of the segments that you have generated, not the original audios. It's true though that we should at least add a begin_datetime / end_datetime, and maybe original_sr / original_duration, to differentiate any dataset? @ElodieENSTA would that be an issue for you if we were to add new columns?
No problem on the APLOSE side: if a CSV has more columns, it will just read the required ones. Just let me know if we need these columns to be loaded and saved by APLOSE.
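The "read the required ones" behaviour can be sketched with pandas: a reader that requests only the columns it needs silently ignores any extra ones. The column names below are assumptions for illustration, not APLOSE's actual schema or reading code.

```python
import io
import pandas as pd

# A CSV with an extra (hypothetical) begin_datetime column appended.
csv = io.StringIO(
    "path,dataset_sr,spectro_duration,begin_datetime\n"
    "ds_a,500,60,2024-01-01T00:00:00\n"
)

# Requesting only the required columns: the extra one is never loaded.
required = ["path", "dataset_sr"]
df = pd.read_csv(csv, usecols=required)
print(list(df.columns))  # ['path', 'dataset_sr']
```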
Ok, so I pushed ef5b5d8 to include the …

The thing is that, as of now, initializing a dataset with identical parameters but on a different time range will overwrite the previous files (I don't know if that was initially an intended behaviour?), so adding them to the …
```python
meta = pd.concat(
    (
        meta[
            (meta.path != str(info.iloc[0].path))
```
I'm wondering if the purpose here is not rather to handle multiple spectro_duration × dataset_sr configurations for a same dataset (name AND path).
Yup, that's the point, but that's already what the changes do (unless I misunderstood your statement! 🥸):

```python
pd.concat(
    (
        meta[  # Keeps any of the datasets that differ either in:
            (meta.path != str(info.iloc[0].path))  # path (includes name)
            | (meta.spectro_duration != info.iloc[0].spectro_duration)  # OR duration
            | (meta.dataset_sr != info.iloc[0].dataset_sr)  # OR sample rate
        ],
        info,  # Adds the current dataset
    )
)
```

This way, if one creates a new dataset that only differs in sample rate, it will be added in addition to the previous dataset, since `(meta.dataset_sr != info.iloc[0].dataset_sr)` will return True.
The changes in this PR make sure all previous entries in `/home/datawork-osmose/dataset/datasets.csv` that share the same path/name as the dataset from which spectrograms are generated are removed before updating the dataframe.

However, I'm not sure I understand something: the `datasets.csv` has `spectro_duration` and `dataset_sr` columns. Is this supposed to keep track of the original audio only? Let's say I have 14 days of original audio with 2h-long files at 144 kHz, and I want to plot some spectrograms of the 1st hour at 500 Hz with 1min-long audio files: is this a new dataset? Or an update of the previous one? Or something that shouldn't even be added to the `datasets.csv` file?

Since the entries are added to this file when `generate_spectro` is called (not when `Dataset.build()` is called), I'm a bit lost.

The current PR only watches for the dataset path, so in the previous scenario, the dataset entry in `datasets.csv` would be updated to a 1min-500Hz dataset.
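If "one entry per (path, spectro_duration, dataset_sr) configuration" is indeed the intended semantics, an alternative sketch (plainly not what the PR implements, just an equivalent formulation) is to append first and then deduplicate on that key, keeping the newest entry. The frames below are illustrative:

```python
import pandas as pd

# Illustrative existing entries: same path, two sample-rate configurations.
meta = pd.DataFrame({
    "path": ["ds_a", "ds_a"],
    "spectro_duration": [60, 60],
    "dataset_sr": [500, 144000],
})
# New export re-running the 500 Hz configuration.
info = pd.DataFrame({
    "path": ["ds_a"],
    "spectro_duration": [60],
    "dataset_sr": [500],
})

# Append, then keep the last row for each full configuration key:
# the stale 500 Hz entry is dropped, the 144 kHz entry is untouched.
meta = pd.concat((meta, info), ignore_index=True).drop_duplicates(
    subset=["path", "spectro_duration", "dataset_sr"],
    keep="last",
    ignore_index=True,
)
print(len(meta))  # 2
```

Expressing the key explicitly in `subset` also makes it easy to extend later (e.g. with a hypothetical `begin_datetime` column, if that gets added).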