Dataset directory structure #215

@Gautzilla

The current dataset directory structure has some flaws. For example, running an analysis that differs from a previous one only in its time period requires overwriting the previous analysis.

In this issue, I try to expose these flaws by using an example dataset in which I run 4 analyses that differ by the audio parameters (time duration, sample rate) and/or by the fft parameters (in that case no reshaping of the audio files is needed).
I'll first describe the analyses and the original dataset, then show the code snippets matching each analysis, and then the directory structure that results from these analyses.

Finally, I've added 2 draft directory structures:

  • a first one that is built on top of the existing structure and simply adds the layers the current structure doesn't consider
  • a second structure that implies more changes, where directories are analysis-based

What do you, as OSEkit users, think of these draft structures?

Example

An original dataset, from which 4 analyses are run:

| Analysis | Description |
|----------|-------------|
| A1 | Different audio length from the original. Different start/end times from the original. |
| A2 | Same audio parameters as A1: no reshaping needed. Only the fft parameters change. |
| B | Different start/end times from A1 and A2: reshaping needed. |
| C | Different audio parameters from A1, A2 and B: reshaping needed. |

Original dataset:

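from pandas import Timestamp  # assumed: Timestamp is pandas.Timestamp

audio_file_length = 3_600
sampling_frequency = 128_000
# ISO dates (YYYY-MM-DD) avoid any day/month ambiguity when parsed
t_start = Timestamp("2023-01-01 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")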

Analyses:

Analysis A1

# Different time period than original
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters than original
audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Analysis A2

# Same time period and audio parameters as A1: audio files don't need to be reshaped.
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

# Only fft parameters differ from analysis A1

nfft = 1_024
window_size = 2_048
overlap = 50
zoom_level = 5
scale = 'log'

Analysis B

# Different time period: reshape needed.

t_start = Timestamp("2023-01-03 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Analysis C

t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters: reshape needed.

audio_length = 900
sampling_frequency = 64_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Current directory structure

dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ audio_1a.wav
    │   │   │   ├╴ audio_2a.wav
    │   │   │   ├╴ ...
    │   │   │   ├╴ metadata.csv
    │   │   │   └╴ timestamp.csv
    │   │   ├╴ 900_64000
    │   │   │   └╴ ...
    │   │   └╴ 3600_128000
    │   │       ├╴ audio_1.wav
    │   │       ├╴ audio_2.wav
    │   │       ├╴ ...
    │   │       ├╴ file_metadata.csv
    │   │       ├╴ metadata.csv
    │   │       └╴ timestamp.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ log
    └╴ processed
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrogram
            ├╴ 1800_128000
            │   ├╴ 1024_4096_20_linear
            │   │   ├╴ image
            │   │   │   ├╴ spectro_A1_1.png
            │   │   │   ├╴ spectro_A1_2.png
            │   │   │   └╴ ...
            │   │   ├╴ matrix
            │   │   └╴ metadata.csv
            │   └╴ 1024_2048_50_linear
            │       └╴ ...
            └╴ 900_64000
                └╴ 1024_4096_20_linear
                    └╴ ...

Problems:

  • No support for t_start and t_stop
    • Initializing Analysis B implies overwriting Analysis A1 (same 1800_128000 folder, different time period)
  • Not all spectrogram parameters are supported (e.g. zoom level, y-axis method)
  • Some names could be more explicit:
    • processed: could be replaced by output.
    • spectrogram and matrix could be grouped under a higher-level spectrum folder
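
The first problem can be reproduced without touching any file. The sketch below is only an illustration: `current_audio_folder` is a hypothetical helper (not an OSEkit function) that mimics the `<duration>_<sample rate>` naming visible in the tree above, and the analysis parameters are copied from the example. It lists the audio folders that would have to hold more than one time period.

```python
from collections import defaultdict
from datetime import datetime


def current_audio_folder(audio_length: int, sampling_frequency: int) -> str:
    """Hypothetical helper mimicking the current naming: <duration>_<sample rate>."""
    return f"{audio_length}_{sampling_frequency}"


# (t_start, t_stop, audio_length, sampling_frequency) for each example analysis.
analyses = {
    "A1": (datetime(2023, 1, 2, 0), datetime(2023, 1, 2, 12), 1_800, 128_000),
    "A2": (datetime(2023, 1, 2, 0), datetime(2023, 1, 2, 12), 1_800, 128_000),
    "B": (datetime(2023, 1, 3, 0), datetime(2023, 1, 3, 12), 1_800, 128_000),
    "C": (datetime(2023, 1, 2, 0), datetime(2023, 1, 2, 12), 900, 64_000),
}

# Collect the distinct time periods that map to each audio folder.
periods_per_folder = defaultdict(set)
for t_start, t_stop, audio_length, sampling_frequency in analyses.values():
    folder = current_audio_folder(audio_length, sampling_frequency)
    periods_per_folder[folder].add((t_start, t_stop))

# A folder with more than one period cannot keep both sets of reshaped audio.
conflicts = sorted(f for f, periods in periods_per_folder.items() if len(periods) > 1)
print(conflicts)  # ['1800_128000']
```

A1 and B share the 1800_128000 folder although their time periods differ, which is why one has to overwrite the other.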

Draft modifications of existing structure

  • Add subfolders for the start and end dates of the analysis.
  • Add the missing spectrogram parameters to folder names.
  • Rename some folders (e.g. processed -> output).

dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │   │   ├╴ audio_1a.wav
    │   │   │   │   ├╴ audio_2a.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   ├╴ analysis_metadata.csv
    │   │   │   │   └╴ file_metadata.csv
    │   │   │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │   │   │       └╴ ...
    │   │   ├╴ 900_64000
    │   │   │   └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │       └╴ ...
    │   │   └╴ 3600_128000_original
    │   │       └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │   │           ├╴ audio_1.wav
    │   │           ├╴ audio_2.wav
    │   │           ├╴ ...
    │   │           ├╴ analysis_metadata.csv
    │   │           └╴ file_metadata.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ logs
    └╴ output
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrum
            ├╴ 1800_128000
            │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            │   │   ├╴ 1024_4096_20_0_linear
            │   │   │   ├╴ spectrogram
            │   │   │   │   ├╴ spectro_A1_1.png
            │   │   │   │   ├╴ spectro_A1_2.png
            │   │   │   │   └╴ ...
            │   │   │   ├╴ matrix
            │   │   │   └╴ spectrum_metadata.csv
            │   │   └╴ 1024_2048_50_5_log
            │   │       └╴ ...
            │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
            │       └╴ 1024_4096_20_0_linear
            │           └╴ ...
            └╴ 900_64000
                └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
                    └╴ 1024_4096_20_0_linear
                        └╴ ...
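
The folder names of this draft can be derived from the analysis parameters alone. Here is a minimal sketch, assuming hypothetical helper names (`period_folder`, `spectrum_folder` and `spectrum_path` are not OSEkit functions) and the parameter order nfft, window size, overlap, zoom level, scale read off the tree above:

```python
from datetime import datetime
from pathlib import PurePosixPath

# Date layout read off the draft tree, e.g. 2023-01-02_00-00-00.
DATE_FORMAT = "%Y-%m-%d_%H-%M-%S"


def period_folder(t_start: datetime, t_stop: datetime) -> str:
    """Folder name encoding the analysis time period."""
    return f"{t_start.strftime(DATE_FORMAT)}__{t_stop.strftime(DATE_FORMAT)}"


def spectrum_folder(nfft, window_size, overlap, zoom_level, scale) -> str:
    """Folder name encoding every spectrogram parameter of the example."""
    return f"{nfft}_{window_size}_{overlap}_{zoom_level}_{scale}"


def spectrum_path(root, audio_length, sampling_frequency, t_start, t_stop, **fft):
    """Hypothetical path of one spectrum output in the draft structure."""
    return (
        PurePosixPath(root)
        / "output"
        / "spectrum"
        / f"{audio_length}_{sampling_frequency}"
        / period_folder(t_start, t_stop)
        / spectrum_folder(**fft)
    )


fft_a1 = dict(nfft=1_024, window_size=4_096, overlap=20, zoom_level=0, scale="linear")
path_a1 = spectrum_path(
    "dataset", 1_800, 128_000, datetime(2023, 1, 2, 0), datetime(2023, 1, 2, 12), **fft_a1
)
# Analysis B: same audio and fft parameters as A1, different time period.
path_b = spectrum_path(
    "dataset", 1_800, 128_000, datetime(2023, 1, 3, 0), datetime(2023, 1, 3, 12), **fft_a1
)
print(path_a1)
print(path_b)
```

With the time period in the path, A1 and B no longer resolve to the same folder.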

Remarks

There still are some flaws in this structure:

  • Should LTAS be put in special directories, or just a spectrum directory with timestamps / file duration / sample rate that match the original data?
  • Should all adjustment spectrograms be put in the same folder?
  • The metadata.csv name is used several times for different purposes
    • In the original data folder, file_metadata.csv and timestamp.csv contain redundant information. Keep only file_metadata.csv?
    • Replace xxx_metadata.csv files with xxx.json files that could be used for serializing Python classes? (e.g., an analysis_dataset.json file in each analysis folder that can be parsed to a Dataset object in OSEkit).
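
To make the JSON idea more concrete, here is a minimal sketch of a round trip between a Python class and a JSON document, using only the standard library. `AnalysisMetadata` and its fields are assumptions made for this illustration (the field names are taken from the snippets above); this is not the OSEkit `Dataset` class.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass(frozen=True)
class AnalysisMetadata:
    """Hypothetical content of an analysis JSON file."""

    t_start: datetime
    t_stop: datetime
    audio_length: int
    sampling_frequency: int

    def to_json(self) -> str:
        payload = asdict(self)
        # datetime objects are not JSON-serializable: store ISO 8601 strings.
        payload["t_start"] = self.t_start.isoformat()
        payload["t_stop"] = self.t_stop.isoformat()
        return json.dumps(payload, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "AnalysisMetadata":
        payload = json.loads(text)
        payload["t_start"] = datetime.fromisoformat(payload["t_start"])
        payload["t_stop"] = datetime.fromisoformat(payload["t_stop"])
        return cls(**payload)


a1 = AnalysisMetadata(
    t_start=datetime(2023, 1, 2, 0),
    t_stop=datetime(2023, 1, 2, 12),
    audio_length=1_800,
    sampling_frequency=128_000,
)
serialized = a1.to_json()
restored = AnalysisMetadata.from_json(serialized)
print(restored == a1)  # True
```

Compared to a CSV file, the JSON document keeps the types of the fields explicit and maps one-to-one to the constructor arguments.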

Draft new structure

  • Takes the remarks above into account
  • One directory per analysis (dataset\audiolength_samplerate\tstart_tend\ corresponds to one call to the reshaper module).
    • These directories include both the data and output folders.
  • Marks the original dataset explicitly (it could also hold analyses with output etc.)

dataset
    ├╴ 3600_128000_original
    │   └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ log
    │       └╴ output
    ├╴ 1800_128000
    │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   ├╴ analysis.json
    │   │   ├╴ data
    │   │   │   ├╴ audio
    │   │   │   │   ├╴ audio_1.wav
    │   │   │   │   ├╴ audio_2.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   └╴ audio.json
    │   │   │   └╴ auxiliary
    │   │   ├╴ output
    │   │   │   ├╴ 1024_4096_20_0_linear
    │   │   │   │   ├╴ spectrogram
    │   │   │   │   │   ├╴ spectrogram_1.png
    │   │   │   │   │   ├╴ spectrogram_2.png
    │   │   │   │   │   └╴ ...
    │   │   │   │   ├╴ matrix
    │   │   │   │   └╴ spectrum.json
    │   │   │   └╴ 1024_2048_50_5_log
    │   │   │       ├╴ spectrogram
    │   │   │       │   ├╴ spectrogram_1.png
    │   │   │       │   └╴ ...
    │   │   │       ├╴ matrix
    │   │   │       └╴ spectrum.json
    │   │   └╴ log
    │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ output
    │       │   └╴ 1024_4096_20_0_linear
    │       │       ├╴ spectrogram
    │       │       │   ├╴ spectrogram_1.png
    │       │       │   ├╴ spectrogram_2.png
    │       │       │   └╴ ...
    │       │       ├╴ matrix
    │       │       └╴ spectrum.json
    │       └╴ log
    └╴ 900_64000
        └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            ├╴ analysis.json
            ├╴ data
            │   ├╴ audio
            │   │   ├╴ audio_1.wav
            │   │   ├╴ audio_2.wav
            │   │   ├╴ ...
            │   │   └╴ audio.json
            │   └╴ auxiliary
            ├╴ output
            │   └╴ 1024_4096_20_0_linear
            │       ├╴ spectrogram
            │       │   ├╴ spectrogram_1.png
            │       │   ├╴ spectrogram_2.png
            │       │   └╴ ...
            │       ├╴ matrix
            │       └╴ spectrum.json
            └╴ log
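
One consequence of this analysis-based layout is that the existing analyses can be listed by looking for analysis.json two levels below the dataset root. The sketch below illustrates this on a temporary directory; `analysis_dir` and `list_analyses` are hypothetical helpers, not part of OSEkit, and the empty analysis.json files only stand in for real metadata.

```python
import tempfile
from pathlib import Path


def analysis_dir(root: Path, audio_length: int, sampling_frequency: int,
                 period: str, original: bool = False) -> Path:
    """Hypothetical location of one analysis: <root>/<audio params>/<period>."""
    suffix = "_original" if original else ""
    return root / f"{audio_length}_{sampling_frequency}{suffix}" / period


def list_analyses(root: Path) -> list[Path]:
    """Every folder two levels below root that holds an analysis.json file."""
    return sorted(p.parent for p in root.glob("*/*/analysis.json"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    layout = [
        (3_600, 128_000, "2023-01-01_00-00-00__2023-01-03_12-00-00", True),
        (1_800, 128_000, "2023-01-02_00-00-00__2023-01-02_12-00-00", False),
        (1_800, 128_000, "2023-01-03_00-00-00__2023-01-03_12-00-00", False),
        (900, 64_000, "2023-01-02_00-00-00__2023-01-02_12-00-00", False),
    ]
    for audio_length, sampling_frequency, period, original in layout:
        folder = analysis_dir(root, audio_length, sampling_frequency, period, original)
        # Each analysis owns its data, output and log folders.
        for sub in ("data/audio", "data/auxiliary", "output", "log"):
            (folder / sub).mkdir(parents=True)
        (folder / "analysis.json").touch()

    found = [p.relative_to(root).as_posix() for p in list_analyses(root)]

for entry in found:
    print(entry)
```

Analyses A1 and A2 share a single folder here (same audio parameters and time period) and only differ by their subfolders in output.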

Labels: APLOSE related (the changes impact APLOSE behavior), long term (issue that takes a long time to be corrected)
