Dataset directory structure

The current dataset directory structure has some flaws. For example, running an analysis that differs from a previous one only in its time period requires overwriting the previous analysis.

In this issue, I illustrate these flaws with an example dataset on which I run 4 analyses that differ in their audio parameters (file duration, sample rate) and/or in their FFT parameters (when only the FFT parameters change, the audio files do not need to be reshaped).
I'll first describe the analyses and the original dataset, then show the code snippet matching each analysis, and finally the directory structure that results from these analyses.

Finally, I've added 2 draft directory structures:
- a first one that is built on top of the existing structure and simply adds the layers the current structure doesn't consider
- a second one that implies more changes, where directories are analysis-based

What do you, as OSEkit users, think of these draft structures?

# Example

An original dataset, from which 4 analyses are run:

| Analysis |                                         Description                                          |
|:--------:|:--------------------------------------------------------------------------------------------:|
|    A1    | Different audio length from the original<br/>Different start/end times from the original    |
|    A2    | Same audio parameters as A1: no reshaping needed. Only the FFT parameters change             |
|    B     | Different start/end times from A1 and A2: reshaping needed                                   |
|    C     | Different audio parameters from A1, A2 and B: reshaping needed                               |

## Original Dataset :
```py
from pandas import Timestamp

audio_file_length = 3_600  # seconds
sampling_frequency = 128_000  # Hz
t_start = Timestamp("2023-01-01 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")
```

## Analyses :

### Analysis A1
```py
# Different time period from the original
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters from the original
audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'
```
### Analysis A2
```py
# Same time period and audio parameters as A1: audio files don't need to be reshaped.
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

# Only fft parameters differ from analysis A1

nfft = 1_024
window_size = 2_048
overlap = 50
zoom_level = 5
scale = 'log'
```

### Analysis B
```py
# Different time period: reshape needed.

t_start = Timestamp("2023-01-03 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'
```

### Analysis C
```py
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters: reshape needed.

audio_length = 900
sampling_frequency = 64_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'
```
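
The "reshaping needed" column of the table can be expressed as a simple comparison. This is only a sketch: `AudioParams` and `needs_reshape` are hypothetical names, not part of OSEkit, and the sketch assumes that reshaping is driven solely by the time period, the audio length and the sample rate.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AudioParams:
    """Hypothetical container for the parameters that drive reshaping."""

    t_start: datetime
    t_stop: datetime
    audio_length: int  # seconds
    sampling_frequency: int  # Hz


def needs_reshape(new: AudioParams, existing: list[AudioParams]) -> bool:
    """Reshaping is needed unless an earlier analysis used identical audio parameters."""
    return new not in existing


a1 = AudioParams(datetime(2023, 1, 2), datetime(2023, 1, 2, 12), 1_800, 128_000)
a2 = AudioParams(datetime(2023, 1, 2), datetime(2023, 1, 2, 12), 1_800, 128_000)
b = AudioParams(datetime(2023, 1, 3), datetime(2023, 1, 3, 12), 1_800, 128_000)
c = AudioParams(datetime(2023, 1, 2), datetime(2023, 1, 2, 12), 900, 64_000)

print(needs_reshape(a2, [a1]))      # A2 reuses the audio files of A1
print(needs_reshape(b, [a1, a2]))   # B covers another time period
print(needs_reshape(c, [a1, a2, b]))  # C uses another duration and sample rate
```
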
# Current directory structure

````
dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ audio_1a.wav
    │   │   │   ├╴ audio_2a.wav
    │   │   │   ├╴ ...
    │   │   │   ├╴ metadata.csv
    │   │   │   └╴ timestamp.csv
    │   │   ├╴ 900_64000
    │   │   │   └╴ ...
    │   │   └╴ 3600_128000
    │   │       ├╴ audio_1.wav
    │   │       ├╴ audio_2.wav
    │   │       ├╴ ...
    │   │       ├╴ file_metadata.csv
    │   │       ├╴ metadata.csv
    │   │       └╴ timestamp.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ log
    └╴ processed
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrogram
            ├╴ 1800_128000
            │   ├╴ 1024_4096_20_linear
            │   │   ├╴ image
            │   │   │   ├╴ spectro_A1_1.png
            │   │   │   ├╴ spectro_A1_2.png
            │   │   │   └╴ ...
            │   │   ├╴ matrix
            │   │   └╴ metadata.csv
            │   └╴ 1024_2048_50_linear
            │       └╴ ...
            └╴ 900_64000
                └╴ 1024_4096_20_linear
                    └╴ ...
````

## Problems:

- No support for `t_start` and `t_stop`
  - Initializing Analysis B implies overwriting Analysis A1 (same audio parameters, different time period)
- Not all spectrogram parameters are supported (e.g. zoom level, y-axis method)
- Some names could be more explicit:
  - `processed`: could be replaced by `output`.
  - `spectrogram` and `matrix` could fall into a `spectrum` upper level


# Draft modifications of existing structure

- Adds subfolders for the start and end dates of the analysis
- Adds the missing spectrogram parameters to folder names
- Renames some folders (e.g. _processed_ -> _output_)

````
dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │   │   ├╴ audio_1a.wav
    │   │   │   │   ├╴ audio_2a.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   ├╴ analysis_metadata.csv
    │   │   │   │   └╴ file_metadata.csv
    │   │   │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │   │   │       └╴ ...
    │   │   ├╴ 900_64000
    │   │   │   └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │       └╴ ...
    │   │   └╴ 3600_128000_original
    │   │       └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │   │           ├╴ audio_1.wav
    │   │           ├╴ audio_2.wav
    │   │           ├╴ ...
    │   │           ├╴ analysis_metadata.csv
    │   │           └╴ file_metadata.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ logs
    └╴ output
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrum
            ├╴ 1800_128000
            │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            │   │   ├╴ 1024_4096_20_0_linear
            │   │   │   ├╴ spectrogram
            │   │   │   │   ├╴ spectro_A1_1.png
            │   │   │   │   ├╴ spectro_A1_2.png
            │   │   │   │   └╴ ...
            │   │   │   ├╴ matrix
            │   │   │   └╴ spectrum_metadata.csv
            │   │   └╴ 1024_2048_50_5_log
            │   │       └╴ ...
            │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
            │       └╴ 1024_4096_20_0_linear
            │           └╴ ...
            └╴ 900_64000
                └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
                    └╴ 1024_4096_20_0_linear
                        └╴ ...
````
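
The folder names in this draft can be derived from the analysis parameters. Here is a minimal sketch of that derivation, assuming the `%Y-%m-%d_%H-%M-%S` timestamp format and the `nfft_windowsize_overlap_zoomlevel_scale` ordering used in the tree above. `period_name` and `spectrum_dir` are hypothetical helpers, not existing OSEkit functions.

```python
from datetime import datetime
from pathlib import PurePosixPath

TIME_FORMAT = "%Y-%m-%d_%H-%M-%S"


def period_name(t_start: datetime, t_stop: datetime) -> str:
    """Folder name encoding the time period, e.g. 2023-01-02_00-00-00__2023-01-02_12-00-00."""
    return f"{t_start.strftime(TIME_FORMAT)}__{t_stop.strftime(TIME_FORMAT)}"


def spectrum_dir(
    audio_length: int,
    sampling_frequency: int,
    t_start: datetime,
    t_stop: datetime,
    nfft: int,
    window_size: int,
    overlap: int,
    zoom_level: int,
    scale: str,
) -> PurePosixPath:
    """Spectrum output folder for one analysis in the draft structure."""
    return (
        PurePosixPath("dataset/output/spectrum")
        / f"{audio_length}_{sampling_frequency}"
        / period_name(t_start, t_stop)
        / f"{nfft}_{window_size}_{overlap}_{zoom_level}_{scale}"
    )


# Analysis A2 from the example
a2_dir = spectrum_dir(
    1_800, 128_000,
    datetime(2023, 1, 2, 0, 0, 0), datetime(2023, 1, 2, 12, 0, 0),
    1_024, 2_048, 50, 5, "log",
)
print(a2_dir)
```

With these names, two analyses only share a folder when every parameter matches.
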

## Remarks

There are still some flaws in this structure:

- Should LTAS be put in special directories, or just a spectrum directory with timestamps / file duration / sample rate that match the original data?
- Should all adjustment spectrograms be put in the same folder?
- The `metadata.csv` name is used several times for different purposes
  - In the original data folder, `file_metadata.csv` and `timestamp.csv` contain redundant information: keep only `file_metadata.csv`?
  - Replace `xxx_metadata.csv` files with `xxx.json` files that could be used for serializing python classes? (e.g., an `analysis_dataset.json` file in each analysis folder that can be parsed to a `Dataset` object in OSEkit).
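
To make the JSON idea concrete, here is a minimal sketch of a round trip between a Python class and a JSON string. `AnalysisMetadata` is a hypothetical class used for illustration only; it is not the OSEkit `Dataset` class, and the field list is an assumption based on the example parameters.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class AnalysisMetadata:
    """Hypothetical stand-in for the content of an analysis JSON file."""

    t_start: datetime
    t_stop: datetime
    audio_length: int  # seconds
    sampling_frequency: int  # Hz

    def to_json(self) -> str:
        raw = asdict(self)
        # datetime objects are not JSON serializable: store them as ISO 8601 strings
        raw["t_start"] = self.t_start.isoformat()
        raw["t_stop"] = self.t_stop.isoformat()
        return json.dumps(raw, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "AnalysisMetadata":
        raw = json.loads(text)
        raw["t_start"] = datetime.fromisoformat(raw["t_start"])
        raw["t_stop"] = datetime.fromisoformat(raw["t_stop"])
        return cls(**raw)


analysis_b = AnalysisMetadata(
    t_start=datetime(2023, 1, 3, 0, 0, 0),
    t_stop=datetime(2023, 1, 3, 12, 0, 0),
    audio_length=1_800,
    sampling_frequency=128_000,
)
serialized = analysis_b.to_json()
restored = AnalysisMetadata.from_json(serialized)
print(serialized)
print(restored == analysis_b)
```
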

# Draft new structure

- Takes the remarks above into account
- One directory per analysis (`dataset\audiolength_samplerate\tstart_tend\`), each corresponding to one call to the reshaper module
  - These directories include both the `data` and `output` folders
- Marks the original dataset explicitly (it could also hold analyses with their outputs, etc.)

````
dataset
    ├╴ 3600_128000_original
    │   └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ log
    │       └╴ output
    ├╴ 1800_128000
    │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   ├╴ analysis.json
    │   │   ├╴ data
    │   │   │   ├╴ audio
    │   │   │   │   ├╴ audio_1.wav
    │   │   │   │   ├╴ audio_2.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   └╴ audio.json
    │   │   │   └╴ auxiliary
    │   │   ├╴ output
    │   │   │   ├╴ 1024_4096_20_0_linear
    │   │   │   │   ├╴ spectrogram
    │   │   │   │   │   ├╴ spectrogram_1.png
    │   │   │   │   │   ├╴ spectrogram_2.png
    │   │   │   │   │   └╴ ...
    │   │   │   │   ├╴ matrix
    │   │   │   │   └╴ spectrum.json
    │   │   │   └╴ 1024_2048_50_5_log
    │   │   │       ├╴ spectrogram
    │   │   │       │   ├╴ spectrogram_1.png
    │   │   │       │   └╴ ...
    │   │   │       ├╴ matrix
    │   │   │       └╴ spectrum.json
    │   │   └╴ log
    │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ output
    │       │   └╴ 1024_4096_20_0_linear
    │       │       ├╴ spectrogram
    │       │       │   ├╴ spectrogram_1.png
    │       │       │   ├╴ spectrogram_2.png
    │       │       │   └╴ ...
    │       │       ├╴ matrix
    │       │       └╴ spectrum.json
    │       └╴ log
    └╴ 900_64000
        └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            ├╴ analysis.json
            ├╴ data
            │   ├╴ audio
            │   │   ├╴ audio_1.wav
            │   │   ├╴ audio_2.wav
            │   │   ├╴ ...
            │   │   └╴ audio.json
            │   └╴ auxiliary
            ├╴ output
            │   └╴ 1024_4096_20_0_linear
            │       ├╴ spectrogram
            │       │   ├╴ spectrogram_1.png
            │       │   ├╴ spectrogram_2.png
            │       │   └╴ ...
            │       ├╴ matrix
            │       └╴ spectrum.json
            └╴ log

````
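
As a rough illustration of how self-contained each analysis becomes, the sketch below creates the skeleton of two analysis directories in a temporary folder. `create_analysis_tree` and `SUBDIRS` are hypothetical names, and the subfolder list is taken from the tree above.

```python
import tempfile
from pathlib import Path

# Subfolders every analysis directory holds in the draft structure
SUBDIRS = ("data/audio", "data/auxiliary", "output", "log")


def create_analysis_tree(dataset: Path, audio_name: str, period_name: str) -> Path:
    """Create the skeleton of one analysis directory and return its root."""
    root = dataset / audio_name / period_name
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    root_a = create_analysis_tree(
        dataset, "1800_128000", "2023-01-02_00-00-00__2023-01-02_12-00-00"
    )
    root_b = create_analysis_tree(
        dataset, "1800_128000", "2023-01-03_00-00-00__2023-01-03_12-00-00"
    )
    # Collect every created directory, relative to the dataset root
    created = sorted(
        p.relative_to(dataset).as_posix() for p in dataset.rglob("*") if p.is_dir()
    )

for name in created:
    print(name)
```

Since each analysis owns its `data`, `output` and `log` folders, deleting or archiving an analysis amounts to handling a single directory.
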
