
Possible data quality issues in dataset 0052 ('raw_data/0052/0052_rtdata.tsv') #241

@keemwoojae

The file appears to contain duplicate entries, missing structure annotations, and a likely misalignment between compound names/RTs and PubChem fields.

Observed Issues

1. Possible annotation shift in later rows

Rows 0052_00105-0052_00137 look suspicious.
Their name and rt fields repeat compounds from earlier rows, but the structure annotations (formula, PubChem CID, InChIKey) appear to be shifted and belong to different compounds.

Example:

  • 0052_00105
    • name: Myo-inositol
    • rt: 1.1
    • formula: C22H18O11
    • PubChem CID: 65064
    • InChIKey: starts with WMBWRE...

These structure fields appear to correspond to (-)-Epigallocatechin gallate, not Myo-inositol.

Similar mismatches repeat through 0052_00137.
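
One way to check for this kind of mismatch is to group rows by compound name and flag any name that carries more than one formula. The sketch below is only an illustration: the column names `name` and `formula`, the helper `find_conflicting_annotations`, and the toy rows are assumptions, not copied from `0052_rtdata.tsv`.

```python
import csv
import io
from collections import defaultdict


def find_conflicting_annotations(rows, name_col="name", field_col="formula"):
    """Return {name: sorted distinct values} for names with more than one value.

    Names are compared case-insensitively; empty names or values are skipped.
    """
    seen = defaultdict(set)
    for row in rows:
        name = (row.get(name_col) or "").strip().lower()
        value = (row.get(field_col) or "").strip()
        if name and value:
            seen[name].add(value)
    return {name: sorted(values) for name, values in seen.items() if len(values) > 1}


# Toy rows for illustration only (hypothetical ids, not the real file contents).
SAMPLE_TSV = (
    "id\tname\trt\tformula\n"
    "toy_001\tMyo-inositol\t1.1\tC6H12O6\n"
    "toy_002\tMyo-inositol\t1.1\tC22H18O11\n"
    "toy_003\tCaffeine\t5.0\tC8H10N4O2\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
conflicts = find_conflicting_annotations(rows)
print(conflicts)
```

To run it against the real file, the toy string could be replaced with `open(path, newline="")`, and `field_col` changed to whatever the PubChem CID or InChIKey column is actually called.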

2. Duplicate-related comments are widespread

The comment column contains many duplicate-related notes:

  • doublet: 62 rows
  • removed another duplicate entry: 69 rows
  • potential duplicate: 4 rows
  • standardized from inchi; removed another duplicate entry: 1 row

It is unclear which entries are intended biological/analytical doublets and which should be removed or consolidated.
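
Counts like the ones above can be reproduced by tallying the distinct values of the comment column. This is a rough sketch: the column name `comment`, the helper `count_comments`, and the toy rows are assumptions for illustration, not the real data.

```python
import csv
import io
from collections import Counter


def count_comments(rows, comment_col="comment"):
    """Count how many rows carry each distinct, non-empty comment string."""
    counts = Counter()
    for row in rows:
        comment = (row.get(comment_col) or "").strip()
        if comment:
            counts[comment] += 1
    return counts


# Toy rows for illustration only (hypothetical ids, not the real file contents).
SAMPLE_TSV = (
    "id\tname\tcomment\n"
    "toy_001\tcompound A\tdoublet\n"
    "toy_002\tcompound A\tdoublet\n"
    "toy_003\tcompound B\tremoved another duplicate entry\n"
    "toy_004\tcompound C\t\n"
    "toy_005\tcompound D\tpotential duplicate\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
comment_counts = count_comments(rows)
for comment, n in comment_counts.most_common():
    print(f"{comment}: {n}")
```

Whole comment strings are counted as they stand, so a combined note such as "standardized from inchi; removed another duplicate entry" is tallied separately from its parts, matching the list above.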

Please Check

Could you please check whether:

  • rows 0052_00105-0052_00137 have shifted PubChem annotations, and
  • the "doublet", "potential duplicate", and "removed another duplicate entry" labels reflect the intended final state.

Thanks for maintaining RepoRT. This dataset is very useful for retention time prediction benchmarking.
