Avoid loading full datasets when generating benchmark dataset details #609

@R-Palazzo

Problem Description

When uploading benchmark results, SDGym generates dataset details that include row counts. For large datasets, computing these counts currently requires loading the full dataset into memory, which is slow and can crash the process by exhausting memory (this was observed with rel-bench datasets).

Expected behavior

SDGym should compute accurate row counts for dataset details without loading the full dataset into pandas or saving the dataset locally. This should make benchmark result uploads more reliable for large datasets while preserving the existing Dataset_Details.xlsx output.
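
One possible direction, sketched below under stated assumptions, is to count rows by streaming each table instead of materializing it. The sketch assumes the tables are CSV files inside a zip archive that can be opened as a file-like object; that layout, and the helper names `count_csv_rows` and `count_rows_in_zip`, are hypothetical and not taken from the SDGym code base. It uses only the Python standard library.

```python
import csv
import io
import zipfile


def count_csv_rows(binary_stream, encoding="utf-8"):
    """Count data rows (header excluded) in a CSV stream, one row at a time.

    Hypothetical helper: uses csv.reader so quoted fields containing
    newlines count as a single row, and skips blank lines.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    try:
        reader = csv.reader(text_stream)
        if next(reader, None) is None:  # empty file: no header, no rows
            return 0
        return sum(1 for row in reader if row)
    finally:
        # Release the underlying stream without closing it.
        text_stream.detach()


def count_rows_in_zip(zip_source):
    """Return {csv member name: row count} for a zip path or file-like object.

    Assumption: each table is stored as a '.csv' member of the archive.
    Members are decompressed as a stream and never written to disk.
    """
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as member:
                    counts[name] = count_csv_rows(member)
    return counts
```

Memory use stays bounded by the largest single row rather than the table size. If the datasets are stored as Parquet instead, the row count is available from the file footer metadata, so a different approach would fit better.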

Additional context

This should be tested with rel-bench datasets.
