Audit: align sample stages and complete trajectories #218

@neubig

Dataset quality audit category: Sample alignment and completeness issues

The May 2026 dataset audit found 27 issues in this class:

  • sample_alignment: 15
  • trajectory_completeness: 8
  • sample_representativeness: 4

Problem

Some datasets do not keep the same records across sample_raw.json, sample_std.json, and sample_sft.json. Others produce trajectories that end mid-action or combine multiple independent tasks into one trajectory. Both problems make the samples hard to inspect, regenerate, and trust.

Examples

  • android_in_the_wild: stage counts do not match; sample_raw.json has 3 records, but sample_std.json and sample_sft.json each have 1.
  • go-browse-wa: stage counts do not match; sample_raw.json has 100 records, but sample_std.json and sample_sft.json each have 5.
  • nemotron_terminal_corpus: stage counts do not match; sample_raw.json has 5 records, but sample_std.json and sample_sft.json each have 4.
  • androidcontrol: standardized ids (0, 20, 40) do not match SFT ids (androidcontrol-0, androidcontrol-1, androidcontrol-2).
  • CharlieDreemur_OpenManus-RL: three of five standardized trajectories end on an api_action such as perform_action or a weather API call without a following observation or final answer.
  • agenttuning_os: a second task is appended after a completed first task, making one ADP trajectory contain multiple independent OS problems.
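
The count and id mismatches listed above could be caught by a check along the following lines. This is a minimal sketch, not the repository's actual test code. It assumes each sample file holds a JSON list of records and that standardized and SFT records carry an "id" field; the function names alignment_problems and load_records and the id_key parameter are hypothetical. Raw records are compared by count only, since their id layout varies by dataset.

```python
import json
from pathlib import Path


def load_records(path):
    """Load one sample file; assumes it holds a JSON list of records."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a JSON list of records")
    return data


def alignment_problems(raw, std, sft, id_key="id"):
    """Return human-readable problems; an empty list means the stages align.

    id_key is an assumed field name, not confirmed by the issue text.
    """
    problems = []

    # All three stages should contain the same number of records.
    counts = {"raw": len(raw), "std": len(std), "sft": len(sft)}
    if len(set(counts.values())) > 1:
        problems.append(f"stage counts differ: {counts}")

    # Compare ids as strings and in order, so a reordered or renamed
    # record is reported as well as a missing one.
    std_ids = [str(r.get(id_key)) for r in std]
    sft_ids = [str(r.get(id_key)) for r in sft]
    if std_ids != sft_ids:
        problems.append(
            f"standardized ids {std_ids} do not match SFT ids {sft_ids}"
        )
    return problems
```

A per-dataset test could call load_records on the three sample files and assert that alignment_problems returns an empty list. Comparing id lists rather than id sets also flags a converter that changes record order.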

Suggested work

  • Add or strengthen tests requiring sample stage counts and ids to align across raw, standardized, and SFT files.
  • Ensure converters preserve the same record order through the pipeline.
  • Avoid dropping raw records silently during standardization or SFT conversion; if filtering is intentional, document and encode it deterministically.
  • Require trajectories to end with a plausible terminal state, final answer, or documented reason for truncation.
  • Split multiple independent tasks into separate trajectories where appropriate.
  • Keep representative samples small but broad enough to cover important action/observation edge cases.
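
The terminal-state requirement above could be checked with something like the sketch below. It rests on assumptions about the standardized schema that this issue does not confirm: that a trajectory stores its steps in a list under "content" and that each step names its kind under "class_". The set NON_TERMINAL_KINDS, the "truncation_note" field, and the function name dangling_action are hypothetical and only illustrate the idea.

```python
# Step kinds that expect a following observation or final answer.
# "api_action" appears in the issue; "code_action" is an assumed sibling kind.
NON_TERMINAL_KINDS = {"api_action", "code_action"}


def dangling_action(trajectory, steps_key="content", kind_key="class_",
                    note_key="truncation_note"):
    """Return a problem string if the trajectory ends mid-action, else None.

    All key names are assumptions about the standardized schema.
    """
    # A documented truncation reason is accepted as an explicit exemption.
    if trajectory.get(note_key):
        return None

    steps = trajectory.get(steps_key) or []
    if not steps:
        return "trajectory has no steps"

    last_kind = steps[-1].get(kind_key)
    if last_kind in NON_TERMINAL_KINDS:
        return (f"ends on {last_kind} without a following observation "
                "or final answer")
    return None
```

A test could run dangling_action over every record in sample_std.json and fail on any non-None result. Detecting multiple independent tasks inside one trajectory is harder to automate and is not attempted here.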
