Dataset quality audit category: Sample alignment and completeness issues
The May 2026 dataset audit found 27 issues in this class:
- sample_alignment: 15
- trajectory_completeness: 8
- sample_representativeness: 4
Problem
Some datasets do not keep the same records across sample_raw.json, sample_std.json, and sample_sft.json, while others produce trajectories that end mid-action or combine multiple independent tasks into one trajectory. This makes examples hard to inspect, regenerate, and trust.
Examples
- android_in_the_wild: stage counts do not match; sample_raw.json has 3 records, but sample_std.json and sample_sft.json have 1.
- go-browse-wa: stage counts do not match; sample_raw.json has 100 records, but sample_std.json and sample_sft.json have 5.
- nemotron_terminal_corpus: stage counts do not match; sample_raw.json has 5 records, but sample_std.json and sample_sft.json have 4.
- androidcontrol: standardized ids (0, 20, 40) do not match SFT ids (androidcontrol-0, androidcontrol-1, androidcontrol-2).
- CharlieDreemur_OpenManus-RL: three of five standardized trajectories end on an api_action such as perform_action or a weather API call without a following observation or final answer.
- agenttuning_os: a second task is appended after a completed first task, making one ADP trajectory contain multiple independent OS problems.
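Mismatches like the count and id examples above could be detected mechanically. The following is a minimal sketch, not the project's actual test: it assumes each sample file holds a JSON array of objects and that standardized and SFT records carry an `id` field. The function names `load_records`, `check_alignment`, and `check_dataset` are hypothetical.

```python
import json
from pathlib import Path

# File names taken from the audit; the directory layout is an assumption.
STAGES = ("sample_raw.json", "sample_std.json", "sample_sft.json")


def load_records(path):
    """Load one sample file. Assumes the file is a JSON array of objects."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a JSON array of records")
    return data


def check_alignment(stage_records, id_key="id"):
    """Return a list of problem descriptions; an empty list means aligned.

    stage_records maps a stage file name to its list of records.
    """
    problems = []

    # 1. Every stage should contain the same number of records.
    counts = {name: len(recs) for name, recs in stage_records.items()}
    if len(set(counts.values())) > 1:
        problems.append(f"stage counts differ: {counts}")

    # 2. Standardized and SFT ids should match, in the same order.
    #    Raw records are skipped here because they may not carry an id
    #    (an assumption; adjust if raw records do have stable ids).
    std_ids = [str(r.get(id_key)) for r in stage_records.get("sample_std.json", [])]
    sft_ids = [str(r.get(id_key)) for r in stage_records.get("sample_sft.json", [])]
    if std_ids != sft_ids:
        problems.append(f"standardized ids {std_ids} != SFT ids {sft_ids}")

    return problems


def check_dataset(dataset_dir):
    """Load all three stage files from a dataset directory and check them."""
    records = {name: load_records(Path(dataset_dir) / name) for name in STAGES}
    return check_alignment(records)
```

Comparing ids as ordered lists rather than sets also catches reordering between stages. If a dataset intentionally rewrites ids between stages, the comparison would need an explicit, documented mapping instead of exact equality.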
Suggested work
- Add or strengthen tests requiring sample stage counts and ids to align across raw, standardized, and SFT files.
- Ensure converters preserve the same record order through the pipeline.
- Avoid dropping raw records silently during standardization or SFT conversion; if filtering is intentional, document and encode it deterministically.
- Require trajectories to end with a plausible terminal state, final answer, or documented reason for truncation.
- Split multiple independent tasks into separate trajectories where appropriate.
- Keep representative samples small but broad enough to cover important action/observation edge cases.
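
The trajectory-ending requirement above could be checked along these lines. This is a sketch under stated assumptions, not the ADP schema: it assumes each trajectory is an object with a `steps` list whose entries carry a `type` field, and that an intentional truncation is recorded in a `truncation_reason` field. All three field names, and the function names, are hypothetical; only the `api_action` step type comes from the audit.

```python
def ends_cleanly(steps, pending_types=("api_action",), type_key="type"):
    """Return True if the last step is not an action still awaiting a result.

    An empty trajectory is treated as incomplete.
    """
    if not steps:
        return False
    return steps[-1].get(type_key) not in pending_types


def find_incomplete(trajectories, pending_types=("api_action",)):
    """Return ids of trajectories that end mid-action without a documented reason."""
    incomplete = []
    for traj in trajectories:
        # A documented truncation reason (hypothetical field) exempts the trajectory.
        if traj.get("truncation_reason"):
            continue
        if not ends_cleanly(traj.get("steps", []), pending_types):
            incomplete.append(traj.get("id"))
    return incomplete
```

A check like this only flags trajectories whose final step is a pending action; detecting several independent tasks merged into one trajectory would need dataset-specific logic.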