Dataset quality audit category: Schema/API modeling issues
The May 2026 dataset audit found 54 issues in this class:
- schema_extra_fields: 20
- api_argument_encoding: 16
- available_apis: 16
- api_surface_normalization: 2
Problem
Several datasets preserve fields that the current ADP standardized schema does not model, encode API arguments in lossy or inconsistent ways, or omit or overuse available_apis. Because the Pydantic models currently accept or silently drop many extra fields unless they explicitly forbid them, some of these issues pass validation while still losing structure or weakening downstream consumers.
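To make the validation gap concrete, here is a minimal sketch assuming Pydantic v2. LenientEvent and StrictEvent are hypothetical stand-ins, not the actual ADP models; the only difference between them is the extra setting.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientEvent(BaseModel):
    # Hypothetical stand-in for a standardized event. Pydantic v2 defaults to
    # extra="ignore", so unknown keys are dropped without raising an error.
    content: str


class StrictEvent(BaseModel):
    # Same field, but unknown keys now raise a ValidationError.
    model_config = ConfigDict(extra="forbid")
    content: str


raw = {"content": "open drawer 1", "reward": 0.5, "reasoning_content": "drawer is closed"}

# Validation succeeds, but the two non-schema keys are gone from the result.
lenient = LenientEvent.model_validate(raw)
dropped = sorted(set(raw) - set(lenient.model_dump()))

try:
    StrictEvent.model_validate(raw)
    rejected = []
except ValidationError as exc:
    # The first element of each error's "loc" tuple names the offending key.
    rejected = sorted(err["loc"][0] for err in exc.errors())

print("silently dropped:", dropped)
print("rejected by strict model:", rejected)
```

With the lenient model the sample validates and the reward and reasoning_content values disappear; with the strict model the same sample fails and names both keys.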
Examples
- CharlieDreemur_OpenManus-RL: sample_std.json contains non-schema event keys, including 74 reward fields and 36 reasoning_content fields embedded inside content events.
- SALT-NLP_SWE-chat: sample_std.json contains non-schema event keys, including 26 reward fields and 15 reasoning_content fields.
- allenai_Sera-4.6-Lite-T2: sample_std.json contains non-schema event keys, including 165 reward fields and 81 reasoning_content fields.
- agenttuning_alfworld: API kwargs contain extra literal quotes, for example "location": "\"shelf 1\"".
- SALT-NLP_SWE-chat: samples include api_action calls but omit top-level available_apis, even though source metadata includes tool and transcript information.
- dolci_instruct_sft_tool_use: function names preserve heterogeneous upstream naming such as leaguepowerrankingrounds and weather_forecast_weather_api, producing an inconsistent API surface.
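The literal-quote problem in the agenttuning_alfworld example can be repaired mechanically. The sketch below uses only the standard library; strip_literal_quotes and normalize_kwargs are hypothetical helper names, and the code deliberately unwraps only values that are a single well-formed JSON string, leaving everything else untouched.

```python
import json
from typing import Any


def strip_literal_quotes(value: Any) -> Any:
    """Unwrap one layer of JSON string quoting, e.g. '"shelf 1"' -> 'shelf 1'."""
    if isinstance(value, str) and len(value) >= 2 and value[0] == value[-1] == '"':
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            return value  # not a single well-formed JSON string; leave as-is
        if isinstance(decoded, str):
            return decoded
    return value


def normalize_kwargs(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Apply the unwrapping to every top-level kwarg value."""
    return {key: strip_literal_quotes(val) for key, val in kwargs.items()}


print(normalize_kwargs({"location": "\"shelf 1\""}))
```

Being conservative here is a design choice: a value such as "3" or "true" may legitimately be a string argument, so the sketch does not coerce it to a number or boolean.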
Suggested work
- Decide whether standardized ADP models should use extra="forbid" for Trajectory, actions, and observations.
- Add negative tests proving unknown standardized fields fail validation, if strictness is desired.
- For fields such as rewards and reasoning traces, either map them into supported ADP fields or intentionally preserve them under details with documented semantics.
- Normalize API kwargs so structured values are represented as native JSON, not stringified/literal-quoted values.
- Clarify and enforce when available_apis should be present, especially for datasets whose raw source includes explicit tool availability.
- For heterogeneous upstream tool names, decide whether dataset-local APIs should normalize names or preserve raw names with mapping metadata.
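Two of the items above, preserving extra fields under details and checking available_apis against actual usage, might look roughly like the following sketch. It works on plain dicts, and the key names ("class_", "function", "content", and available_apis as a list of function names) are assumptions for illustration; the real ADP schema may differ.

```python
from typing import Any

# Assumed non-schema keys, taken from the audit examples.
EXTRA_KEYS = ("reward", "reasoning_content")


def move_extras_to_details(event: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the event with non-schema keys relocated under details."""
    kept = {k: v for k, v in event.items() if k not in EXTRA_KEYS}
    extras = {k: event[k] for k in EXTRA_KEYS if k in event}
    if extras:
        # Merge with any existing details rather than overwriting them.
        kept["details"] = {**(event.get("details") or {}), **extras}
    return kept


def undeclared_apis(trajectory: dict[str, Any]) -> set[str]:
    """Functions called by api_action events but missing from available_apis."""
    declared = set(trajectory.get("available_apis") or [])
    called = {
        event["function"]
        for event in trajectory.get("content", [])
        if event.get("class_") == "api_action"
    }
    return called - declared


sample = {
    "content": [
        {"class_": "api_action", "function": "goto", "kwargs": {"location": "shelf 1"}},
        {"class_": "text_observation", "content": "You arrive at shelf 1.", "reward": 0.0},
    ]
}
print(undeclared_apis(sample))
print(move_extras_to_details(sample["content"][1]))
```

A check like undeclared_apis could run as a dataset-level test, so that a sample with api_action calls and no top-level available_apis fails early instead of passing validation.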