Audit: tighten schema and API modeling across datasets

Dataset quality audit category: **Schema/API modeling issues**

The May 2026 dataset audit found **54** issues in this class:

- `schema_extra_fields`: 20
- `api_argument_encoding`: 16
- `available_apis`: 16
- `api_surface_normalization`: 2

## Problem

Several datasets have at least one of three problems: they preserve fields that the current ADP standardized schema does not model, they encode API arguments in lossy or inconsistent ways, or they omit or overuse `available_apis`. Because the Pydantic models currently accept input that carries extra fields, and by default silently drop those fields unless a model explicitly forbids them, some of these issues pass validation while still losing structure or weakening downstream consumers.
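To make the validation gap concrete, here is a minimal sketch assuming Pydantic v2. `LenientEvent` and `StrictEvent` are hypothetical stand-ins, not the actual ADP models; the only difference between them is the `extra` setting.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientEvent(BaseModel):
    # Hypothetical stand-in for an ADP event model (not the real class).
    # With Pydantic's default settings, unknown keys are ignored.
    content: str


class StrictEvent(BaseModel):
    # Same shape, but unknown keys are rejected at validation time.
    model_config = ConfigDict(extra="forbid")
    content: str


raw = {"content": "ok", "reward": 1.0}

# Validation succeeds, but the extra key does not survive a round trip.
lenient_dump = LenientEvent.model_validate(raw).model_dump()

try:
    StrictEvent.model_validate(raw)
    strict_rejected = False
except ValidationError:
    strict_rejected = True
```

In this sketch the lenient model validates and then loses `reward`, while the strict model surfaces the mismatch as a `ValidationError`.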

## Examples

- `CharlieDreemur_OpenManus-RL`: `sample_std.json` contains non-schema event keys, including 74 `reward` fields and 36 `reasoning_content` fields embedded inside content events.
- `SALT-NLP_SWE-chat`: `sample_std.json` contains non-schema event keys, including 26 `reward` fields and 15 `reasoning_content` fields.
- `allenai_Sera-4.6-Lite-T2`: `sample_std.json` contains non-schema event keys, including 165 `reward` fields and 81 `reasoning_content` fields.
- `agenttuning_alfworld`: API kwargs contain extra literal quotes, for example `"location": "\"shelf 1\""`.
- `SALT-NLP_SWE-chat`: samples include `api_action` calls but omit top-level `available_apis`, even though source metadata includes tool and transcript information.
- `dolci_instruct_sft_tool_use`: function names preserve heterogeneous upstream naming such as `leaguepowerrankingrounds` and `weather_forecast_weather_api`, producing an inconsistent API surface.
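The literal-quote case from `agenttuning_alfworld` could be handled along these lines. This is a sketch only: `strip_literal_quotes` is a hypothetical helper, and it deliberately touches nothing except string values that are themselves a complete JSON string literal.

```python
import json
from typing import Any


def strip_literal_quotes(value: Any) -> Any:
    """Remove one layer of literal JSON quoting from string values.

    Hypothetical helper: recurses through dicts and lists and leaves
    every other value unchanged.
    """
    if isinstance(value, dict):
        return {key: strip_literal_quotes(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_literal_quotes(item) for item in value]
    if isinstance(value, str) and len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            # Not a single well-formed JSON string literal; keep as-is.
            return value
        if isinstance(decoded, str):
            return decoded
    return value


cleaned = strip_literal_quotes({"location": "\"shelf 1\"", "count": 2})
```

The helper is conservative by design: it does not try to parse strings such as `"1"` or `"[1, 2]"` into numbers or lists, because whether those should become native JSON depends on the API's declared argument types.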

## Suggested work

- Decide whether standardized ADP models should use `extra="forbid"` for `Trajectory`, actions, and observations.
- Add negative tests proving unknown standardized fields fail validation, if strictness is desired.
- For fields such as rewards and reasoning traces, either map them into supported ADP fields or intentionally preserve them under `details` with documented semantics.
- Normalize API kwargs so structured values are represented as native JSON, not stringified/literal-quoted values.
- Clarify and enforce when `available_apis` should be present, especially for datasets whose raw source includes explicit tool availability.
- For heterogeneous upstream tool names, decide whether dataset-local APIs should normalize names or preserve raw names with mapping metadata.
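If the decision is to preserve rewards and reasoning traces under `details`, the conversion step might look like the sketch below. The allowed key set and the placement of `details` on each event are assumptions made for illustration; the real ADP field names and the level at which `details` lives may differ.

```python
from typing import Any

# Assumed set of modeled event keys; replace with the real schema's fields.
ALLOWED_EVENT_KEYS = {"class_", "source", "content", "details"}


def relocate_extras(event: dict[str, Any], allowed: set[str]) -> dict[str, Any]:
    """Return a copy of `event` with unknown keys moved under `details`."""
    kept = {key: value for key, value in event.items() if key in allowed}
    extras = {key: value for key, value in event.items() if key not in allowed}
    if extras:
        # Merge into any existing details rather than overwriting them.
        details = dict(kept.get("details") or {})
        details.update(extras)
        kept["details"] = details
    return kept


event = {
    "class_": "text_observation",
    "content": "hi",
    "reward": 0.5,
    "reasoning_content": "think",
}
relocated = relocate_extras(event, ALLOWED_EVENT_KEYS)
```

Running a step like this before validation would let `extra="forbid"` be enabled without discarding the source information, provided the semantics of the relocated keys are documented.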

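One way to enforce the `available_apis` rule is a consistency check between the API calls a sample makes and the APIs it declares. The sketch below assumes that a sample is a dict with a `content` list of events, that API calls are marked with `class_ == "api_action"` and carry a `function` name, and that `available_apis` is a top-level list of names. All of these shapes are assumptions, not confirmed ADP structure.

```python
from typing import Any


def undeclared_apis(sample: dict[str, Any]) -> list[str]:
    """List function names that are called but not declared in `available_apis`."""
    declared = set(sample.get("available_apis") or [])
    used = {
        event["function"]
        for event in sample.get("content", [])
        if event.get("class_") == "api_action" and "function" in event
    }
    return sorted(used - declared)


sample = {
    "content": [
        {"class_": "api_action", "function": "search", "kwargs": {}},
        {"class_": "text_observation", "content": "result"},
    ]
}
missing = undeclared_apis(sample)
```

A non-empty result flags samples like those in `SALT-NLP_SWE-chat`, where `api_action` calls appear without a matching top-level declaration.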

Audit: tighten schema and API modeling across datasets #216
