Audit: tighten schema and API modeling across datasets #216

@neubig

Dataset quality audit category: Schema/API modeling issues

The May 2026 dataset audit found 54 issues in this class:

  • schema_extra_fields: 20
  • api_argument_encoding: 16
  • available_apis: 16
  • api_surface_normalization: 2

Problem

Several datasets preserve fields that the current ADP standardized schema does not model, encode API arguments in lossy or inconsistent ways, or omit or overuse available_apis. Because Pydantic accepts or silently drops extra fields unless a model explicitly forbids them, some of these issues pass validation while still losing structure or weakening downstream consumers.
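
To make the validation gap concrete, here is a minimal sketch assuming Pydantic v2. The model names LenientEvent and StrictEvent and the helper is_rejected are hypothetical and are not the actual ADP classes. It shows an unknown reward field being dropped without error under the default configuration and rejected under extra="forbid".

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientEvent(BaseModel):
    # Hypothetical stand-in for an event model with default settings,
    # under which unknown fields are ignored rather than rejected.
    content: str


class StrictEvent(BaseModel):
    # Same hypothetical model, but unknown fields now fail validation.
    model_config = ConfigDict(extra="forbid")
    content: str


def is_rejected(model_cls: type[BaseModel], payload: dict) -> bool:
    """Return True if validating the payload raises a ValidationError."""
    try:
        model_cls.model_validate(payload)
    except ValidationError:
        return True
    return False


payload = {"content": "hi", "reward": 1.0}

# The lenient model validates, but the reward value is silently lost.
lenient_dump = LenientEvent.model_validate(payload).model_dump()

# The strict model refuses the same payload.
strict_rejects = is_rejected(StrictEvent, payload)
```

If strictness is adopted, a check like is_rejected could serve as the basis for a negative test on each standardized model.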

Examples

  • CharlieDreemur_OpenManus-RL: sample_std.json contains non-schema event keys, including 74 reward fields and 36 reasoning_content fields embedded inside content events.
  • SALT-NLP_SWE-chat: sample_std.json contains non-schema event keys, including 26 reward fields and 15 reasoning_content fields.
  • allenai_Sera-4.6-Lite-T2: sample_std.json contains non-schema event keys, including 165 reward fields and 81 reasoning_content fields.
  • agenttuning_alfworld: API kwargs contain extra literal quotes, for example "location": "\"shelf 1\"".
  • SALT-NLP_SWE-chat: samples include api_action calls but omit top-level available_apis, even though source metadata includes tool and transcript information.
  • dolci_instruct_sft_tool_use: function names preserve heterogeneous upstream naming such as leaguepowerrankingrounds and weather_forecast_weather_api, producing an inconsistent API surface.
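
Counts like the reward and reasoning_content figures above could be reproduced with a small audit helper. The following is a sketch only: ALLOWED_EVENT_KEYS is a hypothetical allow-list, and the real set of permitted keys would have to come from the ADP schema itself.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Hypothetical allow-list, for illustration only. The real set of
# permitted event keys should be derived from the ADP schema.
ALLOWED_EVENT_KEYS = frozenset({"class_", "content", "source"})


def count_extra_keys(
    events: Iterable[Mapping[str, Any]],
    allowed: frozenset = ALLOWED_EVENT_KEYS,
) -> Counter:
    """Count how often each key outside the allow-list appears in events."""
    counts: Counter = Counter()
    for event in events:
        counts.update(key for key in event if key not in allowed)
    return counts


sample_events = [
    {"content": "step 1", "reward": 1.0},
    {"content": "step 2", "reward": 0.0, "reasoning_content": "because"},
    {"content": "step 3"},
]
extra = count_extra_keys(sample_events)
```

On the three sample events, extra reports two reward keys and one reasoning_content key, while the clean third event contributes nothing.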

Suggested work

  • Decide whether standardized ADP models should use extra="forbid" for Trajectory, actions, and observations.
  • Add negative tests proving unknown standardized fields fail validation, if strictness is desired.
  • For fields such as rewards and reasoning traces, either map them into supported ADP fields or intentionally preserve them under details with documented semantics.
  • Normalize API kwargs so structured values are represented as native JSON, not stringified/literal-quoted values.
  • Clarify and enforce when available_apis should be present, especially for datasets whose raw source includes explicit tool availability.
  • For heterogeneous upstream tool names, decide whether dataset-local APIs should normalize names or preserve raw names with mapping metadata.
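
As one possible shape for the kwargs normalization item, the sketch below decodes string values that are themselves JSON-encoded, such as the "\"shelf 1\"" case from agenttuning_alfworld. The function names are hypothetical. The rule of decoding only values that start with a quote, brace, or bracket is an assumption chosen to avoid turning plain strings like "42" into numbers. It is not an established ADP convention.

```python
import json
from typing import Any


def normalize_kwarg(value: Any) -> Any:
    """Decode a string holding a JSON string, object, or array.

    Any other value, including plain strings, is returned unchanged.
    """
    if not isinstance(value, str):
        return value
    stripped = value.strip()
    # Assumed heuristic: only text that looks quoted or structured is decoded.
    if not stripped or stripped[0] not in '"{[':
        return value
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        # Not valid JSON after all, so keep the original value.
        return value


def normalize_kwargs(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Apply normalize_kwarg to every value of an API call's kwargs."""
    return {key: normalize_kwarg(val) for key, val in kwargs.items()}


cleaned = normalize_kwargs({"location": "\"shelf 1\"", "count": 2})
```

Here cleaned["location"] becomes the plain string shelf 1 and the integer count is left as-is. Each dataset would still need review, since some upstream tools may legitimately expect quoted strings.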
