Audit: fix SFT roles and action/source representation issues

Dataset quality audit category: **Conversation/action representation issues**

The May 2026 dataset audit found **53** issues in this class:

- `sft_format_or_role`: 17
- `role_or_source_mapping`: 15
- `action_representation`: 15
- `sft_placeholder`: 6

## Problem

Several datasets flatten structured behavior into plain text, map environment observations to the wrong source, or assign SFT roles inconsistently. These issues are especially risky because downstream SFT consumers may train on incorrect assistant/tool boundaries.

## Examples

- `agenttuning_alfworld`: root `sample_sft.json` marks plain acknowledgements such as `OK. I'll follow...` as `from: "function_call"` even though they contain no function-call syntax.
- `agenttuning_db`: root SFT sample marks all assistant messages as `function_call` messages without function-call syntax.
- `agenttuning_mind2web`: root SFT sample uses `function_call` for final-choice text without an actual function call.
- `agenttuning_alfworld`: all 94 standardized text observations are marked `source: "user"`, including environment responses immediately after API actions such as `You pick up...` and `On the shelf...`.
- `agenttuning_db`: SQL operations are not represented as `CodeAction` or `ApiAction`; SQL is embedded in assistant text or omitted entirely.
- `androidcontrol`: root `sample_sft.json` is a placeholder conversation and is not derived from the standardized mobile trajectories.

## Suggested work

- Ensure SFT messages containing actual function-call syntax use `from: "function_call"`, and plain assistant text does not.
- Fix converters rather than hand-editing generated sample JSON.
- Audit `TextObservation.source` mapping for user/environment/agent boundaries, especially after tool/API calls.
- Represent executable commands, SQL, browser actions, and API calls with `CodeAction` or `ApiAction` where the raw data supports it.
- Replace placeholder root `sample_sft.json` files with pipeline-derived SFT samples.
- Add tests that detect function-call roles without function-call syntax, and function-call syntax outside `from: "function_call"`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Audit: fix SFT roles and action/source representation issues #217

Problem

Examples

Suggested work

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Audit: fix SFT roles and action/source representation issues #217

Description

Problem

Examples

Suggested work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions