Benchmarks

DM-Code-Agent has two benchmark suites:

  • coding: compact hidden-test coding tasks.
  • maintenance: repository-maintenance tasks that mimic real fixes more closely.

Both suites create a temporary workspace, let the agent inspect and edit files, inject hidden tests after the agent finishes, and score the run by executable behavior.

Commands

List coding tasks:

dm-agent-bench --list

List maintenance tasks:

dm-agent-bench --suite maintenance --list

Run one task:

dm-agent-bench --suite maintenance --provider deepseek --task config_precedence

Write reports:

dm-agent-bench --suite maintenance \
  --provider deepseek \
  --output bench_reports/maintenance.json \
  --markdown bench_reports/maintenance.md \
  --trace-dir bench_reports/traces

Each JSON report includes a manifest block with task fingerprints and a suite signature. Task fingerprints include hidden-test content and changed-file constraints, so reports can detect suite drift without exposing hidden tests.
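One way such fingerprints could be built is sketched below. This is an illustration, not the project's implementation: the function names, the payload layout, and the choice of SHA-256 are all assumptions. The point is that a digest changes whenever hidden-test content or changed-file constraints change, while the digest itself reveals nothing about the tests.

```python
import hashlib
import json


def task_fingerprint(task_id, hidden_test_source, allowed_changed_files, required_changed_files):
    """Hash everything that defines a task contract (hypothetical payload layout)."""
    payload = json.dumps(
        {
            "task_id": task_id,
            "hidden_test_source": hidden_test_source,
            "allowed_changed_files": sorted(allowed_changed_files),
            "required_changed_files": sorted(required_changed_files),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def suite_signature(fingerprints):
    """Combine per-task fingerprints into one digest that ignores dict ordering."""
    joined = "\n".join(f"{task_id}:{fp}" for task_id, fp in sorted(fingerprints.items()))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Illustrative task data only; the real hidden tests are not shown anywhere.
fingerprints = {
    "config_precedence": task_fingerprint(
        "config_precedence", "def test_env_wins(): ...", ["config.py"], []
    ),
}
print(suite_signature(fingerprints))
```

Only the hex digests would need to appear in a report, so two reports can be compared for drift without either one carrying the hidden tests.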

When --trace-dir is enabled, each run's metadata also includes compact trace_analysis fields: primary/final failure stage, recovery, verification gap, and trace-health grade. These fields are advisory debugging metadata and do not affect hidden-test scoring.

Opt-in adaptive replanning and local token accounting:

dm-agent-bench --suite maintenance \
  --provider deepseek \
  --enable-adaptive-replanning \
  --enable-repeated-failure-policy-experiment \
  --max-replans 3 \
  --cost-per-1k-tokens 0.00027 \
  --output bench_reports/maintenance.json \
  --markdown bench_reports/maintenance.md

Generate an offline economics table from existing JSON reports:

dm-agent-economics bench_reports/maintenance.json \
  --label maintenance-deepseek \
  --output-json bench_reports/economics.json \
  --output-md bench_reports/economics.md

dm-agent-economics never runs a model, downloads a dataset, or queries live pricing. Prices are explicit inputs for local accounting. When source benchmark reports include pass-rate confidence intervals, the economics Markdown carries those intervals into the pass-rate column. When multiple input reports carry different manifest.suite_signature values, the economics summary and Markdown emit a warning because cost/pass-rate rankings may not be comparable.

Compare two benchmark manifests before comparing scores:

dm-agent-manifest-diff bench_reports/baseline.json bench_reports/experiment.json

The manifest diff is offline-only. It exits with 0 when suite signatures, task fingerprints, and variant names all match, and with 1 when any of them differ, which means the reports come from different task contracts and their scores should not be compared directly.
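The comparison can be pictured with the following minimal sketch. It assumes each manifest is a dict with suite_signature, task_fingerprints, and variant_names keys; only suite_signature is named in this document, so the other two key names are hypothetical, as are the function names.

```python
def manifest_diff(baseline, experiment):
    """Return a list of human-readable differences between two manifest dicts."""
    differences = []
    if baseline.get("suite_signature") != experiment.get("suite_signature"):
        differences.append("suite_signature differs")

    # "task_fingerprints" and "variant_names" are assumed key names.
    base_fps = baseline.get("task_fingerprints", {})
    exp_fps = experiment.get("task_fingerprints", {})
    for task_id in sorted(set(base_fps) | set(exp_fps)):
        if base_fps.get(task_id) != exp_fps.get(task_id):
            differences.append(f"task fingerprint differs: {task_id}")

    base_variants = sorted(baseline.get("variant_names", []))
    exp_variants = sorted(experiment.get("variant_names", []))
    if base_variants != exp_variants:
        differences.append("variant names differ")
    return differences


def exit_code(differences):
    """Map the diff to the exit status described above: 0 on match, 1 otherwise."""
    return 0 if not differences else 1


baseline = {"suite_signature": "abc", "task_fingerprints": {"t1": "f1"}, "variant_names": ["default"]}
experiment = {"suite_signature": "abd", "task_fingerprints": {"t1": "f2"}, "variant_names": ["default"]}
for line in manifest_diff(baseline, experiment):
    print(line)
```

Listing each mismatch, instead of returning a single boolean, makes it easier to see whether one task drifted or the whole suite changed.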

Default-off v2 plumbing for coding/maintenance benchmark experiments:

dm-agent-bench --suite maintenance \
  --enable-rag \
  --rag-top-k 5 \
  --enable-critic \
  --self-consistency-runs 3 \
  --self-consistency-strategy test_pass

RAG builds a local BM25 index for each candidate workspace. Critic review uses the same configured LLM client as the main run unless future code supplies a separate client. Self-consistency creates fresh workspaces per candidate and then selects by majority vote, critic score, or test pass. These features are disabled by default and are not used by CI live runs.
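As an illustration of the majority-vote strategy, the sketch below picks the most common candidate key and records how decisive the vote was. It is an assumption-laden sketch, not the project's code: the function name, the returned field names, and the use of an opaque string key per candidate (for example a patch digest or the final-answer text) are hypothetical.

```python
from collections import Counter


def majority_vote(candidate_keys):
    """Select the most common key and report simple vote statistics."""
    if not candidate_keys:
        raise ValueError("at least one candidate is required")

    counts = Counter(candidate_keys)
    ranked = counts.most_common()  # equal counts keep first-seen order
    winner, support = ranked[0]
    runner_up_support = ranked[1][1] if len(ranked) > 1 else 0

    return {
        "selected": winner,
        "vote_distribution": dict(counts),
        "selected_support": support,
        "support_fraction": support / len(candidate_keys),
        "tie": support == runner_up_support,
        "margin_to_runner_up": support - runner_up_support,
    }


# Three candidates: two produced the same patch, one produced a different one.
result = majority_vote(["patch-a", "patch-b", "patch-a"])
print(result["selected"], result["selected_support"], result["tie"])
```

When the vote is tied, this sketch falls back to the first candidate seen; a real selector might instead defer to critic score or test results.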

SWE-bench Lite self-consistency is intentionally blocked while real SWE-bench evaluation is frozen.

Maintenance Suite

The maintenance suite currently includes:

  • config_precedence: config precedence and type coercion.
  • patch_summary_name_status: git diff --name-status parsing for run reports.
  • retry_regression_tests: retry policy fix with required regression-test changes.
  • safe_workspace_join: path traversal protection for workspace file access.
  • cross_file_user_contract: cross-file API contract repair for a serializer/model pair.
  • cli_config_docs_contract: multi-file CLI/docs/test consistency repair for configuration documentation.
  • packaging_ci_contract: multi-file packaging metadata and CI workflow repair with required regression-test updates.

These tasks are intentionally closer to repository upkeep than puzzle-style algorithms. They include hidden tests, edge cases, and changed-file constraints.

Scoring

A run is successful only if:

  1. The agent reports successful completion.
  2. Hidden tests pass.
  3. The task's changed-file constraints are satisfied.
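The three conditions combine with a logical AND, which a short sketch makes explicit. The RunOutcome class and its field names are hypothetical; the rate names mirror the report fields, but how the real tool computes them is not shown here.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    # Hypothetical field names; the real report schema may differ.
    agent_completed: bool
    hidden_tests_passed: bool
    constraints_satisfied: bool


def run_is_successful(outcome: RunOutcome) -> bool:
    """A run counts as a pass only when all three conditions hold."""
    return (
        outcome.agent_completed
        and outcome.hidden_tests_passed
        and outcome.constraints_satisfied
    )


def summarize(outcomes):
    """Compute the three headline rates over a non-empty list of runs."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no runs to summarize")
    return {
        "overall_pass_rate": sum(run_is_successful(o) for o in outcomes) / total,
        "overall_hidden_test_pass_rate": sum(o.hidden_tests_passed for o in outcomes) / total,
        "overall_agent_completion_rate": sum(o.agent_completed for o in outcomes) / total,
    }


runs = [
    RunOutcome(True, True, True),    # full pass
    RunOutcome(True, True, False),   # tests pass, but a file constraint was violated
    RunOutcome(False, True, True),   # tests pass, but the agent did not report completion
    RunOutcome(True, False, True),   # agent claimed success, hidden tests failed
]
print(summarize(runs))
```

Keeping the hidden-test rate and the completion rate separate from the overall pass rate shows which condition is costing runs.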

The report includes:

  • overall_pass_rate
  • overall_pass_rate_ci_95
  • overall_hidden_test_pass_rate
  • overall_hidden_test_pass_rate_ci_95
  • overall_agent_completion_rate
  • overall_agent_completion_rate_ci_95
  • average steps
  • average tool calls
  • average changed files
  • estimated tokens
  • estimated cost and cost per success when --cost-per-1k-tokens is provided
  • provider request count
  • per-run changed files
  • optional per-run trace paths
  • hidden test stdout/stderr tail
  • agent metadata such as replan, parse repair, and tool error counts
  • adaptive replanning metadata when enabled: signal kind, selected strategy, skipped replans, and replan budget exhaustion
  • repeated-failure policy experiment metadata when explicitly enabled: loop-breaking strategy counts for repeated action/error signatures
  • RAG / critic / self-consistency configuration metadata when those default-off switches are used
  • self-consistency uncertainty metadata when multiple candidates are run: vote distribution, selected support, support fraction, tie detection, margin to runner-up, and confidence label
  • self-consistency patch fingerprints when file edits are available, so majority voting can group equivalent workspace changes before falling back to final-answer text
  • manifest provenance: task ids, per-task fingerprints, variant names, and suite signature
  • compact trace analysis when --trace-dir is enabled
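The cost fields in the list above can be read as simple arithmetic on the token estimate. The sketch below assumes cost is tokens divided by 1000 times the --cost-per-1k-tokens value, and that cost per success is undefined when no run succeeds; both function names and both conventions are assumptions, not confirmed behavior.

```python
def estimated_cost(total_tokens, cost_per_1k_tokens):
    """Convert an estimated token count into a cost using a per-1k-token price."""
    return total_tokens / 1000 * cost_per_1k_tokens


def cost_per_success(total_cost, successes):
    """Spread the total cost over successful runs; None when there were none."""
    if successes == 0:
        return None
    return total_cost / successes


# Illustrative numbers only: 200,000 estimated tokens at the price used in the
# command example earlier in this page, with 4 successful runs.
cost = estimated_cost(200_000, 0.00027)
print(cost, cost_per_success(cost, 4))
```

Dividing by successes, not by total runs, penalizes configurations that spend tokens on runs that fail.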

Pass-rate confidence intervals use Wilson 95% intervals. They are computed from the runs already in the report and do not increase the default repeat count.
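For reference, the standard Wilson score interval for a binomial proportion can be computed as in the sketch below. The function name and the handling of zero trials are assumptions; the formula itself is the textbook one, with z set to the 97.5th percentile of the standard normal for a 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a pass rate; z defaults to the 95% level."""
    if trials == 0:
        # Assumed convention: with no runs, nothing is known about the rate.
        return (0.0, 1.0)

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return (max(0.0, center - half_width), min(1.0, center + half_width))


low, high = wilson_interval(7, 10)
print(f"7/10 passes -> [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative at 0% or 100% pass rates, which matters for small run counts.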

Changed-File Constraints

BenchmarkTask supports:

  • allowed_changed_files: files the agent may change.
  • required_changed_files: files the agent must change.

These constraints let a task judge how the agent reached a passing state, not only whether the hidden tests pass. A task can require the agent to add regression tests, or fail a run that edits unrelated files to game the score.
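A minimal sketch of how the two constraints might be checked against the files a run actually changed follows. The function name is hypothetical, and two semantics are assumed, not documented: an allowed list of None means no restriction, and required files are expected to appear in the allowed list when one is given.

```python
def check_changed_files(changed, allowed=None, required=None):
    """Return a list of constraint violations; an empty list means the run complies."""
    changed_set = set(changed)
    violations = []

    # Assumed semantics: allowed=None places no restriction on edited files.
    if allowed is not None:
        for path in sorted(changed_set - set(allowed)):
            violations.append(f"unexpected change: {path}")

    for path in sorted(set(required or []) - changed_set):
        violations.append(f"missing required change: {path}")

    return violations


# Illustrative paths only: the agent fixed the code but skipped the regression
# test and touched an unrelated file.
print(
    check_changed_files(
        changed=["retry.py", "setup.cfg"],
        allowed=["retry.py", "tests/test_retry.py"],
        required=["tests/test_retry.py"],
    )
)
```

Reporting every violation at once, instead of stopping at the first, gives a clearer picture of why a run with passing hidden tests still failed.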

Design Direction

Future benchmark work should add:

  • more multi-file refactors with behavior-preserving hidden tests
  • documentation/CLI consistency tasks
  • CI and packaging repair tasks
  • trace completeness checks
  • richer repeated-sample variance summaries beyond binomial confidence intervals
  • cross-model comparison tables
  • cost-per-success economics across existing reports