Benchmark and optimize AGENTS.md and SKILL.md for Codex.
CodexOpt is a lightweight CLI for benchmarking and optimizing Codex instruction assets.
It focuses on two repo-local files:
- `AGENTS.md`
- `.codex/skills/**/SKILL.md`
- Documentation: superagenticai.github.io/CodexOpt
- Demo repository: github.com/SuperagenticAI/codexopt-demo
- PyPI package: pypi.org/project/codexopt
- Docs source: docs/
CodexOpt gives teams a repeatable workflow to:
- Scan instruction files.
- Benchmark quality.
- Generate optimized candidates.
- Apply only improvements that clear a score threshold.
- Produce a report.
Most teams edit AGENTS.md and SKILL.md manually, but struggle to answer:
- Did quality actually improve?
- Did we increase prompt bloat?
- Did we break skill frontmatter conventions?
CodexOpt turns these edits into measurable runs with artifacts you can inspect and version.
- Project scan with issue detection for agents and skills.
- Benchmark scoring with sub-scores and natural-language feedback.
- Optional evidence inputs from repo task files and issue exports.
- Optimization engine `heuristic` (default, local, and deterministic).
- Optional optimization engine `gepa` (via `gepa.optimize_anything`).
- Explicit reporting when a GEPA-requested run falls back to heuristic optimization.
- Safe apply flow with automatic backups.
- Markdown reporting from latest runs.
- Minimal OSS CI (lint, test, build).
Requirements:

- Python `>=3.10`
- `uv` (recommended) or `pip`

Install with `uv`:

```
uv sync --extra dev
```

Run commands through the managed environment:

```
uv run codexopt --help
```

`uv.lock` is committed to keep dependency resolution reproducible across machines and CI.

Alternatively, install with `pip`:

```
pip install -e ".[dev]"
```

Quick start:

```
# 1) Create config
uv run codexopt init

# 2) Inspect what will be evaluated
uv run codexopt scan

# 3) Get baseline scores
uv run codexopt benchmark

# 4) Optimize AGENTS.md
uv run codexopt optimize agents --file AGENTS.md

# 5) Optimize skills
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"

# 6) Review apply impact without writing
uv run codexopt apply --kind agents --dry-run

# 7) Apply selected improvements
uv run codexopt apply --kind agents

# 8) Generate markdown summary
uv run codexopt report --output codexopt-report.md
```

Developers use CodexOpt in the repository that contains their Codex instruction assets:
- `AGENTS.md`
- `.codex/skills/**/SKILL.md`
Optional evidence can also be added to improve benchmarking and optimization quality:
- task files (`tasks.md`, task lists, or JSON fixtures)
- issue/review exports (`issues.md` or JSON exports)
Typical workflow:
- Run `scan` and `benchmark` to measure the current instruction assets.
- Run `optimize agents` and `optimize skills` to generate improved candidates.
- Review the generated diffs and report artifacts under `.codexopt/runs/`.
- Run `apply --dry-run` first, then apply accepted changes.
- Commit the updated instruction files and, if useful, attach the report to a PR.
Example with optional evidence configured in `codexopt.yaml`:

```yaml
evidence:
  task_files:
    - tasks.md
  issue_files:
    - issues.md
```

With that config in place, benchmark and optimize use:
- static prompt-quality checks
- repo task alignment
- recurring issue/review themes
Today, task and issue files influence scoring and feedback. CodexOpt does not yet execute full agent task simulations.
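As a rough illustration of how task evidence can influence scoring, a repo-alignment signal might be approximated as keyword overlap between an instruction file and the configured task files. This is a hypothetical sketch, not CodexOpt's actual metric:

```python
import re


def task_alignment(instructions: str, task_text: str) -> float:
    """Toy repo-alignment signal: fraction of task keywords that the
    instruction file mentions. Illustrative only."""

    def words(s: str) -> set[str]:
        # Keep words of 4+ letters to skip stopwords like "the"/"and".
        return set(re.findall(r"[a-z]{4,}", s.lower()))

    task_words = words(task_text)
    if not task_words:
        return 0.0
    return len(task_words & words(instructions)) / len(task_words)


agents = "Always run tests and update the changelog before release."
tasks = "- add release checklist\n- run tests in CI"
score = task_alignment(agents, tasks)  # 2 of 3 task keywords covered
```

A real implementation would weight terms and handle JSON fixtures, but the shape is the same: evidence files shift the score rather than drive task execution.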
Use `codexopt.example.yaml` as a starting point for committed team config.
Global config flag:

```
codexopt --config <path-to-codexopt.yaml> <command>
```

Create a default config file:

```
codexopt init [--path PATH] [--force]
```

Discover AGENTS/SKILL targets and validate shape:

```
codexopt scan
```

Score current files using built-in heuristics:

```
codexopt benchmark
```

Optimize AGENTS files:

```
codexopt optimize agents \
  [--file PATTERN] \
  [--engine heuristic|gepa] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]
```

Optimize SKILL files:

```
codexopt optimize skills \
  [--glob PATTERN] \
  [--engine heuristic|gepa] \
  [--reflection-model MODEL] \
  [--max-metric-calls N]
```

Apply best candidates from the latest optimization run (or a provided run id):

```
codexopt apply [--kind agents|skills] [--run-id RUN_ID] [--dry-run]
```

Generate a markdown report from the latest runs in state:

```
codexopt report [--output FILE.md]
```

Default `codexopt.yaml`:
```yaml
version: 1
targets:
  agents_files:
    - AGENTS.md
    - "**/AGENTS.md"
    - "**/AGENTS.override.md"
  skills_globs:
    - ".codex/skills/**/SKILL.md"
    - "**/.codex/skills/**/SKILL.md"
  exclude_globs:
    - ".git/**"
    - ".codexopt/**"
    - ".venv/**"
    - "node_modules/**"
    - "reference/**"
output:
  root_dir: ".codexopt"
evidence:
  task_files: []
  issue_files: []
optimization:
  engine: "heuristic"
  min_apply_delta: 0.01
  max_metric_calls: 60
  reflection_model: null
```

Config notes:

- `targets.agents_files`: glob patterns for AGENTS targets.
- `targets.skills_globs`: glob patterns for `SKILL.md` targets.
- `targets.exclude_globs`: paths ignored during scan.
- `output.root_dir`: location of run artifacts and backups.
- `evidence.task_files`: optional markdown/JSON task lists used for repo-alignment scoring.
- `evidence.issue_files`: optional markdown/JSON issue or review exports used for theme-aware feedback.
- `optimization.engine`: default optimization engine.
- `optimization.min_apply_delta`: minimum score gain required to apply.
- `optimization.max_metric_calls`: GEPA metric budget.
- `optimization.reflection_model`: required when using the GEPA engine.
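To picture how the include/exclude globs interact during a scan, here is a minimal stand-in built on `pathlib` and `fnmatch`. The function name and behavior are hypothetical, not CodexOpt's implementation:

```python
from fnmatch import fnmatch
from pathlib import Path


def discover_targets(root: str, include: list[str], exclude: list[str]) -> list[str]:
    """Toy scan step: collect files matching any include glob,
    then drop anything matching an exclude glob. Illustrative only."""
    found = set()
    for pattern in include:
        for path in Path(root).glob(pattern):
            rel = path.relative_to(root).as_posix()
            if not any(fnmatch(rel, ex) for ex in exclude):
                found.add(rel)
    return sorted(found)
```

Note that `**/AGENTS.md` also matches a top-level `AGENTS.md`, which is why the default config lists both patterns explicitly for clarity.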
CodexOpt computes a 0.0 to 1.0 score per file.
AGENTS scoring factors include:
- Too short or too long content penalties.
- Token-heaviness estimate penalty.
- Empty file penalty.
- Contradictory guidance penalties.
- Missing workflow / verification / output-format guidance penalties.
- Repo-context and task-alignment signals when evidence files are configured.
SKILL scoring factors include:
- Missing frontmatter penalties.
- Missing `name`/`description` penalties.
- Overly long frontmatter field penalties.
- Too short or too long content penalties.
- Weak trigger/workflow/verification guidance penalties.
- Repo task alignment signals when evidence files are configured.
Each benchmarked file also includes:
- criterion-level sub-scores
- natural-language feedback
- optional evidence summary from configured task/issue files
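The penalty-based shape of this scoring can be sketched as follows. The penalty names, thresholds, and weights here are invented for illustration and do not match CodexOpt's internals:

```python
def score_agents(content: str) -> tuple[float, dict]:
    """Toy penalty scorer: start at 1.0 and subtract weighted penalties.
    Checks and weights are illustrative, not CodexOpt's."""
    penalties: dict[str, float] = {}
    text = content.strip()
    if not text:
        penalties["empty_file"] = 0.9
    elif len(text) < 40:
        penalties["too_short"] = 0.3
    elif len(text) > 20_000:
        penalties["too_long"] = 0.2
    lower = text.lower()
    if "verif" not in lower and "test" not in lower:
        penalties["missing_verification"] = 0.15
    if "workflow" not in lower and "step" not in lower:
        penalties["missing_workflow"] = 0.1
    score = max(0.0, 1.0 - sum(penalties.values()))
    return round(score, 2), penalties


score, subs = score_agents(
    "## Workflow\nRun tests before commit. Keep diffs minimal and verify output."
)
```

The per-penalty dictionary plays the role of criterion-level sub-scores: it tells you *why* a file scored low, not just that it did.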
Candidate transforms include:
- Whitespace normalization.
- Blank-line compaction.
- Duplicate adjacent line removal.
- Skill-specific frontmatter synthesis/trimming.
The best candidate is selected by score delta. If the delta is below `min_apply_delta`, the original content is kept.
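Two of the transforms above, plus the `min_apply_delta` gate, can be sketched like this. The helper names are hypothetical and the logic is a simplification of what any heuristic engine of this kind might do:

```python
import re


def heuristic_candidate(text: str) -> str:
    """Toy transform: collapse runs of blank lines, then drop duplicate
    adjacent non-blank lines. Illustrative, not CodexOpt's exact code."""
    text = re.sub(r"\n{3,}", "\n\n", text)
    out: list[str] = []
    for line in text.splitlines():
        if not out or line != out[-1] or not line.strip():
            out.append(line)
    return "\n".join(out)


def maybe_apply(original: str, candidate: str, old_score: float,
                new_score: float, min_apply_delta: float = 0.01) -> str:
    """Keep the original unless the candidate improves the score enough."""
    return candidate if (new_score - old_score) >= min_apply_delta else original


before = "Run tests.\nRun tests.\n\n\n\nKeep changes minimal."
after = heuristic_candidate(before)
```

The gate is what makes apply safe to automate: a marginal rewrite with a 0.005 gain is rejected under the default `min_apply_delta` of 0.01.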
CodexOpt can call `gepa.optimize_anything` when `--engine gepa` is selected.
The GEPA path is model-agnostic: teams can use any reflection model supported by their GEPA/LiteLLM setup, including OpenAI, Gemini, local models, or other compatible providers. That means you can have GEPA generate feedback and candidate improvements with whichever model gives the best quality/cost tradeoff for your workflow.
Requirements:
- `gepa` installed in the environment.
- A valid reflection model via `--reflection-model` or config.
Common examples:
```yaml
optimization:
  engine: "gepa"
  reflection_model: "openai/gpt-5-mini"
```

```yaml
optimization:
  engine: "gepa"
  reflection_model: "gemini/gemini-2.5-pro"
```

For OpenAI-backed GEPA runs, set:

```
export OPENAI_API_KEY="your-openai-key"
```

For Gemini-backed GEPA runs, set:

```
export GEMINI_API_KEY="your-gemini-key"
export GOOGLE_API_KEY="$GEMINI_API_KEY"
```

Fallback behavior:
- If GEPA is unavailable or errors, CodexOpt falls back to heuristic optimization.
- Fallbacks are recorded in optimization artifacts, CLI summaries, and reports.
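The fallback pattern is a plain try/except around the GEPA path, with the fallback recorded in the run artifact. In this sketch, `run_gepa` and `run_heuristic` are hypothetical callables standing in for the two engines; this is not CodexOpt's actual control flow or GEPA's API:

```python
def optimize_with_fallback(content: str, run_gepa, run_heuristic) -> dict:
    """Toy fallback wrapper: try the GEPA engine, fall back to heuristic
    optimization on any error, and record that it happened."""
    try:
        return {"engine": "gepa", "candidate": run_gepa(content), "fallback": False}
    except Exception as err:
        return {
            "engine": "heuristic",
            "candidate": run_heuristic(content),
            "fallback": True,
            "fallback_reason": str(err),  # surfaced in summaries/reports
        }


def broken_gepa(_content: str) -> str:
    raise RuntimeError("gepa not installed")


result = optimize_with_fallback("  text  ", broken_gepa, lambda s: s.strip())
```

Recording `fallback_reason` in the artifact is what lets CLI summaries and reports state explicitly that a GEPA-requested run actually ran the heuristic engine.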
By default, everything is written under `.codexopt/`:

- `runs/<run_id>/scan.json`
- `runs/<run_id>/benchmark.json`
- `runs/<run_id>/optimize.json`
- `runs/<run_id>/apply.json`
- `backups/<timestamp>/...` (created on non-dry-run apply)
- `state.json` (tracks the latest run ids per command type)
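The backup step on a non-dry-run apply can be pictured with a small sketch like this; the helper name and exact directory layout are assumptions, not CodexOpt's code:

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_before_apply(files: list[str], root: str = ".codexopt") -> Path:
    """Toy backup step: copy each file into backups/<timestamp>/,
    preserving relative paths, before it is overwritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(root) / "backups" / stamp
    for f in files:
        target = dest / f
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, target)
    return dest
```

Because backups are plain copies keyed by timestamp, reverting an apply is just copying the backed-up file over the modified one.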
Run ids are timestamped and namespaced by command kind, for example:

- `20260308T184800123456Z-benchmark`
- `20260308T184812654321Z-optimize-skills`
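Generating an id of that shape is straightforward; a minimal sketch (the function is hypothetical, but it matches the timestamped, kind-namespaced format shown above):

```python
from datetime import datetime, timezone


def make_run_id(kind: str) -> str:
    """Toy run-id generator: UTC timestamp with microseconds,
    suffixed with the command kind so ids sort chronologically."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    return f"{stamp}-{kind}"


run_id = make_run_id("optimize-skills")
```

Microsecond precision keeps ids unique even when commands run back to back, and the lexicographic sort order matches chronological order.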
- Commit current `AGENTS.md` and skills.
- Run `scan` and `benchmark` to establish a baseline.
- Run `optimize agents` and/or `optimize skills`.
- Review `optimize.json` and diffs.
- Run `apply --dry-run` first, then `apply`.
- Run `report` and attach the report to the PR.
Before (`AGENTS.md`):

```markdown
## Coding Rules

Always run tests before commit.

Always run tests before commit.


Keep changes minimal.
```

After optimization (heuristic):

```markdown
## Coding Rules

Always run tests before commit.

Keep changes minimal.
```

What changed:
- Removed duplicate adjacent line.
- Compacted extra blank lines.
Before (`.codex/skills/my_skill/SKILL.md`):

```markdown
Use this skill for repository release checks.
Run lint, tests, and changelog validation.
```

After optimization (heuristic):

```markdown
---
name: my-skill
description: Repository-specific workflow skill.
---

Use this skill for repository release checks.
Run lint, tests, and changelog validation.
```

What changed:

- Added the required frontmatter block.
- Generated a normalized `name` from the folder name.
- Added a default `description`.
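The frontmatter synthesis step can be sketched as a small string transform. This is an illustrative stand-in (the function name and default description are assumptions), but it shows how a normalized `name` falls out of the folder name:

```python
import re


def synthesize_frontmatter(skill_dir: str, body: str) -> str:
    """Toy frontmatter synthesis: derive a kebab-case `name` from the
    skill folder and prepend a minimal YAML frontmatter block."""
    name = re.sub(r"[^a-z0-9]+", "-", skill_dir.lower()).strip("-")
    return (
        "---\n"
        f"name: {name}\n"
        "description: Repository-specific workflow skill.\n"
        "---\n\n"
        f"{body}"
    )


result = synthesize_frontmatter(
    "my_skill", "Use this skill for repository release checks.\n"
)
```

The normalization (lowercase, non-alphanumerics collapsed to hyphens) is what turns the folder `my_skill` into the `my-skill` value seen in the example above.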
```
uv run codexopt init
uv run codexopt scan
uv run codexopt benchmark
uv run codexopt optimize agents --file AGENTS.md
uv run codexopt optimize skills --glob ".codex/skills/**/SKILL.md"
uv run codexopt apply --kind skills --dry-run
uv run codexopt apply --kind skills
uv run codexopt report --output codexopt-report.md
```

Files to inspect after running:

- `.codexopt/runs/*/scan.json`
- `.codexopt/runs/*/benchmark.json`
- `.codexopt/runs/*/optimize.json`
- `.codexopt/runs/*/apply.json`
- `.codexopt/backups/*`
A GitHub Actions workflow is included at `.github/workflows/ci.yml` and runs:

- `uv lock --check` for lockfile consistency.
- `uv sync --extra dev` for environment setup.
- Ruff lint checks.
- Pytest tests.
- Package build (`uv build`).
It does not publish packages.
To reproduce the CI checks locally:

```
uv lock
uv sync --extra dev
uv run --no-sync ruff check src tests
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
uv build
```

Cause:

- No prior optimization run for the selected kind.
- `state.json` does not contain the expected latest run pointer.
Fix:
```
uv run codexopt optimize agents
uv run codexopt apply --kind agents
```

Or pass an explicit run:

```
uv run codexopt apply --kind agents --run-id <run_id>
```

Cause:

- `gepa` is not installed, or `reflection_model` is missing.
Behavior:
- CodexOpt falls back to heuristic optimization when GEPA errors.
Fix:
```
uv run codexopt optimize agents --engine gepa --reflection-model <model_name>
```

Expected behavior:

- `--dry-run` reports candidate applications without writing files.
To write changes, run again without `--dry-run`:

```
uv run codexopt apply --kind agents
```

If your environment blocks dependency resolution in isolated builds, use:

```
uv build
```

Some environments auto-load global pytest plugins that can break local tests. Run with plugin autoload disabled:

```
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --no-sync pytest -q
```

Cause:
- Best candidate delta is below `optimization.min_apply_delta`, or
- File content is already equivalent.

Fix:

- Lower `optimization.min_apply_delta` in `codexopt.yaml`, then re-run optimize/apply.
MIT. See LICENSE.
- Shashi (shashi@super-agentic.ai)
