A coding protocol for agents that need proof, not vibes.
For Hermes, OpenClaw, Claude Code, Codex CLI, and any agent that can read files, edit code, run tests, delegate work, and check its own claims.
Coding agents are quick. That is the fun part, and also the dangerous part. AgentForge Protocol keeps the speed, but forces the work to leave evidence behind.
Agents can write a patch before they understand the repo. They can produce a plan that sounds tidy but proves nothing. They can spin up subagents, accept their reports, and quietly ship a mess.
Anyone who has used these tools for real work has seen some version of this.
AgentForge Protocol is a small operating routine for avoiding that failure mode. It tells the agent when to stay lightweight, when to slow down, when to write tests first, when to debug instead of guessing, and when to bring in subagents without letting them drive the car.
The point is simple: every meaningful step should leave evidence behind.
This version also borrows the useful parts of architecture-driven governance: read the baseline first, frame the impact before editing, separate the fix lane from the retirement lane, and keep checkpoints for long work so the task does not drift.
| If the task is... | The protocol does this |
|---|---|
| tiny and obvious | inspect, patch, run the cheapest useful check, stop |
| a clear behavior change | write the failing test first, then make it pass |
| vague or architectural | inspect first, grill the open decisions, save a plan |
| multi-step | split tasks, use focused subagents, review in stages |
| a bug or test failure | reproduce, trace root cause, add a regression test |
| uncertain | spike it before it becomes production architecture |
`agentforge-protocol` sits on top of a few smaller skills and decides which one should lead.
| Skill | Job |
|---|---|
| karpathy-guidelines | small diffs, fewer assumptions, less cleverness |
| grill-plan | decisions before code when the task is vague or risky |
| writing-plans | clear requirements turned into executable steps |
| test-driven-development | behavior changes with a failing test before the fix |
| systematic-debugging | root cause before patches |
| subagent-driven-development | split work without losing control of the result |
| requesting-code-review | final gate before commit, push, or ship |
| spike | disposable experiments when guessing is worse than building |
It uses Hermes' own layers instead of inventing extra paperwork:
- current progress goes to the `todo` tool
- non-trivial plans go to `.hermes/plans/`
- stable user or environment facts go to memory
- repeatable procedures and traps become skills
- project-local `tasks/lessons.md` is used only when the repo already works that way
```shell
git clone https://github.com/Yat-mo/agentforge-protocol.git
mkdir -p ~/.hermes/skills/software-development
cp -R agentforge-protocol/skills/software-development/agentforge-protocol \
  ~/.hermes/skills/software-development/
```

Start a fresh Hermes session so the skill loader picks it up:

```shell
hermes --skills agentforge-protocol
```

Or load it inside Hermes:

```
/skill agentforge-protocol
```
```
Use agentforge-protocol. Add email validation to the signup flow.

Use agentforge-protocol. Design and implement workspace-level permissions.

Use agentforge-protocol. The export job passes locally but fails in CI with a timezone assertion.

Use agentforge-protocol. Spike whether we can stream partial PDF extraction results to the UI.
```
The first job is to classify the work. A typo does not need a ceremony. A migration does.
Tiny obvious edit
Keep it light.
```
inspect → minimal patch → cheap verification → stop
```
No forced plan. No subagents. No theatre.
Clear behavior change
Use TDD unless there is a real reason not to.
```
read existing pattern
→ define baseline / hypothesis / success / failure / evidence plan
→ track fix lane + retirement lane when needed
→ write failing test
→ run RED
→ implement minimal code
→ run GREEN
→ run relevant regression
```
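As a concrete sketch of the RED → GREEN loop, here is a hypothetical email-validation change; the `validate_email` helper and test names are illustrative, not part of the protocol:

```python
import re

def validate_email(address: str) -> bool:
    # Minimal implementation, written only after the tests below had
    # been seen to fail (RED) against an earlier stub.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", address) is not None

def test_rejects_address_without_domain():
    # Written first, before the implementation existed.
    assert not validate_email("user@")

def test_accepts_plain_address():
    assert validate_email("user@example.com")

test_rejects_address_without_domain()
test_accepts_plain_address()
```

The order is the point: the failing test pins the behavior before the code exists, so GREEN actually means something.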
Ambiguous or architectural work
Use grill-plan before touching production code.
```
inspect code/docs/tests/logs
→ ask only what cannot be inspected
→ resolve decisions one by one
→ save `.hermes/plans/...md`
→ review plan
→ implement from the plan
```
Multi-task implementation
Use subagents, but keep the main agent responsible.
```
read saved plan once
→ extract tasks
→ implementer subagent per task
→ spec compliance review
→ code quality review
→ integration review
→ final verification
→ pre-commit gate
```
Bug or test failure
Debug first. Patch second.
```
read full error
→ reproduce
→ inspect recent changes
→ trace data flow
→ form one hypothesis
→ write regression test
→ fix root cause
→ verify
```
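For the CI timezone failure used as an example earlier, this flow might end in a regression test like the sketch below. The `export_timestamp` helper is hypothetical; the hypothesis is that the export used host-local time, so the root-cause fix pins UTC:

```python
from datetime import datetime, timedelta, timezone

def export_timestamp(dt: datetime) -> str:
    # Root-cause fix: normalize to UTC instead of relying on the
    # host timezone, which differed between the laptop and CI.
    return dt.astimezone(timezone.utc).isoformat()

def test_export_is_utc_regardless_of_host_zone():
    # Regression test: an aware timestamp in any zone must serialize as UTC.
    ny = timezone(timedelta(hours=-5))
    stamped = export_timestamp(datetime(2024, 1, 1, 7, 0, tzinfo=ny))
    assert stamped == "2024-01-01T12:00:00+00:00"

test_export_is_utc_regardless_of_host_zone()
```

A patch that only changed the assertion would have hidden the symptom; the test above fails against the old local-time behavior on any host that is not UTC.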
Feasibility unknown
Spike it. Do not turn uncertainty into production architecture.
```
decompose feasibility questions
→ test highest risk first
→ build disposable prototype
→ record VALIDATED / PARTIAL / INVALIDATED
→ only then plan production work
```
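A spike result can be recorded as a tiny structured note so the later production plan can cite it. A minimal sketch, with illustrative field names and an example verdict tied to the PDF-streaming prompt above:

```python
from dataclasses import dataclass

VERDICTS = {"VALIDATED", "PARTIAL", "INVALIDATED"}

@dataclass
class SpikeResult:
    question: str   # the feasibility question the spike tested
    verdict: str    # VALIDATED / PARTIAL / INVALIDATED
    evidence: str   # what was actually observed, not what was hoped

    def __post_init__(self):
        if self.verdict not in VERDICTS:
            raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")

result = SpikeResult(
    question="Can we stream partial PDF extraction results to the UI?",
    verdict="PARTIAL",
    evidence="per-page streaming worked; table layout needed the full document",
)
```

Rejecting unknown verdicts on purpose keeps "probably fine" from sneaking into the record.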
For non-trivial production code, write down the expectations before coding:
## Pre-coding expectations
### Baseline read set
The source of truth, architecture boundaries, owners, impact surface, compatibility constraints, and verification entry points to inspect before editing.
### Hypothesis
What I believe is true about the system and why this change should work.
### Success criteria
The checks that would make me comfortable saying this is done.
### Failure signals
Independent signs that the approach is wrong or unsafe.
These cannot just be "the success criteria did not pass".
### Ablations and expected observations
What I expect to see if a meaningful assumption or approach changes.
### Evidence plan
The fresh evidence that will support the final claim: tests, commands, logs, API responses, screenshots, or diff review results.
### Minimal verification path
The cheapest test, command, API call, UI action, or log check that proves the change.

This is the part that keeps the agent honest. Not fancy, just useful.
Non-trivial plans live under `.hermes/plans/` and use action → verification steps.
```markdown
# <Task> Implementation Plan

## Goal
## Non-goals
## Context discovered from code/docs/logs

## Pre-coding expectations
### Baseline read set
### Hypothesis
### Success criteria
### Failure signals
### Ablations and expected observations
### Evidence plan
### Minimal verification path

## Confirmed decisions
## Rejected alternatives
## Fix lane and retirement lane
## Checkpoint, resume hint, and drift check

## Implementation steps
1. <Action> -> verify: <check>
2. <Action> -> verify: <check>
3. <Action> -> verify: <check>

## Files likely to change
## Tests and validation
## Risks and rollback
## Review notes
```

A plan that says "make it work" is not a plan. Each step needs a way to prove itself.
Subagents are useful. They are also very good at sounding confident.
Use them like this:
| Rule | Why it matters |
|---|---|
| one subagent gets one focused task | broad prompts create vague work |
| include exact paths, commands, constraints, and expected output | fresh context needs real context |
| implementer subagents do not commit | the main agent owns the final state |
| spec review happens before code quality review | first ask if we built the right thing |
| the main agent verifies side effects | self-reports are not proof |
Good subagent roles: repository scout, implementation worker, spec compliance reviewer, code quality reviewer, debugging investigator, integration reviewer.
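The "one subagent, one focused task" rule can be made mechanical by building the prompt from the required fields, so a task without paths or an expected output simply cannot be dispatched. A sketch with illustrative names, not a Hermes API:

```python
from dataclasses import dataclass

@dataclass
class SubagentTask:
    goal: str
    paths: list[str]          # exact files the subagent may touch
    commands: list[str]       # how the subagent verifies its own work
    constraints: list[str]
    expected_output: str      # what the report back must contain

    def prompt(self) -> str:
        # Fresh context needs real context: everything is spelled out.
        return "\n".join([
            f"Goal: {self.goal}",
            "Files: " + ", ".join(self.paths),
            "Verify with: " + "; ".join(self.commands),
            "Constraints: " + "; ".join(self.constraints),
            f"Report back: {self.expected_output}",
            "Do not commit; the main agent owns the final state.",
        ])

task = SubagentTask(
    goal="Add email validation to the signup handler",
    paths=["src/signup.py", "tests/test_signup.py"],
    commands=["pytest tests/test_signup.py"],
    constraints=["no new dependencies", "keep the diff small"],
    expected_output="diff summary plus the pytest output",
)
```

The no-commit line is baked into every prompt because it is the rule most often violated by confident-sounding workers.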
Before the agent says "done", check the boring stuff:
- tests or smoke checks actually ran
- if tests could not run, the reason is clear
- the diff is small and tied to the request
- there is no unrelated refactor or formatting drift
- the change did not leave behind orphan imports, files, configs, or TODOs
- fresh evidence is named, not implied
- logs, API responses, UI behavior, or test output support the claim
- bug fixes, refactors, and contract changes resolve both fix lane and retirement lane, or state remaining risk
- long or high-risk tasks have checkpoint, resume hint, and drift check
- risky changes got independent review
- reusable lessons were saved in the right place
Before commit, push, ship, or PR:
```
targeted tests
→ broader tests where reasonable
→ git diff / git status
→ secret and local-data scan
→ independent review when risk is meaningful
→ commit only after verification passes
```
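The secret-scan step of the gate can start as something very small: scan the staged diff for obvious credential shapes and block on any hit. A minimal sketch; the patterns are illustrative, not exhaustive, and a real gate would use a dedicated scanner:

```python
import re

# Illustrative credential shapes.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key
    re.compile(r"(?i)(api[_-]?key|token)\s*[:=]\s*['\"][^'\"]{12,}"),
]

def scan_diff(diff_text: str) -> list[str]:
    # Return the patterns that matched; an empty list means the gate passes.
    return [p.pattern for p in SECRET_PATTERNS if p.search(diff_text)]
```

In practice this runs over the output of `git diff --cached`, and any hit stops the commit until a human looks.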
```
skills/
└── software-development/
    └── agentforge-protocol/
        └── SKILL.md
```
The repo is intentionally small. It ships one workflow skill, not a framework.
Good agentic coding is not about making the model slower.
It is about making the model harder to fool.
Harder to fool with vague requirements. Harder to fool with tests that pass but prove nothing. Harder to fool with a plausible subagent report. Harder to fool with a patch that hides the symptom. Harder to fool with a big diff that feels productive.
Small when small is enough. Systematic when the work can hurt you.
MIT