Phase 3: Hermes-MCP server (hermes.execute_task as MCP tool) #117

## Context

The architectural decision from the May session: **Agent-as-MCP-tool, NOT LLM-caller-replacement**. The xiaozhi-server LLM (or any MCP-host LLM) keeps handling the fast, cheap turns itself; expensive agentic loops become explicit tool calls into Hermes (or OpenClaw, #118).

This issue is the **first proof** that M3's runtime-neutrality thesis works. Per [`milestones-roadmap.md` §4](https://github.com/litentry/agentKeys/blob/main/docs/spec/plans/milestones-roadmap.md), M3 needs 3+ runtimes proving the same AgentKeys backend serves them all. Hermes is the first; OpenClaw (#118) is the second; Doubao (already in M2's #112) is the third.

NousResearch [hermes-agent](https://github.com/nousresearch/hermes-agent) is MIT-licensed and battle-tested, with a self-improving learning loop, Honcho user modeling, and FTS5 session search. Wrapping it as an MCP tool means any MCP host can invoke a full agentic runtime without having to embed it.

## Scope (M3)

### Deploy NousResearch Hermes-agent

- One instance (single-region for v0); scale-out in M4 if vendor pilots demand it
- Persistent storage for Hermes' session state (SQLite or Postgres; whichever the upstream defaults to)
- Hermes connects to the AgentKeys MCP server (#107) as an MCP client and uses AgentKeys tools internally for memory, identity, and audit

### MCP server wrapping Hermes

- One tool: `hermes.execute_task(task, context, constraints)`
- Tool signature per [`agent-iam-strategy.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/agent-iam-strategy.md) "Hermes-as-MCP" discussion:

```
hermes.execute_task(
  task: string,
  context: {
    actor_omni: string,
    session_id: string,
    memory_namespaces: string[],
  },
  constraints: {
    max_duration_s: number,
    max_cost_usd: number,
    tools_allowed: string[],
  }
) → {
  result: string,
  steps_taken: number,
  cost_usd: number,
  audit_trail_id: string,
}
```

- Auth: same per-vendor Bearer + `X-AgentKeys-Actor` header pattern as #107
- Cost tracking: every Hermes step that calls an LLM increments the cost counter; constraint enforced server-side
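
The signature above could be mirrored inside the wrapper as plain Python dataclasses. This is a minimal sketch, not the actual implementation: the class names (`TaskContext`, `TaskConstraints`, `TaskResult`) and the `parse_request` helper are hypothetical, and rejecting non-positive limits is an assumption that goes beyond what the signature states.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TaskContext:
    actor_omni: str
    session_id: str
    memory_namespaces: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class TaskConstraints:
    max_duration_s: float
    max_cost_usd: float
    tools_allowed: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: non-positive limits are rejected, not treated as "unlimited".
        if self.max_duration_s <= 0:
            raise ValueError("max_duration_s must be positive")
        if self.max_cost_usd <= 0:
            raise ValueError("max_cost_usd must be positive")


@dataclass(frozen=True)
class TaskResult:
    result: str
    steps_taken: int
    cost_usd: float
    audit_trail_id: str


def parse_request(payload: dict[str, Any]) -> tuple[str, TaskContext, TaskConstraints]:
    """Turn raw tool-call arguments into typed objects (hypothetical helper)."""
    task = payload["task"]
    if not isinstance(task, str) or not task.strip():
        raise ValueError("task must be a non-empty string")
    context = TaskContext(**payload["context"])
    constraints = TaskConstraints(**payload["constraints"])
    return task, context, constraints
```

Validating at the boundary keeps malformed constraints from ever reaching Hermes, which matters because the constraints are meant to be enforced server-side.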

### Recursive composition

Hermes-agent uses AgentKeys MCP tools internally:
- Read memory for context → `agentkeys.memory.get`
- Check permissions for actions → `agentkeys.permission.check`
- Append audit rows for each step → `agentkeys.audit.append`

This creates a **two-layer audit trail**: AgentKeys records "Hermes invoked"; Hermes-side audit records each step Hermes took inside the run. Both surface in #115's audit dashboard.
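
One way to picture the two layers is a shared trail ID that ties the outer "Hermes invoked" row to the inner per-step rows. The sketch below is illustrative only: `InMemoryAuditLog`, `AuditRow`, and `run_with_audit` are hypothetical names, the in-memory list stands in for whatever store backs `agentkeys.audit.append`, and linking both layers through a single `audit_trail_id` is an assumption.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class AuditRow:
    trail_id: str
    layer: str   # "agentkeys" for the outer row, "hermes" for per-step rows
    event: str


@dataclass
class InMemoryAuditLog:
    """Stand-in for the real audit store; only here to make the sketch runnable."""
    rows: list[AuditRow] = field(default_factory=list)

    def append(self, trail_id: str, layer: str, event: str) -> None:
        self.rows.append(AuditRow(trail_id, layer, event))

    def trail(self, trail_id: str) -> list[AuditRow]:
        return [row for row in self.rows if row.trail_id == trail_id]


def run_with_audit(log: InMemoryAuditLog, steps: list[str]) -> str:
    """Record one outer row plus one row per Hermes step; return the trail ID."""
    trail_id = str(uuid.uuid4())
    # Layer 1: AgentKeys records that Hermes was invoked.
    log.append(trail_id, "agentkeys", "hermes.execute_task invoked")
    # Layer 2: Hermes records each step it takes inside the run.
    for step in steps:
        log.append(trail_id, "hermes", step)
    return trail_id
```

With a shared ID, a dashboard query for one trail returns the invocation row followed by every step Hermes took during that run.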

## Out of scope (defer)

- Multi-instance Hermes (M4 — single instance is enough for M3 proof)
- Tuning Hermes' system prompts per vendor (M4)
- Streaming responses (MCP spec supports it; defer until vendor demand)
- Cross-Hermes-session memory sharing (M4 with delegation work)

## Acceptance criteria

- [ ] A xiaozhi-server LLM (M1's setup) successfully calls `hermes.execute_task` for a complex task ("plan my 3-day Chengdu trip with ¥5000 budget") and gets a result back
- [ ] Hermes pulls memory via AgentKeys MCP for context — verified by audit-trail showing `memory.get` calls during the Hermes run
- [ ] End-to-end latency for non-real-time tasks stays within 30-60s (acceptable per the M3 sequencing; these are tasks, not chat turns)
- [ ] `max_duration_s` constraint enforced: a deliberately-long task times out at the configured limit and returns a graceful timeout result + audit row
- [ ] `max_cost_usd` constraint enforced: a deliberately-expensive task halts at the cost cap + audit row explaining why
- [ ] Hermes invocation is observable in #115's audit dashboard via the two-layer trail
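
The two constraint criteria above can be sketched as a loop that checks both limits before each step. This is a simplified illustration under stated assumptions: `run_with_limits` and `RunOutcome` are hypothetical names, each step is modeled as a callable returning its LLM cost in USD, and limits are only checked between steps, so a step already in flight is not interrupted and the final step may overshoot the cap slightly.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RunOutcome:
    status: str        # "completed", "timeout", or "cost_cap"
    steps_taken: int
    cost_usd: float


def run_with_limits(
    steps: Iterable[Callable[[], float]],
    max_duration_s: float,
    max_cost_usd: float,
    clock: Callable[[], float] = time.monotonic,
) -> RunOutcome:
    """Run steps until they finish or a limit is hit."""
    started = clock()
    cost = 0.0
    taken = 0
    for step in steps:
        # Check both limits before starting another step.
        if clock() - started >= max_duration_s:
            return RunOutcome("timeout", taken, cost)
        if cost >= max_cost_usd:
            return RunOutcome("cost_cap", taken, cost)
        cost += step()  # the step reports what its LLM call cost
        taken += 1
    return RunOutcome("completed", taken, cost)
```

The non-`completed` statuses are what a graceful timeout or cost-cap result could be built from, along with the audit row explaining why the run halted. The injectable `clock` is there so the timeout path can be tested without real waiting.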

## Risks

| Risk | Mitigation |
|---|---|
| Hermes' LLM costs explode under bad constraints | Server-side cost tracker is authoritative; vendor's per-actor cost limit is enforced before Hermes-side limit |
| Hermes-side state diverges from AgentKeys session state | Hermes uses AgentKeys memory worker as its persistent context; session_id maps to actor_omni-scoped memory namespace |
| Cold start latency (Hermes process + LLM warmup) is unacceptable for the demo | Pre-warm one Hermes instance per vendor; warm-pool sizing is M4 tuning |
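
For the first mitigation, one possible reading of "enforced before" is that the server clamps the caller's requested cap to whatever the vendor's per-actor budget still allows. The helper below is a hypothetical sketch of that reading; the function name, its parameters, and the idea of a "remaining budget" figure are assumptions, not part of the spec.

```python
def effective_cost_cap(requested_max_cost_usd: float, actor_remaining_usd: float) -> float:
    """Return the cap the server would actually enforce for this run."""
    # The vendor's per-actor limit wins whenever it is tighter than the request.
    cap = min(requested_max_cost_usd, actor_remaining_usd)
    # An exhausted budget yields a zero cap rather than a negative one.
    return max(0.0, cap)
```

Under this reading, a request for `max_cost_usd=2.0` from an actor with 0.5 USD left would run with a 0.5 USD cap.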

## References

- [`docs/spec/plans/milestones-roadmap.md`](https://github.com/litentry/agentKeys/blob/main/docs/spec/plans/milestones-roadmap.md) §4 (M3 scope)
- [`docs/research/agent-iam-strategy.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/agent-iam-strategy.md) — "Agent-as-MCP-tool" architectural decision + Hermes-as-MCP discussion
- [`docs/research/xiaozhi-hermes-architecture.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/xiaozhi-hermes-architecture.md) — architecture diagrams + per-turn flow
- [`docs/research/xiaozhi-hermes-risks.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/xiaozhi-hermes-risks.md) — R1-R4 risk verification (latency, concurrency, cold-construction, gateway)
- [NousResearch/hermes-agent](https://github.com/nousresearch/hermes-agent) — upstream project (MIT)
- #107 (MCP server — Hermes uses these tools recursively)
- #115 (audit dashboard — surfaces the two-layer audit trail)
- #118 (OpenClaw-MCP — second runtime; pattern this issue establishes carries to OpenClaw)

## Effort

~1-2 weeks. Sequencing:

1. (Days 1-3) Deploy Hermes-agent + storage + AgentKeys MCP client config
2. (Days 3-5) MCP server wrapper + tool signature + auth
3. (Days 5-8) Constraint enforcement (duration + cost) + audit-trail wiring
4. (Days 8-10) Integration test from xiaozhi-server → Hermes → AgentKeys MCP → S3
5. (Days 10-14) Performance pass + vendor-demo readiness

## Pickup notes for the next agent / developer

- Read [`xiaozhi-hermes-architecture.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/xiaozhi-hermes-architecture.md) first — the three-diagram explanation of where Hermes fits
- Then [`xiaozhi-hermes-risks.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/xiaozhi-hermes-risks.md) — verified risks against actual repo code with file:line citations
- The architectural decision is in [`agent-iam-strategy.md`](https://github.com/litentry/agentKeys/blob/main/docs/research/agent-iam-strategy.md) — Hermes is a **callable tool**, not an LLM-caller replacement. Don't accidentally re-architect it as the LLM the xiaozhi-server calls; that destroys the cheap-path principle.
- Hermes-agent upstream lives at [github.com/nousresearch/hermes-agent](https://github.com/nousresearch/hermes-agent); it's Python; their HTTP gateway is the surface we wrap
- For the MCP server framework: stick with the same Python SDK choice from #107
- **Watch for**: the two-layer audit trail is the **proof** that runtime-neutrality works. If you ship without it, you can't demonstrate the value to a vendor.
- Use the `/agentkeys-issue-create` skill for follow-up issues (e.g., per-runtime tuning, M4 multi-instance scaling)

