samples: add agentchat_behavioral_monitor — Ghost Consistency Score for long-running agent conversations#7484
Conversation
The Ghost Consistency Score is a smart approach to detecting behavioral drift within a single conversation. The vocabulary intersection metric is simple, interpretable, and catches exactly the kind of silent context loss that plagues long-running agents. Two thoughts on extending this beyond single conversations:

1. **CCS as a cross-session trust signal.** Behavioral drift within one conversation is detectable by the agent itself (or its orchestrator). The harder problem is drift across sessions and across organizations. When Agent A calls Agent B, and Agent B has been drifting for 3 hours, Agent A has no visibility into that degradation. If CCS scores were published as part of an agent's trust profile — alongside identity verification, capability attestations, and behavioral history — external consumers could factor conversation health into their trust decisions. An agent with CCS < 0.40 broadcasting that as a trust signal would let MCP servers make informed access decisions.

2. **Temporal trust decay maps to CCS decay.** The CCS pattern (measuring vocabulary persistence over time) mirrors how trust attestation systems handle temporal decay. In SATP's model, attestations from 6 months ago are worth less than attestations from yesterday — the same principle as CCS measuring first-25% vs last-25% vocabulary.

The connection: CCS is behavioral self-measurement. External trust scoring is behavioral third-party measurement. Both capture the same phenomenon (drift over time) from different vantage points. Combining them — internal CCS plus external behavioral attestation — gives a more complete picture than either alone.

It would be interesting to see CCS integrated into AutoGen's agent metadata so orchestrators can route tasks away from agents showing drift, similar to how load balancers route away from unhealthy nodes.
What this adds
A new sample at `python/samples/agentchat_behavioral_monitor/` with a `main.py` and `README.md`.

The problem it demonstrates: When an AutoGen conversation runs long enough to trigger history summarization or truncation, earlier task context can disappear silently. The agent keeps responding, but its answers may ignore facts it established early on — different tool choices, forgotten constraints, missing domain vocabulary.
What the sample does: It measures Ghost Consistency Score (CCS) — the fraction of vocabulary from the earliest conversation turns still present in the most recent turns. A score below 0.40 indicates likely behavioral drift.
Ghost terms are task-relevant words (`jwt`, `bcrypt`, `foreign_key`, `redis`, etc.) that appeared early but have disappeared — the most direct signal of forgotten context.
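A minimal sketch of how the score and the ghost-term set described above could be computed, assuming regex tokenization over the first and last 25% of turns and a small stopword list. The function names, tokenizer, and stopwords here are illustrative, not the sample's actual implementation:

```python
import re

# Illustrative stopword list; the real sample may filter differently.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for", "with"}

def vocabulary(turns):
    """Collect task-relevant words from a list of message strings."""
    words = set()
    for turn in turns:
        for token in re.findall(r"[a-z_][a-z0-9_]+", turn.lower()):
            if token not in STOPWORDS and len(token) > 2:
                words.add(token)
    return words

def ghost_consistency_score(turns, window=0.25):
    """Fraction of first-window vocabulary still present in the last window."""
    n = max(1, int(len(turns) * window))
    early = vocabulary(turns[:n])
    recent = vocabulary(turns[-n:])
    if not early:
        return 1.0  # nothing established early, so nothing can be lost
    return len(early & recent) / len(early)

def ghost_terms(turns, window=0.25):
    """Early vocabulary that has disappeared from the recent turns."""
    n = max(1, int(len(turns) * window))
    return vocabulary(turns[:n]) - vocabulary(turns[-n:])
```

With this shape, a conversation that drops its early `jwt`/`bcrypt` vocabulary scores near 0, while one that keeps repeating its established terms stays near 1.0.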
Running it
```shell
cd python/samples/agentchat_behavioral_monitor
python main.py
```

No extra dependencies beyond the stdlib.
Integrating it into AgentChat
The sample now stays on the public AgentChat surface:
- `AssistantAgent.run()` or `run_stream()`
- `TaskResult.messages`
- `BehavioralMonitor.observe_result()`

It does not monkey-patch private agent internals.
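Assuming that surface, an orchestrator-side usage might look like the sketch below. `Message` is a stand-in for the entries in `TaskResult.messages`, and the `BehavioralMonitor` internals are a simplified guess at the sample's behavior (first/last 25% vocabulary overlap against the 0.40 threshold), not its actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """Stand-in for an AgentChat message; only .content is used here."""
    content: str

@dataclass
class BehavioralMonitor:
    threshold: float = 0.40            # drift threshold from the sample's README
    _turns: list = field(default_factory=list)

    def observe_result(self, messages):
        """Record message text from a completed run (e.g. TaskResult.messages)."""
        self._turns.extend(m.content for m in messages)

    def ccs(self):
        """Vocabulary overlap between the first and last 25% of observed turns."""
        n = max(1, len(self._turns) // 4)
        early = {w for t in self._turns[:n] for w in t.lower().split()}
        recent = {w for t in self._turns[-n:] for w in t.lower().split()}
        if not early:
            return 1.0
        return len(early & recent) / len(early)

    def is_drifting(self):
        """True once CCS falls below the threshold."""
        return self.ccs() < self.threshold
```

An orchestrator could call `observe_result()` after each `run()` and consult `is_drifting()` before routing the next task, in the load-balancer spirit suggested in the discussion above.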
Connection to existing discussion
This sample directly addresses the ghost-lexicon + behavioral footprint pattern from the production reliability discussion in #7265.
Scope
`python/samples/agentchat_behavioral_monitor/main.py` and `README.md`