samples: add agentchat_behavioral_monitor — Ghost Consistency Score for long-running agent conversations#7484
Conversation
The Ghost Consistency Score is a smart approach to detecting behavioral drift within a single conversation. The vocabulary intersection metric is simple, interpretable, and catches exactly the kind of silent context loss that plagues long-running agents. Two thoughts on extending this beyond single conversations:

1. **CCS as a cross-session trust signal.** Behavioral drift within one conversation is detectable by the agent itself (or its orchestrator). The harder problem is drift across sessions and across organizations. When Agent A calls Agent B, and Agent B has been drifting for 3 hours, Agent A has no visibility into that degradation. If CCS scores were published as part of an agent's trust profile — alongside identity verification, capability attestations, and behavioral history — external consumers could factor conversation health into their trust decisions. An agent with CCS < 0.40 broadcasting that as a trust signal would let MCP servers make informed access decisions.

2. **Temporal trust decay maps to CCS decay.** The CCS pattern (measuring vocabulary persistence over time) mirrors how trust attestation systems handle temporal decay. In SATP's model, attestations from 6 months ago are worth less than attestations from yesterday — the same principle as CCS measuring first-25% vs last-25% vocabulary.

The connection: CCS is behavioral self-measurement. External trust scoring is behavioral third-party measurement. Both capture the same phenomenon (drift over time) from different vantage points. Combining them — internal CCS plus external behavioral attestation — gives a more complete picture than either alone.

It would be interesting to see CCS integrated into AutoGen's agent metadata so orchestrators can route tasks away from agents showing drift, similar to how load balancers route away from unhealthy nodes.
What this adds
A new sample at `python/samples/agentchat_behavioral_monitor/` with a `main.py` and `README.md`.

The problem it demonstrates: When an AutoGen conversation runs long enough to trigger history summarization or truncation, earlier task context can disappear silently. The agent keeps responding, but its answers may ignore facts it established early on — different tool choices, forgotten constraints, missing domain vocabulary.
What the sample does: It measures Ghost Consistency Score (CCS) — the fraction of vocabulary from the earliest conversation turns still present in the most recent turns. A score below 0.40 indicates likely behavioral drift.
Ghost terms are task-relevant words (`jwt`, `bcrypt`, `foreign_key`, `redis`, etc.) that appeared early but have disappeared — the most direct signal of forgotten context.
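A minimal sketch of how the score and the ghost-term set described above could be computed, assuming regex tokenization over the first and last 25% of turns and a small stopword list. The function names, tokenizer, and stopwords here are illustrative, not the sample's actual implementation:

```python
import re

# Illustrative stopword list; the real sample may filter differently.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for", "with"}

def vocabulary(turns):
    """Collect task-relevant words from a list of message strings."""
    words = set()
    for turn in turns:
        for token in re.findall(r"[a-z_][a-z0-9_]+", turn.lower()):
            if token not in STOPWORDS and len(token) > 2:
                words.add(token)
    return words

def ghost_consistency_score(turns, window=0.25):
    """Fraction of first-window vocabulary still present in the last window."""
    n = max(1, int(len(turns) * window))
    early = vocabulary(turns[:n])
    recent = vocabulary(turns[-n:])
    if not early:
        return 1.0  # nothing established early, so nothing can be lost
    return len(early & recent) / len(early)

def ghost_terms(turns, window=0.25):
    """Early vocabulary that has disappeared from the recent turns."""
    n = max(1, int(len(turns) * window))
    return vocabulary(turns[:n]) - vocabulary(turns[-n:])
```

With this shape, a conversation that drops its early `jwt`/`bcrypt` vocabulary scores near 0, while one that keeps repeating its established terms stays near 1.0.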
Running it
```shell
cd python/samples/agentchat_behavioral_monitor
python main.py
```

No extra dependencies beyond the stdlib.
Integrating it into AgentChat
The sample now stays on the public AgentChat surface:
- `AssistantAgent.run()` or `run_stream()`
- `TaskResult.messages`
- `BehavioralMonitor.observe_result()`

It does not monkey-patch private agent internals.
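Assuming that surface, an orchestrator-side usage might look like the sketch below. `Message` is a stand-in for the entries in `TaskResult.messages`, and the `BehavioralMonitor` internals are a simplified guess at the sample's behavior (first/last 25% vocabulary overlap against the 0.40 threshold), not its actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """Stand-in for an AgentChat message; only .content is used here."""
    content: str

@dataclass
class BehavioralMonitor:
    threshold: float = 0.40            # drift threshold from the sample's README
    _turns: list = field(default_factory=list)

    def observe_result(self, messages):
        """Record message text from a completed run (e.g. TaskResult.messages)."""
        self._turns.extend(m.content for m in messages)

    def ccs(self):
        """Vocabulary overlap between the first and last 25% of observed turns."""
        n = max(1, len(self._turns) // 4)
        early = {w for t in self._turns[:n] for w in t.lower().split()}
        recent = {w for t in self._turns[-n:] for w in t.lower().split()}
        if not early:
            return 1.0
        return len(early & recent) / len(early)

    def is_drifting(self):
        """True once CCS falls below the threshold."""
        return self.ccs() < self.threshold
```

An orchestrator could call `observe_result()` after each `run()` and consult `is_drifting()` before routing the next task, in the load-balancer spirit suggested in the discussion above.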
Connection to existing discussion
This sample directly addresses the ghost-lexicon + behavioral footprint pattern from the production reliability discussion in #7265.
Scope
`python/samples/agentchat_behavioral_monitor/main.py` and `README.md`