feat: Case Memory (Memory Case) - cross-session fault-troubleshooting experience memory for SRE/AIOps Agents #194

Open
niiish32x wants to merge 14 commits into derisk-ai:main from niiish32x:feat/nsh_memory

Conversation

niiish32x (Contributor) commented May 7, 2026

feat: Case Memory (Memory Case) - cross-session fault-troubleshooting experience memory for SRE/AIOps Agents

Summary

Gives OpenDerisk Agents cross-session memory of fault cases. Before troubleshooting, the Agent automatically retrieves similar historical cases for inspiration; afterwards, it writes its conclusions back to accumulate experience. Technically it is a hybrid of MySQL FULLTEXT and ChromaDB vector semantic retrieval, exposed to the Agent as 4 LLM-callable tools through a built-in MCP plugin, and integrated into any Agent implementation in a zero-intrusion, pluggable way.

In one sentence

"Before each troubleshooting run, glance at the similar pitfalls of the past; when done, write the experience back, so the next person who hits the problem takes fewer detours."

Core design

1. Zero-intrusion pluggable architecture

The Agent layer never imports derisk-ext; decoupling is achieved in three layers:

  • ResourceResolver: at startup, derisk-ext registers the tool(memory_case) → MemoryCaseToolPack mapping
  • ConversationScopeHooks: core defines the hook protocol, ext implements it, serve wires it up. The Agent only calls bind_conversation_scope_for_agent(app_code, conv_id)
  • MCP Service: dependency injection assembles the DAO, vector index, and configuration
graph TB
    subgraph derisk_core["derisk-core (Agent layer)"]
        Agent["BAIZE / ReAct / Core v2 Agent"]
        Hooks["ConversationScopeHooks<br/>(hook protocol definition)"]
        Resolver["ResourceResolver<br/>(registration entry)"]
    end

    subgraph derisk_ext["derisk-ext (plugin implementation)"]
        ToolPack["MemoryCaseToolPack<br/>registers 4 tools + scope injection"]
        Service["MemoryCasePluginService<br/>search / upsert / feedback / render"]
        DAO["MemoryCaseDao<br/>(SQLAlchemy)"]
        Vector["CandidateCaseVectorIndex<br/>(ChromaDB, lazy init)"]
    end

    subgraph derisk_serve["derisk-serve (assembly layer)"]
        MCP["McpService.init_app()<br/>dependency injection + registration"]
        Bind["bind_memory_case_scope_for_agent()"]
    end

    Agent -->|"bind_conversation_scope_for_agent(app_code, conv_id)"| Hooks
    Hooks -.->|"iterate registered hooks"| Bind
    Bind -->|"ContextVar.set(scope)"| ToolPack
    Agent -->|"tool_call"| Resolver
    Resolver -->|"tool(memory_case)"| ToolPack
    ToolPack --> Service
    Service --> DAO
    Service --> Vector
    MCP --> Service

    style derisk_core fill:#e1f5fe,stroke:#0288d1
    style derisk_ext fill:#fff3e0,stroke:#f57c00
    style derisk_serve fill:#e8f5e9,stroke:#388e3c

Key point: from derisk_ext.plugin.memory_case import ... never appears in Agent code; replacing or uninstalling the plugin is completely transparent to the Agent.

2. Hybrid retrieval: DB full-text + vector semantics

graph TD
    Query["memory_case_search('Pod OOM JVM heap memory')"]
    
    Query --> DB["Step 1: MySQL FULLTEXT<br/>5-column index"]
    Query --> Vec["Step 2: ChromaDB vector search"]
    
    DB --> DBok{"success?"}
    DBok -->|"yes"| DBResult["MATCH ... AGAINST<br/>NATURAL LANGUAGE MODE"]
    DBok -->|"no (err 1191)"| Fallback["LIKE '%query%' fallback"]
    
    Vec --> VecOk{"ChromaDB available?"}
    VecOk -->|"yes"| VecResult["embedding(query) →<br/>semantic similarity search"]
    VecOk -->|"no"| Degraded["degraded: true<br/>main path not blocked"]
    
    DBResult --> Merge["Step 3: merge + dedupe"]
    Fallback --> Merge
    VecResult --> Merge
    Degraded --> Merge
    
    Merge --> Rank["Wilson-score ranking<br/>+ scope awareness<br/>+ time decay"]
    Rank --> Filter["filter: REJECTED<br/>+ confidence < 0.5"]
    Filter --> Summary["return lightweight summaries<br/>~500 chars each"]
    
    Summary --> Backfill["lazy backfill: cases hit in DB<br/>but missing from the vector index<br/>are auto-added to ChromaDB"]

    style DB fill:#e3f2fd,stroke:#1565c0
    style Vec fill:#fce4ec,stroke:#c62828
    style Merge fill:#fff3e0,stroke:#ef6c00
    style Summary fill:#e8f5e9,stroke:#2e7d32

Where vectors sit in the system: DB full-text retrieval is the main path; vectors are a semantic supplement. If the vector store is unavailable, the service degrades without interruption. This inverts the vector-first pattern of generic RAG systems: in SRE scenarios, exact keyword matches (OOM, GC, thread pool) are often more reliable than semantic similarity.
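The merge-and-dedupe step (Step 3 in the diagram) can be sketched like this. The data shapes and function name are illustrative assumptions, not the plugin's actual API; a real ranker would apply the Wilson score and time decay described later, which plain score ordering stands in for here.

```python
# Sketch of merging FULLTEXT and vector hits; illustrative names and shapes.
from typing import Dict, List

def merge_results(db_hits: List[dict], vec_hits: List[dict]) -> List[dict]:
    """Union both hit lists, dedupe by case_id, keep the best score per case."""
    merged: Dict[str, dict] = {}
    for hit in db_hits + vec_hits:
        prev = merged.get(hit["case_id"])
        if prev is None or hit["score"] > prev["score"]:
            merged[hit["case_id"]] = hit
    # Stand-in for the real ranking (Wilson score + scope + time decay).
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)

db_hits = [{"case_id": "c1", "score": 0.7}, {"case_id": "c2", "score": 0.4}]
vec_hits = [{"case_id": "c2", "score": 0.9}]   # semantic hit on the same case
ranked = merge_results(db_hits, vec_hits)
print([h["case_id"] for h in ranked])  # ['c2', 'c1']
```

Note that either input list may be empty (FULLTEXT error or degraded vector store) and the merge still works, which is what makes the degradation non-blocking.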

3. Two-step interaction: Search for summary discovery → Render for on-demand depth

memory_case_search returns only ~500-character summaries (without the full markdown_summary and diagnosis). The Agent skims them, picks the relevant cases, then loads the full Markdown by case_id with memory_case_render.

Rationale: returning 5 full cases is roughly 12,500 characters, which easily drowns the Agent in text. With summaries, the initial screen takes only ~2,500 characters, after which render digs into 1-2 cases on demand (~6,500 characters). render automatically attaches the related similar_cases, forming a three-step "discover → deep dive → expand relations" information flow.

Search summary fields (_to_summary()):

| Field | Description |
| --- | --- |
| symptom_summary | Full; the Agent uses it to judge whether this is the same class of fault |
| diagnosis_preview | First 300 chars; enough to judge whether the troubleshooting approach is relevant |
| diagnosis_len | Number; tells the Agent how large the full text is |
| root_cause | Full; a one-liner with the highest signal density |
| resolution | Full; the Agent judges whether the fix is reusable |
| confidence + lifecycle | Trustworthiness signals |
| feedback_h / feedback_u / feedback_cv_count | Feedback stats; help judge how reliable the experience is |
| similar_count | Number; shows whether the case is isolated or well-connected |
| markdown_summary | ❌ Dropped (the largest chunk); loaded on demand via render |
| full diagnosis | ❌ Dropped; only the 300-char preview is given |
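The summarization the table describes can be sketched as follows. Field names follow the table, but this body is an illustration, not the plugin's actual _to_summary() implementation.

```python
# Sketch of building a lightweight search summary; illustrative implementation.
def to_summary(case: dict, preview_len: int = 300) -> dict:
    diagnosis = case.get("diagnosis", "")
    return {
        "symptom_summary": case.get("symptom_summary", ""),  # kept in full
        "diagnosis_preview": diagnosis[:preview_len],        # first 300 chars
        "diagnosis_len": len(diagnosis),                     # so Agent knows size
        "root_cause": case.get("root_cause", ""),            # kept in full
        "resolution": case.get("resolution", ""),            # kept in full
        "confidence": case.get("confidence"),
        "lifecycle": case.get("lifecycle"),
        "similar_count": len(case.get("similar_cases", [])),
        # markdown_summary is deliberately dropped; render loads it on demand.
    }

case = {"symptom_summary": "Pod OOMKilled", "diagnosis": "x" * 1200,
        "root_cause": "heap too small", "resolution": "raise -Xmx",
        "confidence": 0.8, "lifecycle": "ACCEPTED", "similar_cases": ["c9"],
        "markdown_summary": "y" * 5000}
s = to_summary(case)
print(s["diagnosis_len"], len(s["diagnosis_preview"]))  # 1200 300
```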

4. Three-layer automatic case-relation linking

When a case is written, semantically similar cases are discovered automatically, with layered filters to keep cases that are textually similar but follow different troubleshooting paths from being linked by mistake:

graph TD
    NewCase["newly written case<br/>triggered after upsert()"]
    
    NewCase --> L1["Layer 1: weighted per-section vector screening ✅"]
    
    L1 --> L1S["symptom section (weight 0.1)<br/>separate ChromaDB search"]
    L1 --> L1D["diagnosis section (weight 0.5)<br/>separate ChromaDB search"]
    L1 --> L1R["root_cause section (weight 0.4)<br/>separate ChromaDB search"]
    
    L1S --> Merge["weighted score merge<br/>symptom×0.1 + diagnosis×0.5 + root_cause×0.4"]
    L1D --> Merge
    L1R --> Merge
    
    Merge --> Filter1{"weighted >= 0.6?"}
    Filter1 -->|"no"| Drop["discard"]
    Filter1 -->|"yes"| L2
    
    L2["Layer 2: structured cross-validation ✅"]
    L2 --> Check{"failure_layer matches?<br/>runtime matches?<br/>related_services overlap?"}
    
    Check -->|"all match"| StructOK["struct_match: true"]
    Check -->|"fields missing"| StructPass["struct_match: false<br/>(missing fields carry no penalty)"]
    Check -->|"explicit conflict"| StructFail["struct_match: false<br/>blocks false association"]
    
    StructOK --> L3
    StructPass --> L3
    StructFail --> MarkSurface["marked: surface_similar<br/>textually alike, different troubleshooting path"]
    
    L3["Layer 3: relation type classification"]
    L3 --> RootHigh{"root_cause >= 0.8<br/>AND struct_match?"}
    RootHigh -->|"yes"| SameRoot["same_root_cause"]
    RootHigh -->|"no"| DiagHigh{"diagnosis >= 0.7<br/>AND struct_match?"}
    DiagHigh -->|"yes"| SimDiag["similar_diagnosis"]
    DiagHigh -->|"no"| SurfSim["surface_similar"]

    style L1 fill:#e3f2fd,stroke:#1565c0
    style L2 fill:#fff3e0,stroke:#ef6c00
    style L3 fill:#f3e5f5,stroke:#7b1fa2
    style SameRoot fill:#c8e6c9,stroke:#2e7d32
    style SimDiag fill:#c8e6c9,stroke:#2e7d32
    style MarkSurface fill:#ffcdd2,stroke:#c62828
    style SurfSim fill:#ffcdd2,stroke:#c62828

Concrete example: why is Layer 2 structural validation needed?

case-A: order-svc Pod OOMKilled
  → JVM heap configured too small at 512M → raise -Xmx → resolved
  failure_layer: jvm, runtime: java, related_services: [order-svc]

case-B: trade-svc Pod OOMKilled
  → downstream risk-engine memory leak, requests piling up → upgrade risk-engine → resolved
  failure_layer: application, runtime: go, related_services: [risk-engine]

Layer 1 vector score: 0.89 ("Pod / OOMKilled / JVM / memory / GC" recur in both)
Layer 2 cross-validation: failure_layer differs (jvm ≠ application) + runtime differs (java ≠ go)
                 + related_services disjoint → struct_match: false
Final verdict: surface_similar (textually alike, completely different troubleshooting paths)
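The three layers can be condensed into a short sketch. The weights and thresholds come from the diagram above; the function bodies are illustrative, not the actual cross_validate_relation() / _classify_relation() code.

```python
# Condensed sketch of Layers 1-3; weights/thresholds from the design above.
def weighted_score(sym: float, diag: float, root: float) -> float:
    return 0.1 * sym + 0.5 * diag + 0.4 * root

def struct_match(a: dict, b: dict) -> bool:
    """Explicit conflicts block the pair; missing fields carry no penalty."""
    for key in ("failure_layer", "runtime"):
        if a.get(key) and b.get(key) and a[key] != b[key]:
            return False
    sa = set(a.get("related_services", []))
    sb = set(b.get("related_services", []))
    if sa and sb and not (sa & sb):
        return False
    return True

def classify(sym: float, diag: float, root: float, a: dict, b: dict) -> str:
    if weighted_score(sym, diag, root) < 0.6:
        return "dropped"                       # Layer 1 screen
    if not struct_match(a, b):
        return "surface_similar"               # Layer 2 blocks false links
    if root >= 0.8:
        return "same_root_cause"               # Layer 3 classification
    if diag >= 0.7:
        return "similar_diagnosis"
    return "surface_similar"

# case-A vs case-B from the example: text is alike, structure conflicts.
a = {"failure_layer": "jvm", "runtime": "java", "related_services": ["order-svc"]}
b = {"failure_layer": "application", "runtime": "go", "related_services": ["risk-engine"]}
print(classify(0.9, 0.85, 0.9, a, b))  # surface_similar
```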

5. Structured feedback + Wilson-score ranking

Feedback is recorded in structured form across three scopes:

"fb": {
  "global":           {"h": 8, "u": 2, "ts": "...", "cv": ["conv-a", "conv-c"]},
  "by_app":           {"order-svc": {"h": 6, "u": 0, "ts": "..."}},
  "by_app_env":       {"order-svc:production": {"h": 5, "u": 0, "ts": "..."}}
}

Ranking formula

rank_score = base × exp(-λ × days)

base = weight × wilson_score(h, total) + (1-weight) × confidence
       ↑                                       ↑
   empirical score (more samples, more trust)   LLM prior (backstop when feedback is scarce)

weight = min(1.0, total / 10)

Wilson-score effect: the smaller the sample, the more conservative the lower bound:

| helpful/total | naive ratio | Wilson lower bound | meaning |
| --- | --- | --- | --- |
| 2/2 | 1.00 | 0.34 | sample too small, not trusted |
| 8/10 | 0.80 | 0.49 | contested, pushed down |
| 15/15 | 1.00 | 0.82 | heavily validated, high confidence |
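The formula can be sketched directly. wilson_score below is the standard Wilson lower bound at z = 1.96; the table's exact figures may use slightly different rounding or a different z, so treat the outputs as approximate. The λ default is an illustrative placeholder.

```python
# Sketch of the ranking formula; standard Wilson lower bound, z = 1.96.
import math

def wilson_score(helpful: int, total: int, z: float = 1.96) -> float:
    if total == 0:
        return 0.0
    p = helpful / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

def rank_score(h: int, total: int, confidence: float,
               days: float, lam: float = 0.01) -> float:
    weight = min(1.0, total / 10)                  # trust grows with samples
    base = weight * wilson_score(h, total) + (1 - weight) * confidence
    return base * math.exp(-lam * days)            # time decay

print(round(wilson_score(2, 2), 2), round(wilson_score(8, 10), 2))  # 0.34 0.49
```

The interesting property is visible in the first line: 2/2 helpful looks perfect as a raw ratio but scores a conservative 0.34, below the contested 8/10.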

Scope-aware three-level fallback: the same case is automatically ranked differently in different contexts.

graph LR
    Search["Agent search<br/>app_code=order-svc<br/>env=production"]
    
    Search --> L1["look up fb.by_app_env<br/>['order-svc:production']"]
    L1 --> L1Check{"total >= 3?"}
    L1Check -->|"yes"| Use1["use this layer's Wilson score"]
    L1Check -->|"not enough"| L2
    
    L2["fall back: fb.by_app<br/>['order-svc']"]
    L2 --> L2Check{"total >= 3?"}
    L2Check -->|"yes"| Use2["use this layer's Wilson score"]
    L2Check -->|"not enough"| L3
    
    L3["fall back: fb.global"]
    L3 --> L3Check{"total > 0?"}
    L3Check -->|"yes"| Use3["use this layer's Wilson score"]
    L3Check -->|"none"| Fallback["backstop: LLM confidence prior"]

    style Use1 fill:#c8e6c9,stroke:#2e7d32
    style Use2 fill:#fff9c4,stroke:#f9a825
    style Use3 fill:#fff3e0,stroke:#ef6c00
    style Fallback fill:#ffcdd2,stroke:#c62828
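The fallback chain in the diagram can be sketched as one small function. The fb structure matches the JSON shown earlier; the function name and the MIN_SAMPLES constant are illustrative assumptions.

```python
# Sketch of the three-level feedback fallback; illustrative helper.
MIN_SAMPLES = 3

def pick_feedback(fb: dict, app_code: str, env: str):
    """Return (helpful, total) from the most specific layer with enough data,
    or None so the caller falls back to the LLM confidence prior."""
    layers = [
        fb.get("by_app_env", {}).get(f"{app_code}:{env}"),
        fb.get("by_app", {}).get(app_code),
        fb.get("global"),
    ]
    for layer in layers:
        if layer:
            total = layer.get("h", 0) + layer.get("u", 0)
            # Specific layers need >= 3 samples; global accepts any nonzero.
            if total >= MIN_SAMPLES or (layer is layers[-1] and total > 0):
                return layer.get("h", 0), total
    return None

fb = {"global": {"h": 8, "u": 2},
      "by_app": {"order-svc": {"h": 6, "u": 0}},
      "by_app_env": {"order-svc:production": {"h": 1, "u": 0}}}
# The env layer has only 1 sample, so lookup falls through to by_app.
print(pick_feedback(fb, "order-svc", "production"))  # (6, 6)
```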

DRAFT → ACCEPTED gate

All of the following must hold:
  confidence >= 0.8
  AND fb.global.h >= 2          (at least 2 helpful votes)
  AND fb.global.cv contains a conversation other than source_conv_id  (independent cross-session validation)

This prevents the Agent from producing and endorsing a case within the same troubleshooting run. Same-conversation feedback only adjusts confidence and never changes lifecycle.
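The gate conditions above amount to a short predicate. Field names follow the fb JSON shown earlier; the function itself is an illustrative sketch, not the plugin's code.

```python
# Sketch of the DRAFT -> ACCEPTED promotion gate; illustrative predicate.
def can_promote(case: dict) -> bool:
    g = case.get("fb", {}).get("global", {})
    # At least one validating conversation must differ from the source
    # conversation, i.e. an independent cross-session confirmation.
    cross_session = any(cv != case.get("source_conv_id")
                        for cv in g.get("cv", []))
    return (case.get("confidence", 0) >= 0.8
            and g.get("h", 0) >= 2
            and cross_session)

case = {"confidence": 0.85, "source_conv_id": "conv-a",
        "fb": {"global": {"h": 3, "u": 0, "cv": ["conv-a", "conv-c"]}}}
print(can_promote(case))  # True: conv-c is an independent validation
```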


6. Zero-intrusion scope injection

The Agent is entirely unaware of scope logic. Three priority levels:

1. scope explicitly passed by the LLM (highest priority)
2. scope held in a ContextVar (the app_code / conv_id bound when the Agent was built, injected automatically)
3. fallback "default" (wildcard, no filtering)

inject_memory_scope() is the single entry point for scope injection, shared by the MCP service path and the ToolPack path, so the logic is never duplicated.

On write, app_code / environment from the scope are automatically injected into case_context and conv_id into source_conv_id; the LLM never fills these in manually.
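The three-level priority can be sketched as follows. inject_memory_scope is the name used above, but this body (and the ContextVar variable) is illustrative, not the plugin's actual implementation.

```python
# Sketch of three-level scope priority; illustrative implementation.
from contextvars import ContextVar

_bound_scope: ContextVar[dict] = ContextVar("bound_scope", default={})

def inject_memory_scope(tool_args: dict) -> dict:
    explicit = tool_args.get("scope") or {}   # 1) explicit LLM value wins
    bound = _bound_scope.get()                # 2) else the bound ContextVar
    scope = {
        # 3) else "default", which acts as a wildcard (no filtering)
        "app_code": explicit.get("app_code") or bound.get("app_code") or "default",
        "environment": explicit.get("environment") or bound.get("environment") or "default",
    }
    return {**tool_args, "scope": scope}

_bound_scope.set({"app_code": "order-svc"})
args = inject_memory_scope({"query": "Pod OOM"})
print(args["scope"])  # {'app_code': 'order-svc', 'environment': 'default'}
```

Because both the MCP service path and the ToolPack path call this one function, a scope-resolution change never has to be made twice.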

File layout

packages/derisk-ext/src/derisk_ext/plugin/memory_case/
├── __init__.py             # public API exports
├── models.py               # CandidateCase, lifecycle, FeedbackStats, wilson_score
├── service.py              # core service: search / upsert / feedback / render
├── sqlalchemy_dao.py       # MySQL/SQLite persistence (FULLTEXT + LIKE fallback)
├── vector_index.py         # ChromaDB vector index (lazy init + graceful degradation)
├── case_context.py         # scope filtering + structured cross-validation
├── markdown.py             # Markdown rendering
├── tool_pack.py            # Agent ToolPack adapter + scope injection
├── integration.py          # ResourceResolver registration
├── scope_binding.py        # ContextVar scope binding
├── plugin_resolver.py      # plugin resolver
├── resource_resolve.py     # resource resolution
├── dao_protocol.py         # DAO abstraction protocol
├── tests/
│   ├── test_memory_case_service.py  # core service tests (38)
│   ├── test_dao_upsert.py           # DAO tests
│   └── test_vector_search.py        # vector search tests
└── skill/memory_case_agent/
    └── SKILL.md             # Agent usage skill (v2.0)

# Cross-package integration points
packages/derisk-core/        # ConversationScopeHooks, ResourceResolver
packages/derisk-serve/       # MCP Service, Agent scope binding
packages/derisk-app/         # feature plugin assembly

nishenghao.nsh and others added 6 commits May 5, 2026 16:42
## Severity

In Function Calling mode the Agent falls into an infinite loop: the LLM keeps repeating the same tool call and cannot break out on its own. User requests hang indefinitely, the only escape is a manual Ctrl+C, and the wasted LLM calls burn tokens and time. This had reached must-fix status.

## Root causes

1. **DoomLoopDetector's block is ineffective**: in react_master_agent, when the doom-loop detector catches a repeated call it returns an ActionOutput with terminate=False and have_retry=True, so the generate_reply loop continues, the LLM re-issues the same call, and a "meta doom loop" forms
2. **core/agent/base.py has no loop detection at all**: the ProductionAgent path has no doom-loop protection and relies only on max_retry_count (default 3) as a backstop, but ReActMasterAgent sets it to 300, which is effectively unlimited
3. **The core_v2 DoomLoopDetector's ActionResult carries no terminate signal**: ReActReasoningAgent.act() returns ActionResult(success=False), but the enhanced_agent.py main loop never checks the doom-loop metadata and keeps executing
4. **Counting by tool name causes false positives**: five consecutive read calls on five different files get flagged as a loop; counting should key on a tool_name+args hash, so only identical tool + identical arguments count

## Fixes

1. **ReActMasterAgent**: set terminate=True and have_retry=False when the doom loop trips, so the generate_reply loop exits immediately; add _last_tool_name/_last_tool_args to track the last executed tool
2. **core_v2 enhanced_agent.py**: add hash-based detection of consecutive identical calls to the main loop, force-terminating when the same tool + same arguments exceed the threshold (5); also honor the doom_loop metadata terminate signal on ActionResult
3. **core_v2 react_reasoning_agent.py**: the doom-loop ActionResult now carries metadata={"doom_loop": True, "terminate": True}
4. **core/agent/base.py (ProductionAgent)**: add hash-based detection of consecutive identical calls
## Detection algorithm

The detection key is tool_name + args hash (consistent with the existing DoomLoopDetector); only consecutive calls with the same tool and the same arguments trigger it. This avoids false positives on legitimate runs of different-argument calls (such as reading 5 different files in a row).
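The keying scheme can be sketched as a small guard class. The threshold value mirrors the commit message; the class name and structure are illustrative, not the repository's actual detector.

```python
# Sketch of tool_name + args-hash repeat detection; illustrative guard class.
import hashlib
import json

class RepeatCallGuard:
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self._last_key = None
        self._count = 0

    def should_terminate(self, tool_name: str, args: dict) -> bool:
        # Same tool + same args -> same key; any change resets the count,
        # so reading 5 different files never trips the guard.
        key = hashlib.sha256(
            (tool_name + json.dumps(args, sort_keys=True)).encode()
        ).hexdigest()
        self._count = self._count + 1 if key == self._last_key else 1
        self._last_key = key
        return self._count > self.threshold

guard = RepeatCallGuard()
distinct = [guard.should_terminate("read", {"path": f"f{i}.py"}) for i in range(5)]
repeated = [guard.should_terminate("read", {"path": "same.py"}) for _ in range(6)]
print(any(distinct), repeated[-1])  # False True
```

json.dumps with sort_keys=True makes the hash insensitive to argument ordering, which a naive str(args) key would not be.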

## Impact on existing flows

- Normal tool-call flows are unaffected: calls with different arguments each count as 1 and never trigger
- Existing DoomLoopDetector logic is unchanged; this fix only completes the terminate signal
- The threshold is 5 (looser than DoomLoopDetector's 3), giving the agent reasonable headroom
- Existing exit mechanisms such as max_retry_count are unaffected

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Severity

In Function Calling mode, after the Agent finishes its tool calls, the final output is an empty string. The VIS layer receives an empty message, so the frontend's "impact analysis" and "root cause analysis" panels render blank: users watch the Agent run for 72 seconds and get no conclusions at all. A must-fix for production.

## Root cause

In the Function Calling loop, each round sets reply_message.content = llm_out.content, but when the LLM returns tool_calls the content is empty. When the loop exits because max_retry_count is reached, a doom loop terminates it, or for any other reason, reply_message.content is still an empty string. The VIS layer looks up the final message by output_message_id, finds content empty, and the output message is "" → the frontend shows empty analysis results.

## Fixes

1. **Force an LLM summary**: when the FC loop exits with an empty reply_message.content, make one extra LLM call (without tool definitions) to force a final analysis summary. The prompt states explicitly: "Based on the tool execution results above, please provide a comprehensive analysis and summary. Do NOT call any more tools - provide your final answer directly."
2. **Action-report fallback**: if the forced LLM call fails, stitch content together from the action_report as a backstop
3. **Block BlankAction termination right after a SKILL.md read**: track _last_tool_name and _last_tool_args; when BlankAction has terminate=True but the previous round read SKILL.md, block the termination and continue the loop (the code-level guard for the "skill does not continue executing" problem)

## Impact on existing flows

- The normal flow (LLM ends with plain text) is unaffected: reply_message.content has a value, so no extra LLM call is made
- The forced summary triggers only when content is empty, and is guarded by try-except
- _last_tool_name tracking records only non-blank tool calls and does not touch BlankAction logic
- The SKILL.md read guard fires only when terminate=True and the previous round was a view/read of SKILL.md; all other scenarios are untouched

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Severity

After reading a skill file, the Agent outputs a summary of the skill's content and stops, instead of following the skill's methodology and continuing to call tools. Users see "read the skill" rather than "ran the analysis": the Agent treats "read the skill" as "job done". This is a fundamental usability problem for the skill system; left unfixed, the skill-loading mechanism is effectively useless.

## Root causes

1. **Missing constraint in the prompt**: the skills.md.j2 template only says "read SKILL.md with view, internalize the instructions and apply them immediately", but never states that reading is a preliminary step, not task completion
2. **Missing constraint in the sandbox prompt**: AGENT_SKILL_SYSTEM_PROMPT likewise never stresses that execution must continue after the read
3. **BlankAction terminates unconditionally**: after reading SKILL.md, the LLM returns a plain-text summary (no tool_calls), triggering BlankAction(terminate=True), and the generate_reply loop exits outright. (The code-level guard was added in the previous commit)

## Fixes

1. **shared/skills.md.j2**: add a "key constraints" section stating explicitly:
   - reading SKILL.md is a preparatory step, not task completion
   - after reading a skill, the Agent must continue making tool calls following the skill's methodology and toolchain
   - outputting a summary or conclusion directly after reading a skill is forbidden
2. **react_master_agent/skills.md.j2**: add the same constraints
3. **sandbox/prompt.py AGENT_SKILL_SYSTEM_PROMPT**: add the "reading is not completion" constraint

## Impact on existing flows

- The prompt change only adds constraining instructions; no tool-call logic is modified
- Agent flows that use no skills are unaffected
- Skill loading and parsing are unaffected
- The code-level guard (blocking BlankAction termination after a SKILL.md read, from the previous commit) acts as a second line of defense: even if the LLM ignores the prompt constraint, premature termination is still blocked

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Design intent:
- Case memory (memory_case) ships as a built-in MCP plugin, surfaced through the standard MCP list API
- Users select or deselect it in the UI on demand, making it pluggable and replacing the old auto-injection mode
- A virtual MCP entry (no DB row) is injected into the list API; the frontend distinguishes Built-in from External MCP
- A backend resource-normalization layer ensures mcp(derisk):memory_case → tool(memory_case) routes correctly

Core changes:

Backend - MCP service layer:
- mcp/service/service.py: filter_list_page injects the virtual memory_case entry,
  _is_builtin_memory_mcp() accepts both mcp_code and the display name,
  connect_mcp/list_tools/call_tool all use the helper to identify the built-in MCP
- mcp/api/schemas.py: ServeRequest/ServerResponse gain an is_builtin virtual field
- mcp/models/models.py: from_request strips is_builtin, to_response defaults to False
- mcp/config.py: memory_plugin_enabled controls the plugin infrastructure (service startup + virtual entry visibility)
- mcp/memory_case/: complete MemoryCasePluginService implementation (4 tools + vector index + DB persistence)

Backend - Agent resource routing:
- agent/agents/chat/agent_chat.py: remove the ext_cfg.memory_plugin_enabled auto-injection,
  add mcp(derisk):memory_case → tool(memory_case) normalization,
  redirect memory_case in chat_in_params
- agent/core_v2_adapter.py: likewise remove auto-injection, add normalization
- agent/resource/tool/memory_case.py: MemoryCaseToolPack implementation
- derisk-core/agent/core_v2/agent_binding.py: tool(memory_case) resource-type resolution

Frontend - visual selection in the UI:
- tab-skills.tsx: grouped MCP view (built-in/external); memory_case typed by mcp_code
- connectors-modal.tsx: purple badge for Built-in MCP
- home-chat.tsx: McpChip built-in/external styling
- chat/page.tsx: initMessage.mcps routes the tool(memory_case) sub_type
- unified-chat-input.tsx: preserve the tool(memory_case) parameter across messages

Docs:
- docs/MEMORY_CASE_MCP_PLUGIN.md: full design doc; onboarding changed to visual selection in the UI

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Problem

GET /api/v1/app/resources/list?type=app, while building the dynamic parameters for App resources, synchronously calls AppManager.get_derisks() -> AppService.sync_app_list(). The app-list query was migrated to async (async_app_list/app_list), but the synchronous entry point sync_app_list is missing, raising AttributeError: Service has no sync_app_list.

## Impact if left unfixed

- Every fetch of the type=app resource list logs an ERROR in the background, with the stack pointing at app_agent_manage / building app Service.

- App-type resource dropdowns and parameter discovery cannot get the real app list: the controller catches the exception and falls back to an empty list, so the UI shows no (or missing) selectable apps, breaking app binding and orchestration configuration.

- get_app incorrectly awaits the synchronous method sync_app_detail; called on an async path this raises TypeError (awaiting a non-awaitable).

## Changes

- building/app/service/service.py: add sync_app_list with filtering and pagination semantics aligned with async_app_list, using a synchronous ORM session for resource-discovery code that can only run in a sync context.

- agent/agents/app_agent_manage.py: get_app now returns sync_app_detail(app_code) directly, dropping the invalid await.

Co-authored-by: Cursor <cursoragent@cursor.com>
…earch

Case memory is consolidated under derisk_ext.plugin.memory_case:
- MemoryCasePluginService (MCP tools: search/upsert/feedback/render), DAO protocol,
  SQLAlchemy MemoryCaseDao on derisk_plugin_memory_case, optional Chroma vector index.
- Routing and provenance live only in metadata_json.case_context (no app_code/env
  table columns). Search narrows via JSON_EXTRACT; app_code/environment omitted or
  'default' act as wildcards; lexical query uses InnoDB FULLTEXT ft_memory_case_nl
  (8 columns incl. hypotheses/actions) with LIKE fallback on MySQL 1191.
- ResourceResolver registration + MemoryCaseToolPack; skill memory-case-agent.

Zero-intrusive wiring for core/ serve:
- derisk-core conversation_scope_hooks: bind_conversation_scope_for_agent; ext
  integration registers bind_memory_case_scope_for_agent once.
- agent_chat / core_v2_adapter call only core hook (no derisk_ext import).
- chat_in_params maps generic tool(...) to AgentResource for built-in tools.

Also: derisk-serve MCP builtin wiring, mcp_utils in core+serve for in-process tools,
vis converter guard for tests, docs (MEMORY_CASE_MCP_PLUGIN, CASE_MEMORY_AGENT_PLAYBOOK),
assets/schema/derisk.sql, optional derisk-serve[memory_case] + uv.lock.

Excludes: static web bundles (not committed).
Co-authored-by: Cursor <cursoragent@cursor.com>
github-actions bot added the enhancement (New feature or request) label May 7, 2026
nishenghao.nsh and others added 8 commits May 8, 2026 11:32
… auto backfill

- LazyCandidateCaseVectorIndex: defer ChromaDB creation until first tool call,
  so WorkerManagerFactory has time to register during startup
- Partial field update in MemoryCaseDao.upsert: only overwrite fields with
  meaningful values, merge metadata_json to prevent context loss
- Best-effort vector upsert: ChromaDB failure does not block MySQL write
- Lazy backfill during search: DB-only results get auto-reindexed
- Fix similar_search_with_scores() parameter names (query→text, +score_threshold)
- Tool description: clarify scope is for routing isolation, not business context
- 14 new tests covering DAO upsert, vector search, scope filtering, stress, lazy init

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, add case relations and feedback tracking

- Add search_with_scores to CandidateCaseVectorIndex for scored vector search
- Auto-populate similar_cases in metadata on upsert via vector similarity (Layer 1)
- Track search hits and feedback calls per conversation, inject unreviewed_cases
  hint into upsert response for feedback closure (Route C)
- Extract inject_memory_scope as single entry point for scope injection, shared
  by both MemoryCaseToolPack._make_caller and McpService.call_tool
- Fix source_conv_id missing for BAIZE agents: MCP path now injects conv_id
  from ContextVar before calling plugin
- Fix V2 AgentContext conv_id lookup: try conversation_id as fallback

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenDerisk is not a SaaS multi-tenant platform. tenant_id and team_id
were designed for multi-tenant isolation but have never been used in
practice—the ContextVar always injects "default", DB data shows zero
cases with meaningful tenant/team values, and LLM agents naturally use
tags/region/application_name in case_context instead.

Removing them simplifies scope to only app_code and environment (both
default=wildcard), eliminates dead SQL/vector filter code, and makes
room for future business-semantic scope dimensions (region, tags).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…earch

Replace single full-markdown vector similarity with a three-layer pipeline:

Layer 1 — Weighted section embedding: symptom(0.1) + diagnosis(0.5) +
root_cause(0.4) instead of one monolithic markdown_summary vector.
High root_cause weight prevents "textually similar, logically unrelated"
false associations.

Layer 2 — Structured cross-validation: new cross_validate_relation()
checks failure_layer, runtime, and related_services overlap before
accepting a candidate pair. Missing fields pass (no penalty).

Layer 3 — Relation type classification: _classify_relation() labels
each pair as same_root_cause / similar_diagnosis / surface_similar
instead of the single "similar" tag. struct_match flag exposed in result.

Also add search-result summarization (_to_summary) so memory_case_search
returns ~500-char lightweight summaries (symptom, diagnosis_preview 300
chars, root_cause, resolution, confidence, lifecycle, similar_count)
instead of injecting full markdown_summary + metadata into Agent context.
Agent calls memory_case_render with chosen case_ids for full content.

27/27 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… ranking, and lifecycle gating

Replace binary confidence bump with structured fb recording (global/by_app/by_app_env),
Wilson score interval for ranking, scope-aware feedback lookup with three-level fallback,
and time decay. DRAFT→ACCEPTED now requires confidence>=0.8 + feedback_count>=2 +
cross-session validation to prevent self-reinforcing loops. LLM-injected fb and
similar_cases are stripped during upsert.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…kdown renderer

Drop incident_title, hypotheses, actions, handling_path, effectiveness,
and source_session_id from MemoryCaseEntity and render_case_markdown().
These were replaced by the diagnosis free-form Markdown field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Align with diagnosis free-form Markdown, two-step search→render flow,
cross-case matching fields (failure_layer/runtime/related_services),
feedback lifecycle gating, and LLM injection protection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eted app

Add null check after self.get() in _resource_to_app_detail to prevent
AttributeError when a referenced app no longer exists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
niiish32x changed the title from "feat(memory_case): ext plugin, zero-intrusive scope; fix agent FC/skill; restore sync app list" to "feat: Case Memory (Memory Case) - cross-session fault-troubleshooting experience memory for SRE/AIOps Agents" on May 9, 2026

Labels

enhancement New feature or request
