feat: Case Memory (Memory Case) - cross-session fault-troubleshooting experience memory for SRE/AIOps Agents #194

Open
niiish32x wants to merge 14 commits into derisk-ai:main from niiish32x:feat/nsh_memory

Conversation

niiish32x (Contributor) commented May 7, 2026

feat: Case Memory (Memory Case) - cross-session fault-troubleshooting experience memory for SRE/AIOps Agents

Summary

Gives OpenDerisk Agents cross-session memory of fault cases. Before troubleshooting, the Agent automatically retrieves similar historical cases for inspiration; afterwards, it writes its conclusions back to accumulate experience. Technically it is a hybrid of MySQL FULLTEXT and ChromaDB vector semantic retrieval, exposed to the Agent as 4 LLM-callable tools through a built-in MCP plugin, and integrated into any Agent implementation in a zero-intrusion, pluggable way.

In one sentence

"Before each troubleshooting run, glance at the similar pitfalls of the past; when done, write the experience back, so the next person who hits the problem takes fewer detours."

Core design

1. Zero-intrusion pluggable architecture

The Agent layer never imports derisk-ext; decoupling is achieved in three layers:

  • ResourceResolver: at startup, derisk-ext registers the tool(memory_case) → MemoryCaseToolPack mapping
  • ConversationScopeHooks: core defines the hook protocol, ext implements it, serve wires it up. The Agent only calls bind_conversation_scope_for_agent(app_code, conv_id)
  • MCP Service: dependency injection assembles the DAO, vector index, and configuration
graph TB
    subgraph derisk_core["derisk-core (Agent layer)"]
        Agent["BAIZE / ReAct / Core v2 Agent"]
        Hooks["ConversationScopeHooks<br/>(hook protocol definition)"]
        Resolver["ResourceResolver<br/>(registration entry)"]
    end

    subgraph derisk_ext["derisk-ext (plugin implementation)"]
        ToolPack["MemoryCaseToolPack<br/>registers 4 tools + scope injection"]
        Service["MemoryCasePluginService<br/>search / upsert / feedback / render"]
        DAO["MemoryCaseDao<br/>(SQLAlchemy)"]
        Vector["CandidateCaseVectorIndex<br/>(ChromaDB, lazy init)"]
    end

    subgraph derisk_serve["derisk-serve (assembly layer)"]
        MCP["McpService.init_app()<br/>dependency injection + registration"]
        Bind["bind_memory_case_scope_for_agent()"]
    end

    Agent -->|"bind_conversation_scope_for_agent(app_code, conv_id)"| Hooks
    Hooks -.->|"iterate registered hooks"| Bind
    Bind -->|"ContextVar.set(scope)"| ToolPack
    Agent -->|"tool_call"| Resolver
    Resolver -->|"tool(memory_case)"| ToolPack
    ToolPack --> Service
    Service --> DAO
    Service --> Vector
    MCP --> Service

    style derisk_core fill:#e1f5fe,stroke:#0288d1
    style derisk_ext fill:#fff3e0,stroke:#f57c00
    style derisk_serve fill:#e8f5e9,stroke:#388e3c

Key point: from derisk_ext.plugin.memory_case import ... never appears in Agent code; replacing or uninstalling the plugin is completely transparent to the Agent.

2. Hybrid retrieval: DB full-text + vector semantics

graph TD
    Query["memory_case_search('Pod OOM JVM heap memory')"]
    
    Query --> DB["Step 1: MySQL FULLTEXT<br/>5-column index"]
    Query --> Vec["Step 2: ChromaDB vector search"]
    
    DB --> DBok{"success?"}
    DBok -->|"yes"| DBResult["MATCH ... AGAINST<br/>NATURAL LANGUAGE MODE"]
    DBok -->|"no (err 1191)"| Fallback["LIKE '%query%' fallback"]
    
    Vec --> VecOk{"ChromaDB available?"}
    VecOk -->|"yes"| VecResult["embedding(query) →<br/>semantic similarity search"]
    VecOk -->|"no"| Degraded["degraded: true<br/>main path not blocked"]
    
    DBResult --> Merge["Step 3: merge + dedupe"]
    Fallback --> Merge
    VecResult --> Merge
    Degraded --> Merge
    
    Merge --> Rank["Wilson-score ranking<br/>+ scope awareness<br/>+ time decay"]
    Rank --> Filter["filter: REJECTED<br/>+ confidence < 0.5"]
    Filter --> Summary["return lightweight summaries<br/>~500 chars each"]
    
    Summary --> Backfill["lazy backfill: cases hit in DB<br/>but missing from the vector index<br/>are auto-added to ChromaDB"]

    style DB fill:#e3f2fd,stroke:#1565c0
    style Vec fill:#fce4ec,stroke:#c62828
    style Merge fill:#fff3e0,stroke:#ef6c00
    style Summary fill:#e8f5e9,stroke:#2e7d32

Where vectors sit in the system: DB full-text retrieval is the main path; vectors are a semantic supplement. If the vector store is unavailable, the service degrades without interruption. This inverts the vector-first pattern of generic RAG systems: in SRE scenarios, exact keyword matches (OOM, GC, thread pool) are often more reliable than semantic similarity.
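The merge-and-dedupe step (Step 3 in the diagram) can be sketched like this. The data shapes and function name are illustrative assumptions, not the plugin's actual API; a real ranker would apply the Wilson score and time decay described later, which plain score ordering stands in for here.

```python
# Sketch of merging FULLTEXT and vector hits; illustrative names and shapes.
from typing import Dict, List

def merge_results(db_hits: List[dict], vec_hits: List[dict]) -> List[dict]:
    """Union both hit lists, dedupe by case_id, keep the best score per case."""
    merged: Dict[str, dict] = {}
    for hit in db_hits + vec_hits:
        prev = merged.get(hit["case_id"])
        if prev is None or hit["score"] > prev["score"]:
            merged[hit["case_id"]] = hit
    # Stand-in for the real ranking (Wilson score + scope + time decay).
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)

db_hits = [{"case_id": "c1", "score": 0.7}, {"case_id": "c2", "score": 0.4}]
vec_hits = [{"case_id": "c2", "score": 0.9}]   # semantic hit on the same case
ranked = merge_results(db_hits, vec_hits)
print([h["case_id"] for h in ranked])  # ['c2', 'c1']
```

Note that either input list may be empty (FULLTEXT error or degraded vector store) and the merge still works, which is what makes the degradation non-blocking.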

3. Two-step interaction: Search for summary discovery → Render for on-demand depth

memory_case_search returns only ~500-character summaries (without the full markdown_summary and diagnosis). The Agent skims them, picks the relevant cases, then loads the full Markdown by case_id with memory_case_render.

Rationale: returning 5 full cases is roughly 12,500 characters, which easily drowns the Agent in text. With summaries, the initial screen takes only ~2,500 characters, after which render digs into 1-2 cases on demand (~6,500 characters). render automatically attaches the related similar_cases, forming a three-step "discover → deep dive → expand relations" information flow.

Search summary fields (_to_summary()):

| Field | Description |
| --- | --- |
| symptom_summary | Full; the Agent uses it to judge whether this is the same class of fault |
| diagnosis_preview | First 300 chars; enough to judge whether the troubleshooting approach is relevant |
| diagnosis_len | Number; tells the Agent how large the full text is |
| root_cause | Full; a one-liner with the highest signal density |
| resolution | Full; the Agent judges whether the fix is reusable |
| confidence + lifecycle | Trustworthiness signals |
| feedback_h / feedback_u / feedback_cv_count | Feedback stats; help judge how reliable the experience is |
| similar_count | Number; shows whether the case is isolated or well-connected |
| markdown_summary | ❌ Dropped (the largest chunk); loaded on demand via render |
| full diagnosis | ❌ Dropped; only the 300-char preview is given |
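The summarization the table describes can be sketched as follows. Field names follow the table, but this body is an illustration, not the plugin's actual _to_summary() implementation.

```python
# Sketch of building a lightweight search summary; illustrative implementation.
def to_summary(case: dict, preview_len: int = 300) -> dict:
    diagnosis = case.get("diagnosis", "")
    return {
        "symptom_summary": case.get("symptom_summary", ""),  # kept in full
        "diagnosis_preview": diagnosis[:preview_len],        # first 300 chars
        "diagnosis_len": len(diagnosis),                     # so Agent knows size
        "root_cause": case.get("root_cause", ""),            # kept in full
        "resolution": case.get("resolution", ""),            # kept in full
        "confidence": case.get("confidence"),
        "lifecycle": case.get("lifecycle"),
        "similar_count": len(case.get("similar_cases", [])),
        # markdown_summary is deliberately dropped; render loads it on demand.
    }

case = {"symptom_summary": "Pod OOMKilled", "diagnosis": "x" * 1200,
        "root_cause": "heap too small", "resolution": "raise -Xmx",
        "confidence": 0.8, "lifecycle": "ACCEPTED", "similar_cases": ["c9"],
        "markdown_summary": "y" * 5000}
s = to_summary(case)
print(s["diagnosis_len"], len(s["diagnosis_preview"]))  # 1200 300
```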

4. Three-layer automatic case-relation linking

When a case is written, semantically similar cases are discovered automatically, with layered filters to keep cases that are textually similar but follow different troubleshooting paths from being linked by mistake:

graph TD
    NewCase["newly written case<br/>triggered after upsert()"]
    
    NewCase --> L1["Layer 1: weighted per-section vector screening ✅"]
    
    L1 --> L1S["symptom section (weight 0.1)<br/>separate ChromaDB search"]
    L1 --> L1D["diagnosis section (weight 0.5)<br/>separate ChromaDB search"]
    L1 --> L1R["root_cause section (weight 0.4)<br/>separate ChromaDB search"]
    
    L1S --> Merge["weighted score merge<br/>symptom×0.1 + diagnosis×0.5 + root_cause×0.4"]
    L1D --> Merge
    L1R --> Merge
    
    Merge --> Filter1{"weighted >= 0.6?"}
    Filter1 -->|"no"| Drop["discard"]
    Filter1 -->|"yes"| L2
    
    L2["Layer 2: structured cross-validation ✅"]
    L2 --> Check{"failure_layer matches?<br/>runtime matches?<br/>related_services overlap?"}
    
    Check -->|"all match"| StructOK["struct_match: true"]
    Check -->|"fields missing"| StructPass["struct_match: false<br/>(missing fields carry no penalty)"]
    Check -->|"explicit conflict"| StructFail["struct_match: false<br/>blocks false association"]
    
    StructOK --> L3
    StructPass --> L3
    StructFail --> MarkSurface["marked: surface_similar<br/>textually alike, different troubleshooting path"]
    
    L3["Layer 3: relation type classification"]
    L3 --> RootHigh{"root_cause >= 0.8<br/>AND struct_match?"}
    RootHigh -->|"yes"| SameRoot["same_root_cause"]
    RootHigh -->|"no"| DiagHigh{"diagnosis >= 0.7<br/>AND struct_match?"}
    DiagHigh -->|"yes"| SimDiag["similar_diagnosis"]
    DiagHigh -->|"no"| SurfSim["surface_similar"]

    style L1 fill:#e3f2fd,stroke:#1565c0
    style L2 fill:#fff3e0,stroke:#ef6c00
    style L3 fill:#f3e5f5,stroke:#7b1fa2
    style SameRoot fill:#c8e6c9,stroke:#2e7d32
    style SimDiag fill:#c8e6c9,stroke:#2e7d32
    style MarkSurface fill:#ffcdd2,stroke:#c62828
    style SurfSim fill:#ffcdd2,stroke:#c62828

Concrete example: why is Layer 2 structural validation needed?

case-A: order-svc Pod OOMKilled
  → JVM heap configured too small at 512M → raise -Xmx → resolved
  failure_layer: jvm, runtime: java, related_services: [order-svc]

case-B: trade-svc Pod OOMKilled
  → downstream risk-engine memory leak, requests piling up → upgrade risk-engine → resolved
  failure_layer: application, runtime: go, related_services: [risk-engine]

Layer 1 vector score: 0.89 ("Pod / OOMKilled / JVM / memory / GC" recur in both)
Layer 2 cross-validation: failure_layer differs (jvm ≠ application) + runtime differs (java ≠ go)
                 + related_services disjoint → struct_match: false
Final verdict: surface_similar (textually alike, completely different troubleshooting paths)
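The three layers can be condensed into a short sketch. The weights and thresholds come from the diagram above; the function bodies are illustrative, not the actual cross_validate_relation() / _classify_relation() code.

```python
# Condensed sketch of Layers 1-3; weights/thresholds from the design above.
def weighted_score(sym: float, diag: float, root: float) -> float:
    return 0.1 * sym + 0.5 * diag + 0.4 * root

def struct_match(a: dict, b: dict) -> bool:
    """Explicit conflicts block the pair; missing fields carry no penalty."""
    for key in ("failure_layer", "runtime"):
        if a.get(key) and b.get(key) and a[key] != b[key]:
            return False
    sa = set(a.get("related_services", []))
    sb = set(b.get("related_services", []))
    if sa and sb and not (sa & sb):
        return False
    return True

def classify(sym: float, diag: float, root: float, a: dict, b: dict) -> str:
    if weighted_score(sym, diag, root) < 0.6:
        return "dropped"                       # Layer 1 screen
    if not struct_match(a, b):
        return "surface_similar"               # Layer 2 blocks false links
    if root >= 0.8:
        return "same_root_cause"               # Layer 3 classification
    if diag >= 0.7:
        return "similar_diagnosis"
    return "surface_similar"

# case-A vs case-B from the example: text is alike, structure conflicts.
a = {"failure_layer": "jvm", "runtime": "java", "related_services": ["order-svc"]}
b = {"failure_layer": "application", "runtime": "go", "related_services": ["risk-engine"]}
print(classify(0.9, 0.85, 0.9, a, b))  # surface_similar
```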

5. Structured feedback + Wilson-score ranking

Feedback is recorded in structured form across three scopes:

"fb": {
  "global":           {"h": 8, "u": 2, "ts": "...", "cv": ["conv-a", "conv-c"]},
  "by_app":           {"order-svc": {"h": 6, "u": 0, "ts": "..."}},
  "by_app_env":       {"order-svc:production": {"h": 5, "u": 0, "ts": "..."}}
}

Ranking formula

rank_score = base × exp(-λ × days)

base = weight × wilson_score(h, total) + (1-weight) × confidence
       ↑                                       ↑
   empirical score (more samples, more trust)   LLM prior (backstop when feedback is scarce)

weight = min(1.0, total / 10)

Wilson-score effect: the smaller the sample, the more conservative the lower bound:

| helpful/total | naive ratio | Wilson lower bound | meaning |
| --- | --- | --- | --- |
| 2/2 | 1.00 | 0.34 | sample too small, not trusted |
| 8/10 | 0.80 | 0.49 | contested, pushed down |
| 15/15 | 1.00 | 0.82 | heavily validated, high confidence |
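The formula can be sketched directly. wilson_score below is the standard Wilson lower bound at z = 1.96; the table's exact figures may use slightly different rounding or a different z, so treat the outputs as approximate. The λ default is an illustrative placeholder.

```python
# Sketch of the ranking formula; standard Wilson lower bound, z = 1.96.
import math

def wilson_score(helpful: int, total: int, z: float = 1.96) -> float:
    if total == 0:
        return 0.0
    p = helpful / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

def rank_score(h: int, total: int, confidence: float,
               days: float, lam: float = 0.01) -> float:
    weight = min(1.0, total / 10)                  # trust grows with samples
    base = weight * wilson_score(h, total) + (1 - weight) * confidence
    return base * math.exp(-lam * days)            # time decay

print(round(wilson_score(2, 2), 2), round(wilson_score(8, 10), 2))  # 0.34 0.49
```

The interesting property is visible in the first line: 2/2 helpful looks perfect as a raw ratio but scores a conservative 0.34, below the contested 8/10.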

Scope-aware three-level fallback: the same case is automatically ranked differently in different contexts.

graph LR
    Search["Agent search<br/>app_code=order-svc<br/>env=production"]
    
    Search --> L1["look up fb.by_app_env<br/>['order-svc:production']"]
    L1 --> L1Check{"total >= 3?"}
    L1Check -->|"yes"| Use1["use this layer's Wilson score"]
    L1Check -->|"not enough"| L2
    
    L2["fall back: fb.by_app<br/>['order-svc']"]
    L2 --> L2Check{"total >= 3?"}
    L2Check -->|"yes"| Use2["use this layer's Wilson score"]
    L2Check -->|"not enough"| L3
    
    L3["fall back: fb.global"]
    L3 --> L3Check{"total > 0?"}
    L3Check -->|"yes"| Use3["use this layer's Wilson score"]
    L3Check -->|"none"| Fallback["backstop: LLM confidence prior"]

    style Use1 fill:#c8e6c9,stroke:#2e7d32
    style Use2 fill:#fff9c4,stroke:#f9a825
    style Use3 fill:#fff3e0,stroke:#ef6c00
    style Fallback fill:#ffcdd2,stroke:#c62828
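The fallback chain in the diagram can be sketched as one small function. The fb structure matches the JSON shown earlier; the function name and the MIN_SAMPLES constant are illustrative assumptions.

```python
# Sketch of the three-level feedback fallback; illustrative helper.
MIN_SAMPLES = 3

def pick_feedback(fb: dict, app_code: str, env: str):
    """Return (helpful, total) from the most specific layer with enough data,
    or None so the caller falls back to the LLM confidence prior."""
    layers = [
        fb.get("by_app_env", {}).get(f"{app_code}:{env}"),
        fb.get("by_app", {}).get(app_code),
        fb.get("global"),
    ]
    for layer in layers:
        if layer:
            total = layer.get("h", 0) + layer.get("u", 0)
            # Specific layers need >= 3 samples; global accepts any nonzero.
            if total >= MIN_SAMPLES or (layer is layers[-1] and total > 0):
                return layer.get("h", 0), total
    return None

fb = {"global": {"h": 8, "u": 2},
      "by_app": {"order-svc": {"h": 6, "u": 0}},
      "by_app_env": {"order-svc:production": {"h": 1, "u": 0}}}
# The env layer has only 1 sample, so lookup falls through to by_app.
print(pick_feedback(fb, "order-svc", "production"))  # (6, 6)
```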

DRAFT → ACCEPTED gate

All of the following must hold:
  confidence >= 0.8
  AND fb.global.h >= 2          (at least 2 helpful votes)
  AND fb.global.cv contains a conversation other than source_conv_id  (independent cross-session validation)

This prevents the Agent from producing and endorsing a case within the same troubleshooting run. Same-conversation feedback only adjusts confidence and never changes lifecycle.
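The gate conditions above amount to a short predicate. Field names follow the fb JSON shown earlier; the function itself is an illustrative sketch, not the plugin's code.

```python
# Sketch of the DRAFT -> ACCEPTED promotion gate; illustrative predicate.
def can_promote(case: dict) -> bool:
    g = case.get("fb", {}).get("global", {})
    # At least one validating conversation must differ from the source
    # conversation, i.e. an independent cross-session confirmation.
    cross_session = any(cv != case.get("source_conv_id")
                        for cv in g.get("cv", []))
    return (case.get("confidence", 0) >= 0.8
            and g.get("h", 0) >= 2
            and cross_session)

case = {"confidence": 0.85, "source_conv_id": "conv-a",
        "fb": {"global": {"h": 3, "u": 0, "cv": ["conv-a", "conv-c"]}}}
print(can_promote(case))  # True: conv-c is an independent validation
```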


6. Zero-intrusion scope injection

The Agent is entirely unaware of scope logic. Three priority levels:

1. scope explicitly passed by the LLM (highest priority)
2. scope held in a ContextVar (the app_code / conv_id bound when the Agent was built, injected automatically)
3. fallback "default" (wildcard, no filtering)

inject_memory_scope() is the single entry point for scope injection, shared by the MCP service path and the ToolPack path, so the logic is never duplicated.

On write, app_code / environment from the scope are automatically injected into case_context and conv_id into source_conv_id; the LLM never fills these in manually.
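The three-level priority can be sketched as follows. inject_memory_scope is the name used above, but this body (and the ContextVar variable) is illustrative, not the plugin's actual implementation.

```python
# Sketch of three-level scope priority; illustrative implementation.
from contextvars import ContextVar

_bound_scope: ContextVar[dict] = ContextVar("bound_scope", default={})

def inject_memory_scope(tool_args: dict) -> dict:
    explicit = tool_args.get("scope") or {}   # 1) explicit LLM value wins
    bound = _bound_scope.get()                # 2) else the bound ContextVar
    scope = {
        # 3) else "default", which acts as a wildcard (no filtering)
        "app_code": explicit.get("app_code") or bound.get("app_code") or "default",
        "environment": explicit.get("environment") or bound.get("environment") or "default",
    }
    return {**tool_args, "scope": scope}

_bound_scope.set({"app_code": "order-svc"})
args = inject_memory_scope({"query": "Pod OOM"})
print(args["scope"])  # {'app_code': 'order-svc', 'environment': 'default'}
```

Because both the MCP service path and the ToolPack path call this one function, a scope-resolution change never has to be made twice.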

File layout

packages/derisk-ext/src/derisk_ext/plugin/memory_case/
├── __init__.py             # public API exports
├── models.py               # CandidateCase, lifecycle, FeedbackStats, wilson_score
├── service.py              # core service: search / upsert / feedback / render
├── sqlalchemy_dao.py       # MySQL/SQLite persistence (FULLTEXT + LIKE fallback)
├── vector_index.py         # ChromaDB vector index (lazy init + graceful degradation)
├── case_context.py         # scope filtering + structured cross-validation
├── markdown.py             # Markdown rendering
├── tool_pack.py            # Agent ToolPack adapter + scope injection
├── integration.py          # ResourceResolver registration
├── scope_binding.py        # ContextVar scope binding
├── plugin_resolver.py      # plugin resolver
├── resource_resolve.py     # resource resolution
├── dao_protocol.py         # DAO abstraction protocol
├── tests/
│   ├── test_memory_case_service.py  # core service tests (38)
│   ├── test_dao_upsert.py           # DAO tests
│   └── test_vector_search.py        # vector search tests
└── skill/memory_case_agent/
    └── SKILL.md             # Agent usage skill (v2.0)

# Cross-package integration points
packages/derisk-core/        # ConversationScopeHooks, ResourceResolver
packages/derisk-serve/       # MCP Service, Agent scope binding
packages/derisk-app/         # feature plugin assembly

nishenghao.nsh and others added 6 commits May 5, 2026 16:42
## Severity

In Function Calling mode the Agent falls into an infinite loop: the LLM keeps repeating the same tool call and cannot break out on its own. User requests hang indefinitely, the only escape is a manual Ctrl+C, and the wasted LLM calls burn tokens and time. This had reached must-fix status.

## Root causes

1. **DoomLoopDetector's block is ineffective**: in react_master_agent, when the doom-loop detector catches a repeated call it returns an ActionOutput with terminate=False and have_retry=True, so the generate_reply loop continues, the LLM re-issues the same call, and a "meta doom loop" forms
2. **core/agent/base.py has no loop detection at all**: the ProductionAgent path has no doom-loop protection and relies only on max_retry_count (default 3) as a backstop, but ReActMasterAgent sets it to 300, which is effectively unlimited
3. **The core_v2 DoomLoopDetector's ActionResult carries no terminate signal**: ReActReasoningAgent.act() returns ActionResult(success=False), but the enhanced_agent.py main loop never checks the doom-loop metadata and keeps executing
4. **Counting by tool name causes false positives**: five consecutive read calls on five different files get flagged as a loop; counting should key on a tool_name+args hash, so only identical tool + identical arguments count

## Fixes

1. **ReActMasterAgent**: set terminate=True and have_retry=False when the doom loop trips, so the generate_reply loop exits immediately; add _last_tool_name/_last_tool_args to track the last executed tool
2. **core_v2 enhanced_agent.py**: add hash-based detection of consecutive identical calls to the main loop, force-terminating when the same tool + same arguments exceed the threshold (5); also honor the doom_loop metadata terminate signal on ActionResult
3. **core_v2 react_reasoning_agent.py**: the doom-loop ActionResult now carries metadata={"doom_loop": True, "terminate": True}
4. **core/agent/base.py (ProductionAgent)**: add hash-based detection of consecutive identical calls
## Detection algorithm

The detection key is tool_name + args hash (consistent with the existing DoomLoopDetector); only consecutive calls with the same tool and the same arguments trigger it. This avoids false positives on legitimate runs of different-argument calls (such as reading 5 different files in a row).
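The keying scheme can be sketched as a small guard class. The threshold value mirrors the commit message; the class name and structure are illustrative, not the repository's actual detector.

```python
# Sketch of tool_name + args-hash repeat detection; illustrative guard class.
import hashlib
import json

class RepeatCallGuard:
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self._last_key = None
        self._count = 0

    def should_terminate(self, tool_name: str, args: dict) -> bool:
        # Same tool + same args -> same key; any change resets the count,
        # so reading 5 different files never trips the guard.
        key = hashlib.sha256(
            (tool_name + json.dumps(args, sort_keys=True)).encode()
        ).hexdigest()
        self._count = self._count + 1 if key == self._last_key else 1
        self._last_key = key
        return self._count > self.threshold

guard = RepeatCallGuard()
distinct = [guard.should_terminate("read", {"path": f"f{i}.py"}) for i in range(5)]
repeated = [guard.should_terminate("read", {"path": "same.py"}) for _ in range(6)]
print(any(distinct), repeated[-1])  # False True
```

json.dumps with sort_keys=True makes the hash insensitive to argument ordering, which a naive str(args) key would not be.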

## Impact on existing flows

- Normal tool-call flows are unaffected: calls with different arguments each count as 1 and never trigger
- Existing DoomLoopDetector logic is unchanged; this fix only completes the terminate signal
- The threshold is 5 (looser than DoomLoopDetector's 3), giving the agent reasonable headroom
- Existing exit mechanisms such as max_retry_count are unaffected

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Severity

In Function Calling mode, after the Agent finishes its tool calls, the final output is an empty string. The VIS layer receives an empty message, so the frontend's "impact analysis" and "root cause analysis" panels render blank: users watch the Agent run for 72 seconds and get no conclusions at all. A must-fix for production.

## Root cause

In the Function Calling loop, each round sets reply_message.content = llm_out.content, but when the LLM returns tool_calls the content is empty. When the loop exits because max_retry_count is reached, a doom loop terminates it, or for any other reason, reply_message.content is still an empty string. The VIS layer looks up the final message by output_message_id, finds content empty, and the output message is "" → the frontend shows empty analysis results.

## Fixes

1. **Force an LLM summary**: when the FC loop exits with an empty reply_message.content, make one extra LLM call (without tool definitions) to force a final analysis summary. The prompt states explicitly: "Based on the tool execution results above, please provide a comprehensive analysis and summary. Do NOT call any more tools - provide your final answer directly."
2. **Action-report fallback**: if the forced LLM call fails, stitch content together from the action_report as a backstop
3. **Block BlankAction termination right after a SKILL.md read**: track _last_tool_name and _last_tool_args; when BlankAction has terminate=True but the previous round read SKILL.md, block the termination and continue the loop (the code-level guard for the "skill does not continue executing" problem)

## Impact on existing flows

- The normal flow (LLM ends with plain text) is unaffected: reply_message.content has a value, so no extra LLM call is made
- The forced summary triggers only when content is empty, and is guarded by try-except
- _last_tool_name tracking records only non-blank tool calls and does not touch BlankAction logic
- The SKILL.md read guard fires only when terminate=True and the previous round was a view/read of SKILL.md; all other scenarios are untouched

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Severity

After reading a skill file, the Agent outputs a summary of the skill's content and stops, instead of following the skill's methodology and continuing to call tools. Users see "read the skill" rather than "ran the analysis": the Agent treats "read the skill" as "job done". This is a fundamental usability problem for the skill system; left unfixed, the skill-loading mechanism is effectively useless.

## Root causes

1. **Missing constraint in the prompt**: the skills.md.j2 template only says "read SKILL.md with view, internalize the instructions and apply them immediately", but never states that reading is a preliminary step, not task completion
2. **Missing constraint in the sandbox prompt**: AGENT_SKILL_SYSTEM_PROMPT likewise never stresses that execution must continue after the read
3. **BlankAction terminates unconditionally**: after reading SKILL.md, the LLM returns a plain-text summary (no tool_calls), triggering BlankAction(terminate=True), and the generate_reply loop exits outright. (The code-level guard was added in the previous commit)

## Fixes

1. **shared/skills.md.j2**: add a "key constraints" section stating explicitly:
   - reading SKILL.md is a preparatory step, not task completion
   - after reading a skill, the Agent must continue making tool calls following the skill's methodology and toolchain
   - outputting a summary or conclusion directly after reading a skill is forbidden
2. **react_master_agent/skills.md.j2**: add the same constraints
3. **sandbox/prompt.py AGENT_SKILL_SYSTEM_PROMPT**: add the "reading is not completion" constraint

## Impact on existing flows

- The prompt change only adds constraining instructions; no tool-call logic is modified
- Agent flows that use no skills are unaffected
- Skill loading and parsing are unaffected
- The code-level guard (blocking BlankAction termination after a SKILL.md read, from the previous commit) acts as a second line of defense: even if the LLM ignores the prompt constraint, premature termination is still blocked

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Design intent:
- Case memory (memory_case) ships as a built-in MCP plugin, surfaced through the standard MCP list API
- Users select or deselect it in the UI on demand, making it pluggable and replacing the old auto-injection mode
- A virtual MCP entry (no DB row) is injected into the list API; the frontend distinguishes Built-in from External MCP
- A backend resource-normalization layer ensures mcp(derisk):memory_case → tool(memory_case) routes correctly

Core changes:

Backend - MCP service layer:
- mcp/service/service.py: filter_list_page injects the virtual memory_case entry,
  _is_builtin_memory_mcp() accepts both mcp_code and the display name,
  connect_mcp/list_tools/call_tool all use the helper to identify the built-in MCP
- mcp/api/schemas.py: ServeRequest/ServerResponse gain an is_builtin virtual field
- mcp/models/models.py: from_request strips is_builtin, to_response defaults to False
- mcp/config.py: memory_plugin_enabled controls the plugin infrastructure (service startup + virtual entry visibility)
- mcp/memory_case/: complete MemoryCasePluginService implementation (4 tools + vector index + DB persistence)

Backend - Agent resource routing:
- agent/agents/chat/agent_chat.py: remove the ext_cfg.memory_plugin_enabled auto-injection,
  add mcp(derisk):memory_case → tool(memory_case) normalization,
  redirect memory_case in chat_in_params
- agent/core_v2_adapter.py: likewise remove auto-injection, add normalization
- agent/resource/tool/memory_case.py: MemoryCaseToolPack implementation
- derisk-core/agent/core_v2/agent_binding.py: tool(memory_case) resource-type resolution

Frontend - visual selection in the UI:
- tab-skills.tsx: grouped MCP view (built-in/external); memory_case typed by mcp_code
- connectors-modal.tsx: purple badge for Built-in MCP
- home-chat.tsx: McpChip built-in/external styling
- chat/page.tsx: initMessage.mcps routes the tool(memory_case) sub_type
- unified-chat-input.tsx: preserve the tool(memory_case) parameter across messages

Docs:
- docs/MEMORY_CASE_MCP_PLUGIN.md: full design doc; onboarding changed to visual selection in the UI

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Problem

GET /api/v1/app/resources/list?type=app, while building the dynamic parameters for App resources, synchronously calls AppManager.get_derisks() -> AppService.sync_app_list(). The app-list query was migrated to async (async_app_list/app_list), but the synchronous entry point sync_app_list is missing, raising AttributeError: Service has no sync_app_list.

## Impact if left unfixed

- Every fetch of the type=app resource list logs an ERROR in the background, with the stack pointing at app_agent_manage / building app Service.

- App-type resource dropdowns and parameter discovery cannot get the real app list: the controller catches the exception and falls back to an empty list, so the UI shows no (or missing) selectable apps, breaking app binding and orchestration configuration.

- get_app incorrectly awaits the synchronous method sync_app_detail; called on an async path this raises TypeError (awaiting a non-awaitable).

## Changes

- building/app/service/service.py: add sync_app_list with filtering and pagination semantics aligned with async_app_list, using a synchronous ORM session for resource-discovery code that can only run in a sync context.

- agent/agents/app_agent_manage.py: get_app now returns sync_app_detail(app_code) directly, dropping the invalid await.

Co-authored-by: Cursor <cursoragent@cursor.com>
…earch

Case memory is consolidated under derisk_ext.plugin.memory_case:
- MemoryCasePluginService (MCP tools: search/upsert/feedback/render), DAO protocol,
  SQLAlchemy MemoryCaseDao on derisk_plugin_memory_case, optional Chroma vector index.
- Routing and provenance live only in metadata_json.case_context (no app_code/env
  table columns). Search narrows via JSON_EXTRACT; app_code/environment omitted or
  'default' act as wildcards; lexical query uses InnoDB FULLTEXT ft_memory_case_nl
  (8 columns incl. hypotheses/actions) with LIKE fallback on MySQL 1191.
- ResourceResolver registration + MemoryCaseToolPack; skill memory-case-agent.

Zero-intrusive wiring for core/ serve:
- derisk-core conversation_scope_hooks: bind_conversation_scope_for_agent; ext
  integration registers bind_memory_case_scope_for_agent once.
- agent_chat / core_v2_adapter call only core hook (no derisk_ext import).
- chat_in_params maps generic tool(...) to AgentResource for built-in tools.

Also: derisk-serve MCP builtin wiring, mcp_utils in core+serve for in-process tools,
vis converter guard for tests, docs (MEMORY_CASE_MCP_PLUGIN, CASE_MEMORY_AGENT_PLAYBOOK),
assets/schema/derisk.sql, optional derisk-serve[memory_case] + uv.lock.

Excludes: static web bundles (not committed).
Co-authored-by: Cursor <cursoragent@cursor.com>
github-actions bot added the enhancement (New feature or request) label May 7, 2026
nishenghao.nsh and others added 8 commits May 8, 2026 11:32
… auto backfill

- LazyCandidateCaseVectorIndex: defer ChromaDB creation until first tool call,
  so WorkerManagerFactory has time to register during startup
- Partial field update in MemoryCaseDao.upsert: only overwrite fields with
  meaningful values, merge metadata_json to prevent context loss
- Best-effort vector upsert: ChromaDB failure does not block MySQL write
- Lazy backfill during search: DB-only results get auto-reindexed
- Fix similar_search_with_scores() parameter names (query→text, +score_threshold)
- Tool description: clarify scope is for routing isolation, not business context
- 14 new tests covering DAO upsert, vector search, scope filtering, stress, lazy init

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, add case relations and feedback tracking

- Add search_with_scores to CandidateCaseVectorIndex for scored vector search
- Auto-populate similar_cases in metadata on upsert via vector similarity (Layer 1)
- Track search hits and feedback calls per conversation, inject unreviewed_cases
  hint into upsert response for feedback closure (Route C)
- Extract inject_memory_scope as single entry point for scope injection, shared
  by both MemoryCaseToolPack._make_caller and McpService.call_tool
- Fix source_conv_id missing for BAIZE agents: MCP path now injects conv_id
  from ContextVar before calling plugin
- Fix V2 AgentContext conv_id lookup: try conversation_id as fallback

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenDerisk is not a SaaS multi-tenant platform. tenant_id and team_id
were designed for multi-tenant isolation but have never been used in
practice—the ContextVar always injects "default", DB data shows zero
cases with meaningful tenant/team values, and LLM agents naturally use
tags/region/application_name in case_context instead.

Removing them simplifies scope to only app_code and environment (both
default=wildcard), eliminates dead SQL/vector filter code, and makes
room for future business-semantic scope dimensions (region, tags).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…earch

Replace single full-markdown vector similarity with a three-layer pipeline:

Layer 1 — Weighted section embedding: symptom(0.1) + diagnosis(0.5) +
root_cause(0.4) instead of one monolithic markdown_summary vector.
High root_cause weight prevents "textually similar, logically unrelated"
false associations.

Layer 2 — Structured cross-validation: new cross_validate_relation()
checks failure_layer, runtime, and related_services overlap before
accepting a candidate pair. Missing fields pass (no penalty).

Layer 3 — Relation type classification: _classify_relation() labels
each pair as same_root_cause / similar_diagnosis / surface_similar
instead of the single "similar" tag. struct_match flag exposed in result.

Also add search-result summarization (_to_summary) so memory_case_search
returns ~500-char lightweight summaries (symptom, diagnosis_preview 300
chars, root_cause, resolution, confidence, lifecycle, similar_count)
instead of injecting full markdown_summary + metadata into Agent context.
Agent calls memory_case_render with chosen case_ids for full content.

27/27 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… ranking, and lifecycle gating

Replace binary confidence bump with structured fb recording (global/by_app/by_app_env),
Wilson score interval for ranking, scope-aware feedback lookup with three-level fallback,
and time decay. DRAFT→ACCEPTED now requires confidence>=0.8 + feedback_count>=2 +
cross-session validation to prevent self-reinforcing loops. LLM-injected fb and
similar_cases are stripped during upsert.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…kdown renderer

Drop incident_title, hypotheses, actions, handling_path, effectiveness,
and source_session_id from MemoryCaseEntity and render_case_markdown().
These were replaced by the diagnosis free-form Markdown field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Align with diagnosis free-form Markdown, two-step search→render flow,
cross-case matching fields (failure_layer/runtime/related_services),
feedback lifecycle gating, and LLM injection protection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eted app

Add null check after self.get() in _resource_to_app_detail to prevent
AttributeError when a referenced app no longer exists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
niiish32x changed the title from "feat(memory_case): ext plugin, zero-intrusive scope; fix agent FC/skill; restore sync app list" to "feat: Case Memory (Memory Case) - cross-session fault-troubleshooting experience memory for SRE/AIOps Agents" on May 9, 2026

Labels

enhancement New feature or request
