fix(memory): filter structural noise from graph entity extraction #1920
Merged
Prevent TOML config keys, file paths, tool names, and short generic tokens from polluting `zeph_graph_entities` (closes #1912).

- Skip graph extraction for `Role::User` messages containing `ToolResult` parts, since tool outputs are structural data, not conversational content
- Exclude `ToolResult` user messages from the LLM extraction context window
- Add `min_entity_name_bytes = 3` to `MemoryWriteValidationConfig` and enforce it in `validate_graph_extraction` and `EntityResolver::resolve()`
- Restrict extraction prompt entity types to person/project/technology/organization/concept; add explicit rules against structural tokens, config keys, file paths, and raw command output
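The first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Role`, `MessagePart`, `Message`, `is_tool_result_message`, and `extraction_context` are hypothetical stand-ins, since the real zeph types are not shown here.

```rust
// Hypothetical stand-ins for the real message types (assumed shapes).
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
enum MessagePart {
    Text(String),
    ToolResult(String),
}

#[derive(Debug, Clone)]
struct Message {
    role: Role,
    parts: Vec<MessagePart>,
}

/// True for user-role messages that carry at least one tool result part.
/// Such messages hold structural data (TOML, JSON, command output).
fn is_tool_result_message(msg: &Message) -> bool {
    msg.role == Role::User
        && msg
            .parts
            .iter()
            .any(|part| matches!(part, MessagePart::ToolResult(_)))
}

/// Builds the context handed to the extractor, leaving out tool result messages.
fn extraction_context(history: &[Message]) -> Vec<&Message> {
    history
        .iter()
        .filter(|msg| !is_tool_result_message(msg))
        .collect()
}
```

In this sketch the same predicate serves both purposes: a caller can return early before extraction when `is_tool_result_message` is true, and the context builder reuses it as a filter.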
b9bda10 to
62115f5
Summary
Fixes #1912. The `zeph_graph_entities` Qdrant collection was being polluted with structural tokens (TOML config keys, file paths, tool names like `read_file`, `wget`, generic terms like `go`, `type`, `src/`) extracted from tool result messages rather than meaningful semantic entities.

Root causes and fixes:

- `persist_message()` now skips graph extraction entirely when the message contains `ToolResult` parts: tool outputs (TOML, JSON, command output) are structural data, not conversational content
- The LLM extraction context window now excludes `Role::User` messages with `ToolResult` parts
- Added `min_entity_name_bytes = 3` to `MemoryWriteValidationConfig`, enforced in both `validate_graph_extraction` and `EntityResolver::resolve()` via the `MIN_ENTITY_NAME_BYTES` constant; rejects tokens like `go`, `cd`, `type`
- The extraction prompt restricts entity types to `person`, `project`, `technology`, `organization`, `concept`, with explicit rules against extracting config keys, file paths, tool names, TOML/JSON keys, and short tokens

Tests: 3 new unit tests added (2569 → 6049 total pass after merge with main), covering:

- `context_filter_excludes_tool_result_messages`
- `resolve_short_name_below_min_returns_error`
- `resolve_name_at_min_length_passes`

Test plan

- `cargo nextest run --config-file .github/nextest.toml -p zeph-memory -p zeph-core --lib` passes
- `cargo clippy --workspace --features full -- -D warnings` clean
- `cargo +nightly fmt --check` clean
- `zeph_graph_entities`
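For reference, the minimum-length rule described under "Root causes and fixes" might look roughly like the sketch below. Only the `MIN_ENTITY_NAME_BYTES` name and its value of 3 come from this PR; `ResolveError`, `validate_entity_name`, and the whitespace trimming are assumptions made for illustration.

```rust
/// Minimum entity name length in bytes (value taken from the PR description).
const MIN_ENTITY_NAME_BYTES: usize = 3;

/// Hypothetical error type; the real resolver's error is not shown in this PR text.
#[derive(Debug, PartialEq)]
enum ResolveError {
    NameTooShort { len: usize, min: usize },
}

/// Rejects names shorter than the minimum byte length.
/// Trimming surrounding whitespace first is an assumption of this sketch.
fn validate_entity_name(name: &str) -> Result<&str, ResolveError> {
    let trimmed = name.trim();
    // `str::len` returns the length in bytes, not in characters.
    if trimmed.len() < MIN_ENTITY_NAME_BYTES {
        return Err(ResolveError::NameTooShort {
            len: trimmed.len(),
            min: MIN_ENTITY_NAME_BYTES,
        });
    }
    Ok(trimmed)
}
```

Because the check counts bytes, a two-character non-ASCII name such as `日本` (6 bytes in UTF-8) passes, while two-letter ASCII tokens such as `go` and `cd` are rejected and a three-byte name such as `git` sits exactly at the limit.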