
fix(memory): filter structural noise from graph entity extraction #1920

Merged
bug-ops merged 1 commit into main from 1912-graph-entity-extraction-noise
Mar 16, 2026

Conversation

@bug-ops
Owner

@bug-ops bug-ops commented Mar 16, 2026

Summary

Fixes #1912. The zeph_graph_entities Qdrant collection was being polluted with structural tokens extracted from tool result messages instead of meaningful semantic entities: TOML config keys, file paths, tool names such as read_file and wget, and generic terms such as go, type, and src/.

Root causes and fixes:

  • FIX-1: persist_message() now skips graph extraction entirely when the message contains ToolResult parts — tool outputs (TOML, JSON, command output) are structural data, not conversational content
  • FIX-2: The context window passed to the extraction LLM call now excludes Role::User messages with ToolResult parts
  • FIX-3: Added min_entity_name_bytes = 3 to MemoryWriteValidationConfig, enforced in both validate_graph_extraction and EntityResolver::resolve() via the MIN_ENTITY_NAME_BYTES constant. This rejects names shorter than 3 bytes, such as go and cd
  • FIX-4: Revised extraction prompt — entity types restricted to person, project, technology, organization, concept; explicit rules against extracting config keys, file paths, tool names, TOML/JSON keys, and short tokens
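For reviewers who want the shape of FIX-1 and FIX-2 at a glance, here is a minimal sketch of the filtering logic. It is not the code from this PR: `Role`, `Part`, `Message`, `should_extract_entities`, and `extraction_context` are simplified, hypothetical stand-ins, and the real zeph types and function signatures may differ.

```rust
// Hypothetical, simplified stand-ins for the real message types.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
enum Part {
    Text(String),
    ToolResult { tool: String, output: String },
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    parts: Vec<Part>,
}

impl Message {
    /// True when any part of the message is a tool result.
    fn has_tool_result(&self) -> bool {
        self.parts
            .iter()
            .any(|p| matches!(p, Part::ToolResult { .. }))
    }
}

/// FIX-1 (sketch): a message carrying tool output is not sent to graph extraction.
fn should_extract_entities(msg: &Message) -> bool {
    !msg.has_tool_result()
}

/// FIX-2 (sketch): user messages with tool results are dropped from the
/// context window handed to the extraction LLM call.
fn extraction_context(history: &[Message]) -> Vec<&Message> {
    history
        .iter()
        .filter(|m| !(m.role == Role::User && m.has_tool_result()))
        .collect()
}
```

In this sketch both checks share one predicate, so a message that is skipped for extraction also cannot reach the model as surrounding context.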

Tests: 3 new unit tests added (2569 → 6049 total pass after merge with main), covering:

  • context_filter_excludes_tool_result_messages
  • resolve_short_name_below_min_returns_error
  • resolve_name_at_min_length_passes
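The two resolve_* tests exercise the minimum-length boundary. A minimal sketch of that check follows, assuming the limit is compared against the UTF-8 byte length of the name. `check_entity_name` and `ResolveError` are hypothetical names for illustration and are not the actual EntityResolver API.

```rust
/// Assumed to mirror the constant named in this PR: smallest accepted name, in bytes.
const MIN_ENTITY_NAME_BYTES: usize = 3;

// Hypothetical error type; the real resolver's error may look different.
#[derive(Debug, PartialEq)]
enum ResolveError {
    NameTooShort { len: usize, min: usize },
}

/// Hypothetical stand-in for the length check inside EntityResolver::resolve().
fn check_entity_name(name: &str) -> Result<&str, ResolveError> {
    // str::len() returns the number of UTF-8 bytes, not characters.
    if name.len() < MIN_ENTITY_NAME_BYTES {
        return Err(ResolveError::NameTooShort {
            len: name.len(),
            min: MIN_ENTITY_NAME_BYTES,
        });
    }
    Ok(name)
}
```

Under these assumptions a two-byte name such as go is rejected and a three-byte name such as git passes. Because the comparison is in bytes, a single three-byte character (for example one CJK character) also passes.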

Test plan

  • cargo nextest run --config-file .github/nextest.toml -p zeph-memory -p zeph-core --lib passes
  • cargo clippy --workspace --features full -- -D warnings clean
  • cargo +nightly fmt --check clean
  • Live session: send a message referencing a config file, verify no config keys appear in zeph_graph_entities

@github-actions github-actions bot added the bug, size/L, documentation, memory, rust, and core labels and removed the bug label Mar 16, 2026
@bug-ops bug-ops enabled auto-merge (squash) March 16, 2026 16:25
@github-actions github-actions bot added the bug Something isn't working label Mar 16, 2026
Prevent TOML config keys, file paths, tool names, and short generic
tokens from polluting zeph_graph_entities (closes #1912).

- Skip graph extraction for Role::User messages containing ToolResult
  parts — tool outputs are structural data, not conversational content
- Exclude ToolResult user messages from the LLM extraction context window
- Add min_entity_name_bytes = 3 to MemoryWriteValidationConfig and
  enforce it in validate_graph_extraction and EntityResolver::resolve()
- Restrict extraction prompt entity types to person/project/technology/
  organization/concept; add explicit rules against structural tokens,
  config keys, file paths, and raw command output
@bug-ops bug-ops force-pushed the 1912-graph-entity-extraction-noise branch from b9bda10 to 62115f5 March 16, 2026 16:46
@bug-ops bug-ops merged commit 5160aaa into main Mar 16, 2026
20 checks passed
@bug-ops bug-ops deleted the 1912-graph-entity-extraction-noise branch March 16, 2026 16:56

Labels

bug, core, documentation, memory, rust, size/L

Development

Successfully merging this pull request may close these issues.

fix(memory): graph entity extraction populates zeph_graph_entities with structural noise instead of semantic facts