Add heuristic ingestion and community detection upgrades#2
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds heuristic-based maintenance capabilities to the GraphRag indexing pipeline. It introduces intelligent text unit deduplication, token budget management, orphan entity linking, and relationship enhancement to improve graph quality and reduce redundancy during ingestion.
- Introduces a new
HeuristicMaintenanceWorkflowin the indexing pipeline - Adds semantic deduplication and token budget filtering for text units
- Implements orphan entity linking and relationship enhancement using co-occurrence heuristics
- Adds Fast Label Propagation as an alternative community detection algorithm
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| ServiceCollectionExtensions.cs | Registers the new HeuristicMaintenanceWorkflow in the DI container |
| HeuristicMaintenanceWorkflow.cs | New workflow that applies heuristics to text units between chunking and graph extraction |
| ExtractGraphWorkflow.cs | Integrates graph extraction heuristics to enhance relationships and link orphan entities |
| CreateBaseTextUnitsWorkflow.cs | Adjusts chunk overlap based on heuristic configuration |
| IndexingPipelineDefinitions.cs | Adds HeuristicMaintenanceWorkflow to the default pipeline sequence |
| TextUnitHeuristicProcessor.cs | Implements token budget filtering and semantic deduplication logic |
| GraphExtractionHeuristics.cs | Implements relationship enhancement and orphan entity linking |
| HeuristicMaintenanceConfig.cs | Configuration class defining heuristic parameters and defaults |
| GraphRagConfig.cs | Adds Heuristics property to main configuration |
| Enums.cs | Adds CommunityDetectionAlgorithm enum |
| ClusterGraphConfig.cs | Adds algorithm selection and max iterations configuration |
| FastLabelPropagationCommunityDetector.cs | Implements fast label propagation algorithm for community detection |
| CommunityBuilder.cs | Refactored to support multiple community detection algorithms |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var chunkingConfig = config.Chunks; | ||
| var heuristicConfig = config.Heuristics ?? new HeuristicMaintenanceConfig(); | ||
| if (heuristicConfig.MinimumChunkOverlap > 0 && chunkingConfig.Overlap < heuristicConfig.MinimumChunkOverlap) | ||
| { | ||
| chunkingConfig.Overlap = heuristicConfig.MinimumChunkOverlap; | ||
| } |
There was a problem hiding this comment.
Mutating the shared chunkingConfig object can cause side effects across multiple workflow invocations since config.Chunks may be reused. Consider creating a copy of the config before modification or using a local variable to hold the effective overlap value that gets applied in CreateEffectiveConfig.
| namespace GraphRag.Config; | ||
|
|
||
| /// <summary> | ||
| /// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour. |
There was a problem hiding this comment.
Corrected spelling of 'behaviour' to 'behavior'.
| /// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour. | |
| /// Represents heuristic controls that fine-tune ingestion and graph maintenance behavior. |
| public void Update(TextUnitRecord incoming) | ||
| { | ||
| var mergedDocuments = MergeLists(Record.DocumentIds, incoming.DocumentIds); | ||
| var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount); |
There was a problem hiding this comment.
When merging duplicate text units, taking the minimum token count may not accurately represent the merged record. Consider documenting why the minimum is chosen rather than the average or keeping the original record's token count, as this choice affects downstream token budget calculations.
| var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount); | |
| // Use the sum of token counts to more accurately represent the merged record's token budget. | |
| var tokenCount = Record.TokenCount + incoming.TokenCount; |
| return incoming; | ||
| } | ||
|
|
||
| return incoming.Length < existing.Length ? incoming : existing; |
There was a problem hiding this comment.
The logic for selecting the shorter description when merging relationships lacks documentation explaining why shorter is preferred. This heuristic choice should be documented to explain the rationale (e.g., shorter descriptions are typically more concise or less likely to contain noise).
| foreach (var neighbor in orderedNeighbors) | ||
| { | ||
| if (visited.Add(neighbor)) | ||
| { | ||
| queue.Enqueue(neighbor); | ||
| } |
There was a problem hiding this comment.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var neighbor in orderedNeighbors) | |
| { | |
| if (visited.Add(neighbor)) | |
| { | |
| queue.Enqueue(neighbor); | |
| } | |
| foreach (var neighbor in orderedNeighbors.Where(neighbor => visited.Add(neighbor))) | |
| { | |
| queue.Enqueue(neighbor); |
| foreach (var textUnit in seed.TextUnitIds) | ||
| { | ||
| if (!string.IsNullOrWhiteSpace(textUnit)) | ||
| { | ||
| _textUnits.Add(textUnit); | ||
| } | ||
| } |
There was a problem hiding this comment.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var item in source) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| result.Add(item); | ||
| } | ||
| } |
There was a problem hiding this comment.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var item in first) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| merged.Add(item); | ||
| } | ||
| } | ||
|
|
||
| foreach (var item in second) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| merged.Add(item); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| foreach (var item in first) | |
| { | |
| if (seen.Add(item)) | |
| { | |
| merged.Add(item); | |
| } | |
| } | |
| foreach (var item in second) | |
| { | |
| if (seen.Add(item)) | |
| { | |
| merged.Add(item); | |
| } | |
| } | |
| foreach (var item in first.Concat(second).Where(item => seen.Add(item))) | |
| { | |
| merged.Add(item); | |
| } |
| foreach (var item in second) | ||
| { | ||
| if (seen.Add(item)) | ||
| { | ||
| merged.Add(item); | ||
| } | ||
| } |
There was a problem hiding this comment.
This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
| catch (Exception ex) | ||
| { | ||
| logger?.LogWarning(ex, "Failed to execute semantic deduplication heuristics. Retaining filtered text units only."); | ||
| return filtered; | ||
| } |
There was a problem hiding this comment.
Generic catch clause.
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_6903ed1e62bc8326827b5c86e8f6f5de