Add heuristic ingestion and community detection upgrades by KSemenenko · Pull Request #2 · managedcode/graphrag

KSemenenko · 2025-10-31T09:10:51Z

Summary

add heuristic maintenance configuration plus a new ingestion maintenance workflow that trims token budgets, deduplicates text units, and enhances relationships
enforce minimum chunk overlap and apply relationship/orphan heuristics during graph extraction
switch community detection to fast label propagation with configurable iterations and register the workflow in the indexing pipeline

Testing

/root/.dotnet/dotnet build GraphRag.slnx
/root/.dotnet/dotnet test GraphRag.slnx (fails: Docker endpoint unavailable in CI environment)
/root/.dotnet/dotnet format GraphRag.slnx

https://chatgpt.com/codex/tasks/task_e_6903ed1e62bc8326827b5c86e8f6f5de

Copilot

Pull Request Overview

This PR adds heuristic-based maintenance capabilities to the GraphRag indexing pipeline. It introduces intelligent text unit deduplication, token budget management, orphan entity linking, and relationship enhancement to improve graph quality and reduce redundancy during ingestion.

Introduces a new HeuristicMaintenanceWorkflow in the indexing pipeline
Adds semantic deduplication and token budget filtering for text units
Implements orphan entity linking and relationship enhancement using co-occurrence heuristics
Adds Fast Label Propagation as an alternative community detection algorithm
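
For readers unfamiliar with the technique, here is a minimal sketch of plain label propagation with an iteration cap. It is not the PR's `FastLabelPropagationCommunityDetector`; the class name, method signature, and tie-breaking rule are assumptions made for illustration, and the graph is assumed to be undirected with every neighbour also present as a key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: plain asynchronous label propagation, not the PR's detector.
public static class LabelPropagationSketch
{
    // adjacency maps each node to its neighbours (each edge listed in both directions).
    public static Dictionary<string, string> Detect(
        IReadOnlyDictionary<string, IReadOnlyList<string>> adjacency,
        int maxIterations)
    {
        // Every node starts in its own community, labelled with its own id.
        var labels = adjacency.Keys.ToDictionary(n => n, n => n, StringComparer.Ordinal);
        // A fixed visiting order keeps the result deterministic.
        var order = adjacency.Keys.OrderBy(n => n, StringComparer.Ordinal).ToList();

        for (var iteration = 0; iteration < maxIterations; iteration++)
        {
            var changed = false;
            foreach (var node in order)
            {
                var neighbours = adjacency[node];
                if (neighbours.Count == 0)
                {
                    continue; // isolated nodes keep their own label
                }

                // Adopt the most frequent neighbour label; ties go to the smallest label.
                var best = neighbours
                    .GroupBy(n => labels[n], StringComparer.Ordinal)
                    .OrderByDescending(g => g.Count())
                    .ThenBy(g => g.Key, StringComparer.Ordinal)
                    .First().Key;

                if (!string.Equals(best, labels[node], StringComparison.Ordinal))
                {
                    labels[node] = best;
                    changed = true;
                }
            }

            if (!changed)
            {
                break; // converged before reaching the iteration cap
            }
        }

        return labels;
    }
}
```

The iteration cap plays the role that the configurable max iterations setting plays in `ClusterGraphConfig`: it bounds the work even if labels keep changing.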

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.


File	Description
ServiceCollectionExtensions.cs	Registers the new HeuristicMaintenanceWorkflow in the DI container
HeuristicMaintenanceWorkflow.cs	New workflow that applies heuristics to text units between chunking and graph extraction
ExtractGraphWorkflow.cs	Integrates graph extraction heuristics to enhance relationships and link orphan entities
CreateBaseTextUnitsWorkflow.cs	Adjusts chunk overlap based on heuristic configuration
IndexingPipelineDefinitions.cs	Adds HeuristicMaintenanceWorkflow to the default pipeline sequence
TextUnitHeuristicProcessor.cs	Implements token budget filtering and semantic deduplication logic
GraphExtractionHeuristics.cs	Implements relationship enhancement and orphan entity linking
HeuristicMaintenanceConfig.cs	Configuration class defining heuristic parameters and defaults
GraphRagConfig.cs	Adds Heuristics property to main configuration
Enums.cs	Adds CommunityDetectionAlgorithm enum
ClusterGraphConfig.cs	Adds algorithm selection and max iterations configuration
FastLabelPropagationCommunityDetector.cs	Implements fast label propagation algorithm for community detection
CommunityBuilder.cs	Refactored to support multiple community detection algorithms


Copilot · 2025-10-31T09:15:25Z

src/ManagedCode.GraphRag/Indexing/Workflows/CreateBaseTextUnitsWorkflow.cs

            var chunkingConfig = config.Chunks;
+            var heuristicConfig = config.Heuristics ?? new HeuristicMaintenanceConfig();
+            if (heuristicConfig.MinimumChunkOverlap > 0 && chunkingConfig.Overlap < heuristicConfig.MinimumChunkOverlap)
+            {
+                chunkingConfig.Overlap = heuristicConfig.MinimumChunkOverlap;
+            }


Mutating the shared chunkingConfig object can cause side effects across multiple workflow invocations since config.Chunks may be reused. Consider creating a copy of the config before modification or using a local variable to hold the effective overlap value that gets applied in CreateEffectiveConfig.
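
A minimal sketch of the "local variable" option, assuming stand-in config types that model only the properties visible in the diff. `ChunkOverlap.Resolve` is a hypothetical helper, not something in the PR.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins for the real config classes; only the
// properties that appear in the diff above are modelled.
public sealed class ChunkingConfig
{
    public int Overlap { get; set; }
}

public sealed class HeuristicMaintenanceConfig
{
    public int MinimumChunkOverlap { get; set; }
}

public static class ChunkOverlap
{
    // Computes the overlap to use for this run without writing back
    // to the shared chunking config.
    public static int Resolve(ChunkingConfig chunks, HeuristicMaintenanceConfig? heuristics)
    {
        var minimum = heuristics?.MinimumChunkOverlap ?? 0;
        return minimum > 0 ? Math.Max(chunks.Overlap, minimum) : chunks.Overlap;
    }
}
```

The resolved value can then be passed to whatever builds the per-run chunking settings, so repeated workflow invocations always start from the same `config.Chunks`.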

Copilot · 2025-10-31T09:15:25Z

src/ManagedCode.GraphRag/Config/HeuristicMaintenanceConfig.cs

+namespace GraphRag.Config;
+
+/// <summary>
+/// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour.


Corrected spelling of 'behaviour' to 'behavior'.

Suggested change

-/// Represents heuristic controls that fine-tune ingestion and graph maintenance behaviour.
+/// Represents heuristic controls that fine-tune ingestion and graph maintenance behavior.

Copilot · 2025-10-31T09:15:26Z

src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs

+        public void Update(TextUnitRecord incoming)
+        {
+            var mergedDocuments = MergeLists(Record.DocumentIds, incoming.DocumentIds);
+            var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount);


When merging duplicate text units, taking the minimum token count may not accurately represent the merged record. Consider documenting why the minimum is chosen rather than the average or keeping the original record's token count, as this choice affects downstream token budget calculations.

Suggested change

-var tokenCount = Math.Min(Record.TokenCount, incoming.TokenCount);
+// Use the sum of token counts to more accurately represent the merged record's token budget.
+var tokenCount = Record.TokenCount + incoming.TokenCount;

Copilot · 2025-10-31T09:15:26Z

src/ManagedCode.GraphRag/Indexing/Heuristics/GraphExtractionHeuristics.cs

+                return incoming;
+            }
+
+            return incoming.Length < existing.Length ? incoming : existing;


The logic for selecting the shorter description when merging relationships lacks documentation explaining why shorter is preferred. This heuristic choice should be documented to explain the rationale (e.g., shorter descriptions are typically more concise or less likely to contain noise).
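
One way to address this is a small documented helper. The sketch below is hypothetical: the name `PickDescription`, the null handling, and the stated rationale are assumptions, since the diff shows only the final comparison and not the surrounding method.

```csharp
#nullable enable
using System;

// Hypothetical helper illustrating how the heuristic could be documented.
public static class DescriptionMerge
{
    /// <summary>
    /// Chooses which description to keep when two relationships are merged.
    /// Assumed rationale: when both descriptions are present, the shorter one
    /// is kept because it is more likely to be concise and free of noise.
    /// </summary>
    public static string PickDescription(string? existing, string? incoming)
    {
        if (string.IsNullOrWhiteSpace(existing))
        {
            return incoming ?? string.Empty;
        }

        if (string.IsNullOrWhiteSpace(incoming))
        {
            return existing;
        }

        // Both are present: prefer the strictly shorter text, keep the existing one on ties.
        return incoming.Length < existing.Length ? incoming : existing;
    }
}
```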

Copilot · 2025-10-31T09:15:26Z

src/ManagedCode.GraphRag/Community/CommunityBuilder.cs

+                foreach (var neighbor in orderedNeighbors)
+                {
+                    if (visited.Add(neighbor))
+                    {
+                        queue.Enqueue(neighbor);
+                    }


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Suggested change

-foreach (var neighbor in orderedNeighbors)
-{
-    if (visited.Add(neighbor))
-    {
-        queue.Enqueue(neighbor);
-    }
+foreach (var neighbor in orderedNeighbors.Where(neighbor => visited.Add(neighbor)))
+{
+    queue.Enqueue(neighbor);

Copilot · 2025-10-31T09:15:27Z

src/ManagedCode.GraphRag/Indexing/Heuristics/GraphExtractionHeuristics.cs

+            foreach (var textUnit in seed.TextUnitIds)
+            {
+                if (!string.IsNullOrWhiteSpace(textUnit))
+                {
+                    _textUnits.Add(textUnit);
+                }
+            }


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Copilot · 2025-10-31T09:15:27Z

src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs

+        foreach (var item in source)
+        {
+            if (seen.Add(item))
+            {
+                result.Add(item);
+            }
+        }


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Copilot · 2025-10-31T09:15:27Z

src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs

+            foreach (var item in first)
+            {
+                if (seen.Add(item))
+                {
+                    merged.Add(item);
+                }
+            }
+
+            foreach (var item in second)
+            {
+                if (seen.Add(item))
+                {
+                    merged.Add(item);
+                }
+            }
+


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Suggested change

-foreach (var item in first)
-{
-    if (seen.Add(item))
-    {
-        merged.Add(item);
-    }
-}
-foreach (var item in second)
-{
-    if (seen.Add(item))
-    {
-        merged.Add(item);
-    }
-}
+foreach (var item in first.Concat(second).Where(item => seen.Add(item)))
+{
+    merged.Add(item);
+}
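
As a self-contained illustration of the suggested pattern, the sketch below wraps it in a hypothetical `ListMerge.Merge` helper. The element type (`string`) and the ordinal comparer are assumptions; the PR's `MergeLists` may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: order-preserving union of two sequences.
public static class ListMerge
{
    public static List<string> Merge(IEnumerable<string> first, IEnumerable<string> second)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);

        // HashSet<T>.Add returns false for items already present, so the
        // predicate keeps only the first occurrence of each item.
        // ToList() enumerates immediately, so the side-effecting predicate
        // runs exactly once per item rather than on every later enumeration.
        return first.Concat(second)
            .Where(item => seen.Add(item))
            .ToList();
    }
}
```

Because the `Where` predicate mutates `seen`, the query should be materialised right away as it is here. If the lazy sequence were enumerated a second time, it would yield nothing.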

Copilot · 2025-10-31T09:15:28Z

src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs

+            foreach (var item in second)
+            {
+                if (seen.Add(item))
+                {
+                    merged.Add(item);
+                }
+            }


This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Copilot · 2025-10-31T09:15:28Z

src/ManagedCode.GraphRag/Indexing/Heuristics/TextUnitHeuristicProcessor.cs

+        catch (Exception ex)
+        {
+            logger?.LogWarning(ex, "Failed to execute semantic deduplication heuristics. Retaining filtered text units only.");
+            return filtered;
+        }


Generic catch clause.
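
One common way to narrow a catch-all like this while keeping the fallback is an exception filter that lets cancellation propagate. The sketch below is an assumption about what a fix could look like; `DedupFallback.RunOrFallback` and its delegate parameters are hypothetical and stand in for the real deduplication call and logger.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical wrapper showing a filtered catch with a fallback result.
public static class DedupFallback
{
    public static IReadOnlyList<T> RunOrFallback<T>(
        IReadOnlyList<T> filtered,
        Func<IReadOnlyList<T>, IReadOnlyList<T>> deduplicate,
        Action<Exception>? onFailure = null)
    {
        try
        {
            return deduplicate(filtered);
        }
        // The filter skips cancellation, so callers still observe it;
        // every other failure falls back to the filtered input.
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            onFailure?.Invoke(ex); // e.g. forward to a logger
            return filtered;
        }
    }
}
```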

Add heuristic ingestion and community detection upgrades

0109b41

Copilot AI review requested due to automatic review settings October 31, 2025 09:10

KSemenenko added the codex label Oct 31, 2025 — with ChatGPT Codex Connector

Copilot AI reviewed Oct 31, 2025


KSemenenko and others added 3 commits October 31, 2025 11:13

Add integration coverage for heuristic maintenance

67a96f0

Rewrite README with heuristic guidance

6c2a6ac

code reveiw fix

ef2f8e3

KSemenenko merged commit d481406 into main Oct 31, 2025
3 checks passed

KSemenenko deleted the codex/analyze-graphrag.net-features-and-gaps branch October 31, 2025 11:18

KSemenenko merged 4 commits into main from codex/analyze-graphrag.net-features-and-gaps
