Fix set_batch_keys() return value handling and metadata isolation #38

Copilot · 2025-10-29T12:05:45Z

Addresses critical bugs from PR #36 code review where set_batch_keys() return values were ignored and metadata dicts were shared between CodeChunk instances.

Core Issues

Return value ignored: CodeChunk.set_batch_keys() returns a new immutable instance, but 5 call sites discarded the return value, causing batch keys to never be set:

# Before - batch keys never set
chunk.set_batch_keys(batch_id)
deduplicated.append(chunk)

# After - properly captures immutable return
batch_keys = BatchKeys(id=batch_id, idx=idx)
deduplicated.append(chunk.set_batch_keys(batch_keys))

Type mismatch: Method expects BatchKeys (id, idx, sparse) but was passed raw UUID7.

Shared metadata: Shallow copy caused metadata dict references to be shared between instances, enabling race conditions:

# Before - metadata dict shared
return self.model_copy(update={...}, deep=False)

# After - explicit dict copy
metadata_copy = {k: v for k, v in self.metadata.items()} if self.metadata else None
return self.model_copy(update={..., "metadata": metadata_copy}, deep=False)

Changes

semantic.py: Fixed return value capture at 5 call sites, added BatchKeys creation with enumerated indices
chunks.py: Added explicit metadata dict copy in set_batch_keys() to prevent shared references
semantic.py: Removed redundant iteration in _finalize_chunks() (O(2n) → O(n))
semantic.py: Tightened type annotation batch_id: Any → batch_id: UUID7
Added test suite verifying immutability, metadata isolation, and batch key behavior

Impact

Batch keys are now actually set on chunks (previously never set). Metadata mutations are isolated per-instance (previously shared).

Original prompt

This section details on the original issue you should resolve

<issue_title>@claude please address the issues you identified in your review and make the suggested changes. Your review:</issue_title>
<issue_description>@claude please address the issues you identified in your review and make the suggested changes. Your review:

Claude Code Report

🚀 System Initialization

Available Tools: 20 tools loaded

I'll review PR #36 for bugs, security issues, and performance problems. Let me first fetch the pull request details.

Token usage: 18450 input, 3 output

🔧 `Bash`

Parameters:

{
  "command": "gh pr view 36 --json title,body,files,commits",
  "description": "Fetch PR knitli/codeweaver-mcp#36 details"
}

Result:

{"body":"Tests on `003-our-aim-to` branch were failing due to fixture configuration and model serialization issues. Fixed validation errors and pickle incompatibilities blocking test execution.\n\n## Changes\n\n### Test Fixtures\n- **Removed duplicate `governor` fixture** in `test_delimiter_basic.py` - now uses `chunk_governor` from `conftest.py` with proper `EmbeddingModelCapabilities` \n- **Added `CodeChunk.model_rebuild()`** in `conftest.py` with `EmbeddingBatchInfo` import to resolve forward references\n\n### Method Name Updates\n- **Replaced `set_batch_id()` → `set_batch_keys()`** in `semantic.py` (5 locations) - aligns with current `CodeChunk` API\n\n### Pickle Compatibility\n- **Changed `model_copy(deep=True)` → `model_copy(deep=False)`** in `CodeChunk.set_batch_keys()` - ast-grep `SgNode` objects in metadata cannot be pickled\n\n```python\n# Before: deep copy attempted to pickle unpicklable SgNode\nreturn self.model_copy(update={...}, deep=True)\n\n# After: shallow copy avoids pickling metadata\nreturn self.model_copy(update={...}, deep=False)\n```\n\n## Results\n- **37/43 unit tests passing** (was 0/43 before fixes)\n- All delimiter chunker tests passing (18/18)\n- All semantic error tests passing (11/11)\n- 5 remaining failures are edge cases unrelated to these fixes\n\n## Outstanding Issues\n- Integration tests timeout during collection (heavy imports suspected)\n- 5 edge case test failures need separate investigation\n\n> [!WARNING]\n>\n> <details>\n> <summary>Firewall rules blocked me from connecting to one or more addresses (expand for details)</summary>\n>\n> #### I tried to connect to the following addresses, but was blocked by firewall rules:\n>\n> - `astral.sh`\n>   - Triggering command: `curl -LsSf REDACTED` (dns block)\n>\n> If you need me to access, download, or install something from one of these locations, you can either:\n>\n> - Configure [Actions setup steps](https://gh.io/copilot/actions-setup-steps) to set up my environment, which run before the firewall is enabled\n> - Add the appropriate URLs or hosts to the custom allowlist in this repository's [Copilot coding agent settings](https://github.com/knitli/codeweaver-mcp/settings/copilot/coding_agent) (admins only)\n>\n> </details>\n\n<!-- START COPILOT CODING AGENT SUFFIX -->\n\n\n\n<details>\n\n<summary>Original prompt</summary>\n\n> Repo: knitli/codeweaver-mcp\n> Use Branch: 003-our-aim-to\n> \n> We merged in the last of the specs for this branch (specs/003-our-aim-to/spec.md) last night but the tests haven't ran. I want you to run the tests and assess the situation if it seems to much to resolve, then I want you to audit the implementation and provide an assessment of the situation and remaining issues or missing pieces, and of course document your test observations. Otherwise, try to resolve failed tests.\n\n\n</details>\n\n\n\n<!-- START COPILOT CODING AGENT TIPS -->\n---\n\n💡 You can make Copilot smarter by setting up custom instructions, customizing its develo...

Token usage: 18450 input, 3 output

Now let me examine the actual code changes in the PR:

Token usage: 20781 input, 1 output

🔧 `Bash`

Parameters:

{
  "command": "gh pr diff 36",
  "description": "Get diff for PR knitli/codeweaver-mcp#36"
}

Result:

diff --git a/coverage.xml b/coverage.xml
index 06d7219..fc3b750 100644
--- a/coverage.xml
+++ b/coverage.xml
@@ -1,20 +1,12 @@
 <?xml version="1.0" ?>
-
-<!--
-SPDX-FileCopyrightText: 2025 Knitli Inc.
-SPDX-FileContributor: Adam Poulemanos <adam@knit.li>
-
-SPDX-License-Identifier: MIT OR Apache-2.0
--->
-
-<coverage version="7.11.0" timestamp="1761706409191" lines-valid="14312" lines-covered="3568" line-rate="0.2493" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
+<coverage version="7.11.0" timestamp="1761735978313" lines-valid="14313" lines-covered="3420" line-rate="0.2389" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
 	<!-- Generated by coverage.py: https://coverage.readthedocs.io/en/7.11.0 -->
 	<!-- Based on https://raw.githubusercontent.com/cobertur...

</details>

- Fixes knitli/codeweaver-mcp#37

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs.

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

Copilot

Pull Request Overview

This PR refactors the set_batch_keys method to return a new CodeChunk instance instead of mutating the existing one, ensuring immutability. The change addresses a metadata sharing issue where instances were inadvertently sharing references to the same metadata dictionary.

Key changes:

set_batch_keys now returns a new CodeChunk with an explicit metadata copy to prevent shared references
Updated all callers in semantic.py to use the returned instance instead of discarding it
Added comprehensive tests validating immutability, metadata isolation, and batch key deduplication behavior

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
src/codeweaver/core/chunks.py	Enhanced `set_batch_keys` to explicitly copy metadata dict when creating new instances
src/codeweaver/engine/chunker/semantic.py	Updated `_deduplicate_chunks` to properly use returned instances and removed redundant batch key setting in `_finalize_chunks`
tests/unit/core/init.py	Added test module initialization file
tests/unit/core/test_chunk_batch_keys.py	Added comprehensive test suite for batch keys functionality

Comments suppressed due to low confidence (4)

src/codeweaver/core/chunks.py:261

This test may raise an Exception as the target may be of non-container class EmbeddingRegistry.

        return registry[self.chunk_id].dense if self.chunk_id in registry else None

src/codeweaver/core/chunks.py:269

This test may raise an Exception as the target may be of non-container class EmbeddingRegistry.

        return registry[self.chunk_id].sparse if self.chunk_id in registry else None

src/codeweaver/core/chunks.py:402

This test may raise an Exception as the target may be of non-container class EmbeddingRegistry.

        return registry[self.chunk_id].dense if self.chunk_id in registry else None

src/codeweaver/engine/chunker/semantic.py:402

Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.

    def _parse_file(self, content: str, file_path: Path | None) -> FileThing[SgRoot]:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Initial plan

bc9e539

Copilot AI assigned Copilot and bashandbone Oct 29, 2025

Copilot started work on behalf of bashandbone October 29, 2025 12:06 View session

Copilot AI and others added 2 commits October 29, 2025 12:14

Fix set_batch_keys return value handling and metadata isolation

78e346c

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

Clean up extra blank line in _finalize_chunks

fc5f5a1

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix issues with tests on 003-our-aim-to branch~~ Fix set_batch_keys() return value handling and metadata isolation Oct 29, 2025

Copilot AI requested a review from bashandbone October 29, 2025 12:21

Copilot finished work on behalf of bashandbone October 29, 2025 12:21

bashandbone marked this pull request as ready for review October 29, 2025 12:22

Copilot AI review requested due to automatic review settings October 29, 2025 12:22

bashandbone merged commit 1808f18 into 003-our-aim-to Oct 29, 2025
6 checks passed

bashandbone deleted the copilot/fix-003-our-aim-to-issues branch October 29, 2025 12:22

Copilot AI reviewed Oct 29, 2025

View reviewed changes

bashandbone mentioned this pull request Oct 30, 2025

@claude please address the issues you identified in your review and make the suggested changes. Your review: #37

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix set_batch_keys() return value handling and metadata isolation #38

Fix set_batch_keys() return value handling and metadata isolation #38

Uh oh!

Copilot AI commented Oct 29, 2025 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix set_batch_keys() return value handling and metadata isolation #38

Fix set_batch_keys() return value handling and metadata isolation #38

Uh oh!

Conversation

Copilot AI commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Core Issues

Changes

Impact

Claude Code Report

🚀 System Initialization

🔧 Bash

🔧 Bash

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Oct 29, 2025 •

edited

Loading

🔧 `Bash`

🔧 `Bash`