Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Oct 29, 2025

Addresses critical bugs from PR #36 code review where set_batch_keys() return values were ignored and metadata dicts were shared between CodeChunk instances.

Core Issues

Return value ignored: CodeChunk.set_batch_keys() returns a new immutable instance, but 5 call sites discarded the return value, causing batch keys to never be set:

# Before - batch keys never set
chunk.set_batch_keys(batch_id)
deduplicated.append(chunk)

# After - properly captures immutable return
batch_keys = BatchKeys(id=batch_id, idx=idx)
deduplicated.append(chunk.set_batch_keys(batch_keys))

Type mismatch: Method expects BatchKeys (id, idx, sparse) but was passed raw UUID7.

Shared metadata: Shallow copy caused metadata dict references to be shared between instances, enabling race conditions:

# Before - metadata dict shared
return self.model_copy(update={...}, deep=False)

# After - explicit dict copy
metadata_copy = {k: v for k, v in self.metadata.items()} if self.metadata else None
return self.model_copy(update={..., "metadata": metadata_copy}, deep=False)

Changes

  • semantic.py: Fixed return value capture at 5 call sites, added BatchKeys creation with enumerated indices
  • chunks.py: Added explicit metadata dict copy in set_batch_keys() to prevent shared references
  • semantic.py: Removed redundant iteration in _finalize_chunks() (O(2n) → O(n))
  • semantic.py: Tightened type annotation batch_id: Anybatch_id: UUID7
  • Added test suite verifying immutability, metadata isolation, and batch key behavior

Impact

Batch keys are now actually set on chunks (previously never set). Metadata mutations are isolated per-instance (previously shared).

Original prompt

This section details on the original issue you should resolve

<issue_title>@claude please address the issues you identified in your review and make the suggested changes. Your review:</issue_title>
<issue_description>@claude please address the issues you identified in your review and make the suggested changes. Your review:

Claude Code Report

🚀 System Initialization

Available Tools: 20 tools loaded


I'll review PR #36 for bugs, security issues, and performance problems. Let me first fetch the pull request details.

Token usage: 18450 input, 3 output


🔧 Bash

Parameters:

{
  "command": "gh pr view 36 --json title,body,files,commits",
  "description": "Fetch PR knitli/codeweaver-mcp#36 details"
}

Result:

{"body":"Tests on `003-our-aim-to` branch were failing due to fixture configuration and model serialization issues. Fixed validation errors and pickle incompatibilities blocking test execution.\n\n## Changes\n\n### Test Fixtures\n- **Removed duplicate `governor` fixture** in `test_delimiter_basic.py` - now uses `chunk_governor` from `conftest.py` with proper `EmbeddingModelCapabilities` \n- **Added `CodeChunk.model_rebuild()`** in `conftest.py` with `EmbeddingBatchInfo` import to resolve forward references\n\n### Method Name Updates\n- **Replaced `set_batch_id()` → `set_batch_keys()`** in `semantic.py` (5 locations) - aligns with current `CodeChunk` API\n\n### Pickle Compatibility\n- **Changed `model_copy(deep=True)` → `model_copy(deep=False)`** in `CodeChunk.set_batch_keys()` - ast-grep `SgNode` objects in metadata cannot be pickled\n\n```python\n# Before: deep copy attempted to pickle unpicklable SgNode\nreturn self.model_copy(update={...}, deep=True)\n\n# After: shallow copy avoids pickling metadata\nreturn self.model_copy(update={...}, deep=False)\n```\n\n## Results\n- **37/43 unit tests passing** (was 0/43 before fixes)\n- All delimiter chunker tests passing (18/18)\n- All semantic error tests passing (11/11)\n- 5 remaining failures are edge cases unrelated to these fixes\n\n## Outstanding Issues\n- Integration tests timeout during collection (heavy imports suspected)\n- 5 edge case test failures need separate investigation\n\n> [!WARNING]\n>\n> <details>\n> <summary>Firewall rules blocked me from connecting to one or more addresses (expand for details)</summary>\n>\n> #### I tried to connect to the following addresses, but was blocked by firewall rules:\n>\n> - `astral.sh`\n>   - Triggering command: `curl -LsSf REDACTED` (dns block)\n>\n> If you need me to access, download, or install something from one of these locations, you can either:\n>\n> - Configure [Actions setup steps](https://gh.io/copilot/actions-setup-steps) to set up my environment, which run before the firewall is enabled\n> - Add the appropriate URLs or hosts to the custom allowlist in this repository's [Copilot coding agent settings](https://github.com/knitli/codeweaver-mcp/settings/copilot/coding_agent) (admins only)\n>\n> </details>\n\n<!-- START COPILOT CODING AGENT SUFFIX -->\n\n\n\n<details>\n\n<summary>Original prompt</summary>\n\n> Repo: knitli/codeweaver-mcp\n> Use Branch: 003-our-aim-to\n> \n> We merged in the last of the specs for this branch (specs/003-our-aim-to/spec.md) last night but the tests haven't ran. I want you to run the tests and assess the situation if it seems to much to resolve, then I want you to audit the implementation and provide an assessment of the situation and remaining issues or missing pieces, and of course document your test observations. Otherwise, try to resolve failed tests.\n\n\n</details>\n\n\n\n<!-- START COPILOT CODING AGENT TIPS -->\n---\n\n💡 You can make Copilot smarter by setting up custom instructions, customizing its develo...

Token usage: 18450 input, 3 output


Now let me examine the actual code changes in the PR:

Token usage: 20781 input, 1 output


🔧 Bash

Parameters:

{
  "command": "gh pr diff 36",
  "description": "Get diff for PR knitli/codeweaver-mcp#36"
}

Result:

diff --git a/coverage.xml b/coverage.xml
index 06d7219..fc3b750 100644
--- a/coverage.xml
+++ b/coverage.xml
@@ -1,20 +1,12 @@
 <?xml version="1.0" ?>
-
-<!--
-SPDX-FileCopyrightText: 2025 Knitli Inc.
-SPDX-FileContributor: Adam Poulemanos <adam@knit.li>
-
-SPDX-License-Identifier: MIT OR Apache-2.0
--->
-
-<coverage version="7.11.0" timestamp="1761706409191" lines-valid="14312" lines-covered="3568" line-rate="0.2493" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
+<coverage version="7.11.0" timestamp="1761735978313" lines-valid="14313" lines-covered="3420" line-rate="0.2389" branches-covered="0" branches-valid="0" branch-rate="0" complexity="0">
 	<!-- Generated by coverage.py: https://coverage.readthedocs.io/en/7.11.0 -->
 	<!-- Based on https://raw.githubusercontent.com/cobertur...

</details>

- Fixes knitli/codeweaver-mcp#37

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs.

Copilot AI and others added 2 commits October 29, 2025 12:14
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix issues with tests on 003-our-aim-to branch Fix set_batch_keys() return value handling and metadata isolation Oct 29, 2025
Copilot AI requested a review from bashandbone October 29, 2025 12:21
@bashandbone bashandbone marked this pull request as ready for review October 29, 2025 12:22
Copilot AI review requested due to automatic review settings October 29, 2025 12:22
@bashandbone bashandbone merged commit 1808f18 into 003-our-aim-to Oct 29, 2025
6 checks passed
@bashandbone bashandbone deleted the copilot/fix-003-our-aim-to-issues branch October 29, 2025 12:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the set_batch_keys method to return a new CodeChunk instance instead of mutating the existing one, ensuring immutability. The change addresses a metadata sharing issue where instances were inadvertently sharing references to the same metadata dictionary.

Key changes:

  • set_batch_keys now returns a new CodeChunk with an explicit metadata copy to prevent shared references
  • Updated all callers in semantic.py to use the returned instance instead of discarding it
  • Added comprehensive tests validating immutability, metadata isolation, and batch key deduplication behavior

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/codeweaver/core/chunks.py Enhanced set_batch_keys to explicitly copy metadata dict when creating new instances
src/codeweaver/engine/chunker/semantic.py Updated _deduplicate_chunks to properly use returned instances and removed redundant batch key setting in _finalize_chunks
tests/unit/core/init.py Added test module initialization file
tests/unit/core/test_chunk_batch_keys.py Added comprehensive test suite for batch keys functionality
Comments suppressed due to low confidence (4)

src/codeweaver/core/chunks.py:261

        return registry[self.chunk_id].dense if self.chunk_id in registry else None

src/codeweaver/core/chunks.py:269

        return registry[self.chunk_id].sparse if self.chunk_id in registry else None

src/codeweaver/core/chunks.py:402

        return registry[self.chunk_id].dense if self.chunk_id in registry else None

src/codeweaver/engine/chunker/semantic.py:402

  • Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
    def _parse_file(self, content: str, file_path: Path | None) -> FileThing[SgRoot]:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants