Conversation

@bashandbone (Contributor) commented Dec 3, 2025

This pull request introduces several improvements and fixes across configuration, embedding reconciliation, and health reporting in the codebase. The main highlights are the addition of automatic reconciliation for missing embeddings during indexing and enhancements to the health endpoint's uptime calculation.

Embedding and Indexing Improvements:

  • Added automatic reconciliation in prime_index to detect and add missing dense and sparse embeddings to files after batch indexing, improving data consistency and reducing manual intervention.
  • Enhanced add_missing_embeddings_to_existing_chunks to check for the presence of dense and sparse vectors before adding new embeddings, preventing redundant computation and storage (a sketch of the presence check follows this list).
  • Updated the embedding provider base to tag batch keys with a sparse indicator, aiding in correct embedding management.
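
A rough sketch of the presence check described above. This is illustrative only: the helper name is invented, and the vector naming convention (empty string for the unnamed dense vector, "sparse" for the sparse one) mirrors the test fixtures quoted later in this thread.

```python
from typing import Any


def missing_vector_kinds(point_vector: Any) -> set[str]:
    """Return which vector kinds ('dense', 'sparse') a stored point still needs."""
    if isinstance(point_vector, dict):
        existing = set(point_vector.keys())
    elif point_vector:
        # A bare sequence is treated as the single unnamed dense vector.
        existing = {""}
    else:
        existing = set()
    needed: set[str] = set()
    if "" not in existing:
        needed.add("dense")
    if "sparse" not in existing:
        needed.add("sparse")
    return needed
```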

Health Endpoint Enhancements:

  • Refactored the HealthResponse model to compute uptime_seconds dynamically using a monotonic clock, ensuring accurate uptime reporting and reducing error-prone manual timestamp management; this corrected an error that had prevented the health endpoint from responding (a minimal sketch follows).
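
A minimal sketch of the computed-field pattern, assuming pydantic v2; all names are illustrative. Note that where the monotonic start is seeded matters: a module-level value (as below) measures process uptime, while a per-instance default would measure only the age of a single response.

```python
import time

from pydantic import BaseModel, Field, computed_field

_SERVER_START_NS = time.monotonic_ns()  # seeded once, at process start


class HealthSketch(BaseModel):
    status: str
    # Excluded from serialized output; defaults to the shared process start time.
    stopwatch_start: int = Field(default=_SERVER_START_NS, exclude=True)

    @computed_field
    @property
    def uptime_seconds(self) -> int:
        """Whole seconds elapsed since stopwatch_start, via the monotonic clock."""
        return (time.monotonic_ns() - self.stopwatch_start) // 1_000_000_000
```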

Configuration Updates:

  • Clarified and separated management server and MCP server configuration blocks in codeweaver.toml, ensuring correct server settings and reducing potential misconfiguration.

Miscellaneous Improvements:

  • Improved formatting and clarity in server initialization and test code for better readability and maintainability.

Summary by Sourcery

Introduce automatic reconciliation of missing embeddings during indexing, improve health endpoint uptime reporting, refine provider metadata, and clarify server configuration.

New Features:

  • Automatically reconcile missing dense and sparse embeddings in the indexer after batch indexing based on current embedding configuration.
  • Tag embedding batches with a sparse indicator in the provider base to distinguish sparse embeddings in downstream processing.

Bug Fixes:

  • Ensure the health endpoint computes uptime dynamically from a monotonic clock, fixing failures caused by manual timestamp and uptime handling.
  • Prevent redundant regeneration and storage of embeddings by checking existing vectors before adding dense or sparse embeddings in the vector store.

Enhancements:

  • Improve server initialization formatting and telemetry metadata resolution for better readability and observability.
  • Refine the health response model to compute uptime as a derived field while keeping creation timestamps simple and consistent.

Tests:

  • Add unit tests covering embedding reconciliation logic and its automatic invocation during index priming, including various existing-vector scenarios.

Chores:

  • Clarify and separate management and MCP server configuration in the default configuration file to reduce misconfiguration risk.

…lated tests, fix issue preventing sparse embedding storage
Copilot AI review requested due to automatic review settings December 3, 2025 15:31
sourcery-ai bot (Contributor) commented Dec 3, 2025

Reviewer's Guide

Implements automatic post-indexing embedding reconciliation in the indexer, refines how missing dense/sparse vectors are detected and filled without duplication, switches health uptime reporting to a monotonic, computed field, clarifies management vs MCP server configuration, and adds focused tests around reconciliation plus minor formatting improvements.

Sequence diagram for automatic embedding reconciliation in prime_index

sequenceDiagram
    participant Indexer
    participant FileManifest
    participant VectorStore
    participant EmbeddingProvider
    participant SparseProvider

    Indexer->>Indexer: _perform_batch_indexing_async(files_to_index)
    Note over Indexer: After batch indexing completes

    Indexer->>Indexer: Check force_reindex and providers
    alt CanReconcile
        Indexer->>Indexer: _get_current_embedding_models()
        Indexer->>FileManifest: get_files_needing_embeddings(current_dense_provider, current_dense_model, current_sparse_provider, current_sparse_model)
        FileManifest-->>Indexer: files_needing

        Indexer->>Indexer: needs_dense = bool(files_needing.dense_only and EmbeddingProvider)
        Indexer->>Indexer: needs_sparse = bool(files_needing.sparse_only and SparseProvider)

        alt NeedsDenseOrSparse
            Indexer->>Indexer: add_missing_embeddings_to_existing_chunks(add_dense, add_sparse)
            activate Indexer
            loop ForEachBatchOfChunks
                Indexer->>VectorStore: scroll or retrieve points(with_payload=True, with_vectors=True)
                VectorStore-->>Indexer: points

                loop ForEachPoint
                    Indexer->>Indexer: Determine existing_vector_names
                    alt DenseMissing
                        Indexer->>EmbeddingProvider: embed_document([chunk_text])
                        EmbeddingProvider-->>Indexer: dense_emb
                        Indexer->>VectorStore: upsert dense vector for point
                    end
                    alt SparseMissing
                        Indexer->>SparseProvider: embed_document([chunk_text])
                        SparseProvider-->>Indexer: sparse_emb
                        Indexer->>VectorStore: upsert sparse vector for point
                    end
                end
            end
            deactivate Indexer
        else NoFilesNeedReconciliation
            Indexer->>Indexer: Skip reconciliation
        end
    else CannotReconcile
        Indexer->>Indexer: Skip reconciliation
    end

    Indexer->>Indexer: _finalize_indexing()

Sequence diagram for embedding provider tagging sparse batch keys

sequenceDiagram
    participant Caller
    participant EmbeddingProviderBase
    participant BatchKeys

    Caller->>EmbeddingProviderBase: _process_input(chunk_list, starter_chunks, key, for_backup)
    activate EmbeddingProviderBase
    EmbeddingProviderBase->>EmbeddingProviderBase: is_sparse_provider = type(self).__name__.lower().startswith(sparse) or contains sparse

    loop ForEachStarterChunk
        EmbeddingProviderBase->>EmbeddingProviderBase: original_idx = chunk_list.index(chunk)
        EmbeddingProviderBase->>BatchKeys: create(id=key, idx=i, sparse=is_sparse_provider)
        BatchKeys-->>EmbeddingProviderBase: batch_keys
        EmbeddingProviderBase->>EmbeddingProviderBase: chunk.set_batch_keys(batch_keys, secondary=for_backup)
    end
    deactivate EmbeddingProviderBase
    EmbeddingProviderBase-->>Caller: final_chunks

Class diagram for HealthResponse uptime computation and embedding batch key tagging

classDiagram
    class BasedModel

    class IndexingInfo
    class ServicesInfo
    class StatisticsInfo
    class ResourceInfo

    class HealthResponse {
        status Literal_healthy_degraded_unhealthy
        indexing IndexingInfo
        services ServicesInfo
        statistics StatisticsInfo
        resources ResourceInfo
        timestamp str
        stopwatch_start NonNegativeInt
        +create_with_current_timestamp(status, uptime_seconds, indexing, services, statistics, resources) HealthResponse
        +uptime_seconds() NonNegativeInt
    }

    BasedModel <|-- HealthResponse
    HealthResponse --> IndexingInfo
    HealthResponse --> ServicesInfo
    HealthResponse --> StatisticsInfo
    HealthResponse --> ResourceInfo

    class EmbeddingProviderBase {
        -_process_input(chunk_list, starter_chunks, key, for_backup) list
    }

    class BatchKeys {
        id str
        idx int
        sparse bool
        +__init__(id, idx, sparse)
    }

    EmbeddingProviderBase --> BatchKeys : uses

    note for HealthResponse "uptime_seconds is now a computed field based on stopwatch_start using time.monotonic_ns"

File-Level Changes

Change | Details | Files
Add automatic embedding reconciliation after batch indexing and avoid regenerating existing dense/sparse vectors.
  • Extend prime_index to, after batch indexing, inspect the manifest for files missing dense and/or sparse embeddings under the current models and conditionally trigger reconciliation when not force-reindexing and a vector store plus providers are available.
  • Guard the reconciliation flow with logging and exception handling so indexing completion is not blocked by reconciliation failures.
  • Modify add_missing_embeddings_to_existing_chunks to retrieve existing vectors with points, normalize the vector representation, and only generate dense or sparse embeddings when those specific vectors are absent before calling update_vectors.
src/codeweaver/engine/indexer/indexer.py
Tag embedding batches with a sparse flag to distinguish sparse providers at the batch-key level.
  • Determine whether a provider instance is sparse by inspecting its class name for a sparse prefix or substring.
  • Include a sparse boolean field in BatchKeys when assigning batch keys to chunks during _process_input, enabling downstream logic to differentiate sparse batches.
src/codeweaver/providers/embedding/providers/base.py
Rework health response uptime calculation to use a monotonic stopwatch and computed field, fixing the health endpoint failure.
  • Remove explicit uptime_seconds field from HealthResponse in favor of a computed property that derives uptime from a monotonic_ns stopwatch_start value.
  • Add timestamp and stopwatch_start fields with defaults, ensure stopwatch_start is excluded from responses, and update create_with_current_timestamp to no longer require an explicit uptime value.
src/codeweaver/server/health/models.py
Clarify and separate management vs MCP server configuration and clean up server initialization formatting.
  • Move management_host and management_port to the top-level of codeweaver.toml while introducing an [mcp_server] section strictly for the MCP HTTP server settings.
  • Reformat long argument lists and telemetry initialization in server.py for better readability without changing behavior.
codeweaver.toml
src/codeweaver/server/server.py
Add comprehensive unit tests for automatic embedding reconciliation behavior and adjust integration test formatting.
  • Introduce tests for add_missing_embeddings_to_existing_chunks covering combinations of existing/missing dense and sparse vectors, multiple chunks per file, and mixed vector states, verifying when embeddings are generated and when vector-store updates occur.
  • Add tests around prime_index to ensure reconciliation is invoked (or skipped) based on force_reindex and manifest state, and reformat a CLI integration test init_workflows argument list for clarity.
tests/unit/test_indexer_reconciliation.py
tests/integration/cli/test_init_workflows.py


github-actions bot (Contributor) commented Dec 3, 2025

👋 Hey @bashandbone,

Thanks for your contribution to codeweaver! 🧵

You need to agree to the CLA first... 🖊️

Before we can accept your contribution, you need to agree to our Contributor License Agreement (CLA).

To agree to the CLA, please comment:

I read the contributors license agreement and I agree to it.

Those exact words are important¹, so please don't change them. 😉

You can read the full CLA here: Contributor License Agreement


@bashandbone has signed the CLA.


0 out of 2 committers have signed the CLA.

  • @bashandbone
  • ❌ @Copilot

You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Footnotes

  1. Our bot needs those exact words to recognize that you agree to the CLA.

sourcery-ai bot (Contributor) left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The HealthResponse uptime calculation is currently based on a per-instance stopwatch_start defaulting to time.monotonic_ns() at model creation, which will reset on each response; consider wiring this to a process/server-level start time (e.g., passed in or stored globally) so uptime_seconds reflects true server uptime rather than response age.
  • The sparse vs dense detection in _process_input relies on heuristics from type(self).__name__ containing 'sparse', which is brittle; consider adding an explicit capability flag or property on embedding providers (e.g., is_sparse_provider or vector_type) instead of inferring from the class name (see the sketch after this list).
  • The automatic reconciliation block in prime_index wraps the entire reconciliation flow in a broad except Exception and only logs a generic warning; consider narrowing the try/except to the minimum necessary and including key contextual details (e.g., collection name, file count) in the log to aid debugging when reconciliation fails.
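
For the second point, a sketch of what an explicit capability flag could look like; the class and field names here are hypothetical, not the project's actual API:

```python
class EmbeddingProviderBase:
    # Explicit capability flag; dense by default, overridden by sparse providers.
    is_sparse_provider: bool = False


class ExampleSparseProvider(EmbeddingProviderBase):
    is_sparse_provider = True


def make_batch_key(provider: EmbeddingProviderBase, key: str, idx: int) -> dict:
    # Downstream code reads the flag directly instead of parsing the class name.
    return {"id": key, "idx": idx, "sparse": provider.is_sparse_provider}
```
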
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `HealthResponse` uptime calculation is currently based on a per-instance `stopwatch_start` defaulting to `time.monotonic_ns()` at model creation, which will reset on each response; consider wiring this to a process/server-level start time (e.g., passed in or stored globally) so `uptime_seconds` reflects true server uptime rather than response age.
- The sparse vs dense detection in `_process_input` relies on heuristics from `type(self).__name__` containing 'sparse', which is brittle; consider adding an explicit capability flag or property on embedding providers (e.g., `is_sparse_provider` or `vector_type`) instead of inferring from the class name.
- The automatic reconciliation block in `prime_index` wraps the entire reconciliation flow in a broad `except Exception` and only logs a generic warning; consider narrowing the try/except to the minimum necessary and including key contextual details (e.g., collection name, file count) in the log to aid debugging when reconciliation fails.

## Individual Comments

### Comment 1
<location> `src/codeweaver/engine/indexer/indexer.py:2459-2465` </location>
<code_context>

+                    # Check which vector types already exist in this point
+                    # We just need to check for presence, not extract the actual vectors
+                    existing_vectors = point.vector if hasattr(point, "vector") else {}
+                    if not isinstance(existing_vectors, dict):
+                        # If it's a single vector (unnamed), treat it as dense
+                        existing_vectors = {"": existing_vectors} if existing_vectors else {}
+
+                    # Get set of existing vector names for quick lookup
+                    existing_vector_names = (
+                        set(existing_vectors.keys()) if existing_vectors else set()
+                    )
</code_context>

<issue_to_address>
**issue (bug_risk):** Handling of `point.vector` for non-dict vector types may mis-detect existing named vectors.

Because we treat any non-`dict` `point.vector` as a single unnamed vector, a mapping-like container (e.g. `NamedVectors`) that isn’t literally a `dict` but holds multiple named vectors (including `""` and "sparse") would be collapsed into `{"": existing_vectors}`. That loses the individual names and can make us think `"sparse"` is missing, so we re-write it unnecessarily.

To avoid this, first detect mapping-like objects (e.g. `isinstance(existing_vectors, Mapping)` or checking for a `keys` attribute) and only fall back to the "single unnamed vector" wrapping when `point.vector` is a raw sequence / numeric array, not a richer container type.
</issue_to_address>
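
A sketch of the Mapping-aware normalization suggested here; the helper name is hypothetical:

```python
from collections.abc import Mapping
from typing import Any


def normalize_existing_vectors(point_vector: Any) -> dict[str, Any]:
    if isinstance(point_vector, Mapping):
        # Covers dicts and mapping-like containers that expose named vectors
        # without being a literal dict.
        return dict(point_vector)
    if point_vector:
        # Raw sequence / numeric array: a single unnamed (dense) vector.
        return {"": point_vector}
    return {}
```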

### Comment 2
<location> `src/codeweaver/server/health/models.py:227-231` </location>
<code_context>
+        description="Health check timestamp (ISO8601)",
+        default_factory=lambda: datetime.now(UTC).isoformat(),
+    )
+    stopwatch_start: NonNegativeInt = Field(default_factory=time.monotonic_ns, exclude=True)
+
+    @computed_field
+    @property
+    def uptime_seconds(self) -> NonNegativeInt:
+        """Calculate uptime in seconds based on stopwatch_start."""
+        elapsed_ns = time.monotonic_ns() - self.stopwatch_start
</code_context>

<issue_to_address>
**issue (bug_risk):** The new `uptime_seconds` implementation measures time since model instantiation, not server uptime.

This now computes time since the `HealthResponse` instance was created (`stopwatch_start` defaults to `time.monotonic_ns()` at construction) rather than true server uptime, which diverges from the previous behavior and the field’s description. If you want real server uptime, `stopwatch_start` should be initialized from a shared, long‑lived monotonic "server started at" timestamp instead of being set per response instance, otherwise clients will see misleading values.
</issue_to_address>

### Comment 3
<location> `src/codeweaver/server/health/models.py:237` </location>
<code_context>
+        return elapsed_ns // 1_000_000_000

     @classmethod
     def create_with_current_timestamp(
</code_context>

<issue_to_address>
**issue:** `uptime_seconds` parameter in `create_with_current_timestamp` is now unused.

The `create_with_current_timestamp` factory still takes an `uptime_seconds: NonNegativeInt` argument but no longer uses it now that `uptime_seconds` is computed. This can mislead callers and obscure the semantic change. Consider removing (or deprecating) the parameter and updating all call sites so uptime is clearly derived rather than supplied.
</issue_to_address>
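
One way to retire the argument without immediately breaking call sites, sketched with an illustrative class and signature:

```python
import warnings


class HealthResponseSketch:
    def __init__(self, status: str, **parts: object) -> None:
        self.status = status
        self.parts = parts

    @classmethod
    def create_with_current_timestamp(
        cls, status: str, uptime_seconds: int | None = None, **parts: object
    ) -> "HealthResponseSketch":
        """Accept the legacy argument but warn; uptime is now derived, not supplied."""
        if uptime_seconds is not None:
            warnings.warn(
                "uptime_seconds is ignored; it is now a computed field",
                DeprecationWarning,
                stacklevel=2,
            )
        return cls(status=status, **parts)
```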

### Comment 4
<location> `tests/unit/test_indexer_reconciliation.py:183-88` </location>
<code_context>
+            has_sparse_embeddings=False,
+        )
+
+        # Mock retrieve to return point with no vectors
+        mock_point = MagicMock()
+        mock_point.id = chunk_id
+        mock_point.payload = {"text": "def test(): pass"}
+        mock_point.vector = {}  # No vectors
+        mock_indexer._vector_store.client.retrieve.return_value = [mock_point]
+
+        # Mock set_relative_path
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test that covers the `existing_vectors` normalization when `point.vector` is not a dict (e.g. a single unnamed vector or `None`).

The normalization branch in `add_missing_embeddings_to_existing_chunks` for non-dict `point.vector` is currently untested, since all tests use dicts (`{}`, `{"": ...}`, `{"sparse": ...}`, etc.). Please add coverage for:

- `point.vector` as a list (single unnamed vector) and assert it’s treated as dense and skipped when `add_dense=False`.
- Optionally, `point.vector` as `None`, asserting that dense and sparse vectors are created when requested.

This will exercise the normalization logic and confirm reconciliation works across different `vector` representations from the store.
</issue_to_address>

### Comment 5
<location> `tests/unit/test_indexer_reconciliation.py:88-97` </location>
<code_context>
+        mock_indexer._vector_store.client.retrieve.return_value = [mock_point]
</code_context>

<issue_to_address>
**suggestion (testing):** Add tests for cases where `retrieve` returns no points or points without a `text` payload.

Since multiple branches in `add_missing_embeddings_to_existing_chunks` assume `retrieve` returns points with a `payload['text']`, it would be good to add tests for:

1. `retrieve` returning an empty list: assert that neither provider's `embed_document` nor `update_vectors` is called, and `chunks_updated` is `0`.
2. `retrieve` returning points with `payload is None` or missing the `"text"` key: assert that these are skipped (no embeddings generated/updated) and the method completes without error.

This will exercise the reconciliation logic against incomplete or missing data from the vector store.

Suggested implementation:

```python
        mock_indexer._vector_store.client.retrieve.return_value = [mock_point]

```

You’ll need to add two new tests to `tests/unit/test_indexer_reconciliation.py` to cover the scenarios where `retrieve` returns no points and where it returns points without a `text` payload. Place them alongside the other reconciliation tests and adjust fixture names to match your existing tests (e.g., `mock_indexer`, `mock_dense_provider`, `mock_sparse_provider`, etc.).

Here is concrete test code you can add at module scope (adjust fixture names/imports as needed):

```python
import pytest
from unittest.mock import MagicMock, AsyncMock

from pathlib import Path

@pytest.mark.asyncio
async def test_add_missing_embeddings_retrieve_empty_list_no_updates(
    mock_indexer,
    mock_dense_embedding_provider,
    mock_sparse_embedding_provider,
):
    # Arrange: vector store returns no points
    mock_indexer._vector_store.client.retrieve.return_value = []

    # Ensure providers are mocks/async mocks
    mock_dense_embedding_provider.embed_document = AsyncMock()
    mock_sparse_embedding_provider.embed_document = AsyncMock()
    mock_indexer._vector_store.update_vectors = AsyncMock()

    # Act
    result = await mock_indexer.add_missing_embeddings_to_existing_chunks(
        add_dense=True,
        add_sparse=True,
    )

    # Assert: no embeddings requested, no updates performed
    mock_dense_embedding_provider.embed_document.assert_not_awaited()
    mock_sparse_embedding_provider.embed_document.assert_not_awaited()
    mock_indexer._vector_store.update_vectors.assert_not_awaited()
    assert result.chunks_updated == 0


@pytest.mark.asyncio
async def test_add_missing_embeddings_retrieve_points_without_text_payload_skipped(
    mock_indexer,
    mock_dense_embedding_provider,
    mock_sparse_embedding_provider,
):
    # Arrange: create points with missing/invalid payload['text']
    point_without_payload = MagicMock()
    point_without_payload.id = "chunk-without-payload"
    point_without_payload.payload = None
    point_without_payload.vector = {}  # no existing embeddings

    point_without_text_key = MagicMock()
    point_without_text_key.id = "chunk-without-text-key"
    point_without_text_key.payload = {"not_text": "some value"}
    point_without_text_key.vector = {}  # no existing embeddings

    mock_indexer._vector_store.client.retrieve.return_value = [
        point_without_payload,
        point_without_text_key,
    ]

    # Ensure providers are mocks/async mocks
    mock_dense_embedding_provider.embed_document = AsyncMock()
    mock_sparse_embedding_provider.embed_document = AsyncMock()
    mock_indexer._vector_store.update_vectors = AsyncMock()

    # Act
    result = await mock_indexer.add_missing_embeddings_to_existing_chunks(
        add_dense=True,
        add_sparse=True,
    )

    # Assert: points without payload['text'] are ignored
    mock_dense_embedding_provider.embed_document.assert_not_awaited()
    mock_sparse_embedding_provider.embed_document.assert_not_awaited()
    mock_indexer._vector_store.update_vectors.assert_not_awaited()
    assert result.chunks_updated == 0
```

You may need to:

1. Adjust fixture names:
   - If your existing tests use different fixture names than `mock_dense_embedding_provider`, `mock_sparse_embedding_provider`, or `mock_indexer`, rename the parameters accordingly.
   - If the embedding providers are accessed through `mock_indexer` (e.g., `mock_indexer._dense_embedding_provider`), you can set `embed_document` mocks on those instead.

2. Ensure imports are consistent:
   - If `AsyncMock` is already imported elsewhere, you can remove the duplicated import.
   - If you’re not using `Path` here, you can remove that import, or if your other tests use it in the same module, keep it consistent.

3. If `add_missing_embeddings_to_existing_chunks` returns a plain integer instead of an object with `chunks_updated`, change the assertions to:
   ```python
   assert result == 0
   ```

These tests directly exercise the reconciliation logic against incomplete vector-store data: they assert that when no valid `payload['text']` is available, no embedding generation or vector updates are attempted, and the method completes successfully with `chunks_updated == 0`.
</issue_to_address>

### Comment 6
<location> `tests/unit/test_indexer_reconciliation.py:383-392` </location>
<code_context>
+class TestAutomaticReconciliation:
</code_context>

<issue_to_address>
**suggestion (testing):** Extend automatic reconciliation tests to cover cases where only one embedding provider is configured and where reconciliation raises an exception.

In addition to the current tests, it would be useful to cover:

1. **Single provider configured**
   - Case where `_embedding_provider` is `None` but `_sparse_provider` is set (and vice versa), and `files_needing` requests only dense or only sparse.
   - Assert that `add_missing_embeddings_to_existing_chunks` is called with the correct flags:
     - Only dense provider: `add_dense=True, add_sparse=False`.
     - Only sparse provider: `add_dense=False, add_sparse=True`.

2. **Exception handling**
   - Patch `Indexer.add_missing_embeddings_to_existing_chunks` to raise and verify that `prime_index` swallows the exception and completes (no exception propagated).

These additions would better pin down the reconciliation behavior under different provider configurations and failure modes.

Suggested implementation:

```python
class TestAutomaticReconciliation:
    """Test automatic reconciliation during prime_index."""

    @pytest.fixture
    def mock_indexer_with_reconciliation(self, tmp_path: Path):
        """Create indexer for testing automatic reconciliation."""
        indexer = Indexer(project_path=tmp_path, auto_initialize_providers=False)

        indexer._file_manifest = IndexFileManifest(project_path=tmp_path)
        indexer._vector_store = MagicMock()
        indexer._vector_store.collection = "test_collection"

        # Providers will be configured per-test to cover different combinations
        indexer._embedding_provider = None
        indexer._sparse_provider = None

        return indexer

    def test_single_dense_provider_reconciliation(
        self,
        mock_indexer_with_reconciliation: Indexer,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """When only dense provider is configured, reconciliation only requests dense."""
        indexer = mock_indexer_with_reconciliation

        # Only dense provider is configured
        indexer._embedding_provider = MagicMock()
        indexer._sparse_provider = None

        # files_needing should report only dense files needing reconciliation
        dense_files = [Path("file1.py")]
        sparse_files: list[Path] = []
        indexer._file_manifest.files_needing = MagicMock(return_value=(dense_files, sparse_files))

        # Patch reconciliation method to observe how it's called
        add_missing_mock = MagicMock()
        monkeypatch.setattr(
            indexer,
            "add_missing_embeddings_to_existing_chunks",
            add_missing_mock,
        )

        # Execute prime_index - should not raise
        indexer.prime_index()

        # Ensure reconciliation was called only for dense embeddings
        add_missing_mock.assert_called_once_with(add_dense=True, add_sparse=False)

    def test_single_sparse_provider_reconciliation(
        self,
        mock_indexer_with_reconciliation: Indexer,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """When only sparse provider is configured, reconciliation only requests sparse."""
        indexer = mock_indexer_with_reconciliation

        # Only sparse provider is configured
        indexer._embedding_provider = None
        indexer._sparse_provider = MagicMock()

        # files_needing should report only sparse files needing reconciliation
        dense_files: list[Path] = []
        sparse_files = [Path("file2.py")]
        indexer._file_manifest.files_needing = MagicMock(return_value=(dense_files, sparse_files))

        # Patch reconciliation method to observe how it's called
        add_missing_mock = MagicMock()
        monkeypatch.setattr(
            indexer,
            "add_missing_embeddings_to_existing_chunks",
            add_missing_mock,
        )

        # Execute prime_index - should not raise
        indexer.prime_index()

        # Ensure reconciliation was called only for sparse embeddings
        add_missing_mock.assert_called_once_with(add_dense=False, add_sparse=True)

    def test_reconciliation_exception_is_swallowed(
        self,
        mock_indexer_with_reconciliation: Indexer,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """prime_index should swallow reconciliation exceptions and complete."""
        indexer = mock_indexer_with_reconciliation

        # Configure both providers so reconciliation will be attempted
        indexer._embedding_provider = MagicMock()
        indexer._sparse_provider = MagicMock()

        # Ensure there is some work to trigger reconciliation
        dense_files = [Path("file1.py")]
        sparse_files = [Path("file2.py")]
        indexer._file_manifest.files_needing = MagicMock(return_value=(dense_files, sparse_files))

        # Patch reconciliation to raise an exception
        def _raise_reconciliation_error(*args, **kwargs):
            raise RuntimeError("Reconciliation failed")

        monkeypatch.setattr(
            indexer,
            "add_missing_embeddings_to_existing_chunks",
            _raise_reconciliation_error,
        )

        # prime_index should swallow the reconciliation error and not propagate it
        indexer.prime_index()

```

These edits assume:

1. `Indexer.prime_index()` requires no mandatory arguments.  
   - If your implementation requires parameters (e.g., `refresh=True`), update the calls accordingly.

2. `IndexFileManifest.files_needing` is callable and returns a `(dense_files, sparse_files)` tuple of `Path` objects, as implied by the existing tests.
   - If the return type or signature differs, adjust the `MagicMock(return_value=(dense_files, sparse_files))` to match your actual API.

3. `add_missing_embeddings_to_existing_chunks` is an instance method on `Indexer` that accepts keyword arguments `add_dense` and `add_sparse`.
   - If the method signature differs, update the `assert_called_once_with` expectations to match.

If `pytest` type hints (`pytest.MonkeyPatch`) or `Path`/`MagicMock` are not already imported at the top of this test file, add the appropriate imports:

```python
from pathlib import Path
from unittest.mock import MagicMock
import pytest
```

Adjust these additional imports in the file header if they are missing or named differently.
</issue_to_address>



Copilot AI (Contributor) left a comment

Pull request overview

This pull request implements automatic embedding reconciliation in the indexer to detect and add missing embeddings during indexing, along with refactoring the health endpoint's uptime calculation and clarifying configuration structure.

Key changes:

  • Automatic reconciliation logic in prime_index to detect and add missing dense/sparse embeddings post-batch-indexing
  • Refactored health endpoint to use computed uptime field with monotonic clock (though implementation has a bug)
  • Separated management server and MCP server configuration blocks in codeweaver.toml

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/unit/test_indexer_reconciliation.py: New comprehensive test suite for embedding reconciliation functionality, with tests for various scenarios (dense-only, sparse-only, both, mixed states)
  • tests/integration/cli/test_init_workflows.py: Formatting improvement: split a long argument list across multiple lines for better readability
  • src/codeweaver/server/server.py: Formatting improvements: split long conditional expressions across multiple lines
  • src/codeweaver/server/health/models.py: Refactored HealthResponse with a computed uptime_seconds field using a monotonic clock (has a critical bug); imports computed_field from the wrong package
  • src/codeweaver/providers/embedding/providers/base.py: Added sparse provider detection to tag batch keys appropriately
  • src/codeweaver/engine/indexer/indexer.py: Added automatic reconciliation in prime_index and enhanced add_missing_embeddings_to_existing_chunks to check existing vectors before adding new ones
  • codeweaver.toml: Moved management_host and management_port to the top level and renamed [server] to [mcp_server] for clearer separation


Comment on lines 227 to 234
stopwatch_start: NonNegativeInt = Field(default_factory=time.monotonic_ns, exclude=True)

@computed_field
@property
def uptime_seconds(self) -> NonNegativeInt:
"""Calculate uptime in seconds based on stopwatch_start."""
elapsed_ns = time.monotonic_ns() - self.stopwatch_start
return elapsed_ns // 1_000_000_000
Copilot AI commented Dec 3, 2025

The uptime_seconds computed field implementation is broken. The field stopwatch_start is set to time.monotonic_ns() at instance creation time (line 227), which means every time a HealthResponse instance is created, the stopwatch starts over. The computed uptime_seconds property (lines 229-234) will always return approximately 0 seconds.

The correct uptime value is calculated in HealthService.get_health_response() (line 115 in health_service.py) as int(time.time() - self._startup_time), but this value is passed to create_with_current_timestamp() and then ignored (line 240 accepts it but line 248 doesn't use it).

To fix this, you should:

  1. Remove the stopwatch_start field and computed property approach
  2. Accept uptime_seconds as a regular field parameter in create_with_current_timestamp() and pass it through to the instance creation:
return cls(
    status=status,
    uptime_seconds=uptime_seconds,  # Add this line
    indexing=indexing,
    services=services,
    statistics=statistics,
    failover=failover,
    resources=resources,
)

github-actions bot (Contributor) left a comment

Code Review - PR #188

I've reviewed the changes in this pull request. Overall, this is a well-implemented feature with excellent test coverage and thoughtful design. Here's my detailed assessment:

✅ Strengths

  1. Automatic Embedding Reconciliation - The automatic reconciliation logic in prime_index (indexer.py:1327-1376) is intelligently implemented:

    • Only runs when not force re-indexing
    • Properly checks for missing embeddings before attempting reconciliation
    • Gracefully handles failures without breaking the indexing process
    • Good logging for observability
  2. Smart Vector Detection - The enhanced add_missing_embeddings_to_existing_chunks method (indexer.py:2446-2519) correctly:

    • Retrieves vectors with with_vectors=True to check existing embeddings
    • Prevents redundant computation by checking existing_vector_names
    • Only generates embeddings that are actually missing
    • Uses proper batch updates for efficiency
  3. Health Endpoint Fix - The HealthResponse refactoring (health/models.py:227-234) is excellent:

    • Uses monotonic clock for accurate uptime tracking
    • stopwatch_start field with exclude=True prevents serialization leaks
    • @computed_field ensures uptime is always current
    • Eliminates manual timestamp management errors
  4. Comprehensive Testing - The new test file (test_indexer_reconciliation.py) provides excellent coverage:

    • Tests all vector state combinations (dense only, sparse only, both, neither)
    • Tests mixed states within a single file
    • Proper mocking of async components
    • Clear test documentation
  5. Provider Metadata Enhancement - Tagging batches with sparse indicator (base.py:1052-1062) aids in debugging and observability

🔍 Minor Observations

  1. Sparse Provider Detection (base.py:1053-1056)

    • The detection logic relies on class naming conventions (startswith('sparse') or 'sparse' in name)
    • This works but could be fragile if naming conventions change
    • Suggestion: Consider adding an explicit is_sparse attribute to provider classes for more reliable detection
  2. Configuration Clarity (codeweaver.toml:11-12)

    • Moving management_host and management_port out of [server] section improves clarity
    • The comments clearly explain the architecture
    • Good change that reduces configuration confusion
  3. Error Handling (indexer.py:1375-1376)

    • The broad except Exception with warning log is appropriate for non-critical reconciliation
    • However, specific exceptions (e.g., vector store unavailable) might warrant different handling
    • Minor: Consider logging the exception type for better diagnostics (a sketch follows this list)
  4. Test Coverage - While excellent, a couple of edge cases could be added:

    • Test behavior when vector store returns points with malformed vector data
    • Test behavior when embedding provider returns empty/invalid embeddings
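
A sketch of the exception-type logging suggested in point 3; the indexer handle and collection name are placeholders:

```python
import logging

logger = logging.getLogger(__name__)


async def reconcile_safely(indexer, collection_name: str) -> None:
    """Run reconciliation, logging failures with their type and context rather than a generic warning."""
    try:
        await indexer.add_missing_embeddings_to_existing_chunks(add_dense=True, add_sparse=True)
    except Exception as exc:
        logger.warning(
            "Embedding reconciliation failed for collection %r (%s): %s",
            collection_name,
            type(exc).__name__,
            exc,
        )
```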

🔒 Security & Performance

No security concerns identified

  • No credential handling or sensitive data exposure
  • Proper use of async operations
  • No SQL injection or command injection vectors

Performance considerations are sound

  • Batch operations used for vector updates
  • Reconciliation only runs when needed (not on force reindex)
  • Efficient vector existence checking

📝 Code Style & Best Practices

✅ Follows existing codebase patterns
✅ Proper type hints and documentation
✅ Good use of async/await patterns
✅ Appropriate logging levels
✅ Clean separation of concerns

🏗️ Overall Assessment

This PR delivers significant value:

  • Fixes the health endpoint crash (critical bug fix)
  • Adds intelligent embedding reconciliation (reduces manual intervention)
  • Improves configuration clarity
  • Maintains code quality with comprehensive tests

The implementation is production-ready. The test suite is comprehensive and the code follows best practices.

⏳ CI Status

  • ✅ Linting and formatting passed
  • ✅ CodeQL security analysis passed
  • ⏳ Tests are still running (Python 3.12, 3.13, 3.14)
  • ⚠️ CLA check failed (may need resolution)

📊 Recommendation

APPROVE - This PR is ready to merge once CI tests complete successfully. The code quality is excellent, the feature is well-tested, and it addresses real issues in the codebase.

Next Steps:

  1. Wait for test suite completion
  2. Address CLA check if needed
  3. Merge when all checks pass

github-actions bot (Contributor) commented Dec 3, 2025

Code Review Summary

I've completed a thorough review of PR #188. This is an excellent pull request that addresses critical functionality gaps while maintaining code quality. Here are my findings:

✅ Strengths

1. Embedding Reconciliation Logic (indexer.py:1327-1376, indexer.py:2446-2498)

  • Smart detection: The automatic reconciliation only runs when force_reindex=False and there's an available vector store with embedding providers
  • Efficient checking: Uses with_vectors=True to retrieve existing vectors and checks their presence before generating new embeddings
  • Proper deduplication: Checks existing_vector_names to avoid regenerating embeddings that already exist
  • Graceful error handling: Wraps reconciliation in try-except with logger.warning to prevent failures from breaking the entire indexing process
  • Clear logging: Provides informative logs about what's being reconciled

2. Health Endpoint Fix (models.py:223-234)

  • Elegant solution: Using @computed_field with time.monotonic_ns() for uptime calculation is the correct approach for monotonic time tracking
  • Proper field organization: stopwatch_start is correctly excluded from serialization with exclude=True
  • Maintains API compatibility: The uptime_seconds field remains in the response schema

3. Sparse Embedding Tagging (base.py:1052-1062)

  • Good detection heuristic: Uses class name inspection to identify sparse providers
  • Future-proof: Checks both startswith('sparse') and 'sparse' in name.lower()
  • Proper data model update: The BatchKeys type was already designed to support the sparse field

4. Configuration Improvements (codeweaver.toml)

  • Better organization: Moving management_host and management_port to the root level makes sense for server architecture
  • Clear separation: Renaming [server] to [mcp_server] clarifies the distinction between MCP and management servers

5. Comprehensive Testing (test_indexer_reconciliation.py)

  • Excellent coverage: 472 lines of well-structured unit tests covering all edge cases
  • Realistic scenarios: Tests handle mixed vector states, multiple chunks, and various embedding combinations
  • Good mocking: Properly mocks vector store, embedding providers, and file manifest

🔍 Observations

Performance Considerations:

  1. Reconciliation overhead: The automatic reconciliation runs after every non-forced index, which adds a retrieval operation with with_vectors=True. For large indices, this could be noticeable. However, this is acceptable because:

    • It only runs when needed (checks manifest first)
    • The benefit of automatic consistency far outweighs the cost
    • It's wrapped in error handling so it won't block indexing
  2. Vector retrieval: Line indexer.py:2446 now retrieves vectors for every chunk in files needing reconciliation. This is necessary for correctness but could be memory-intensive for files with many chunks. Consider monitoring memory usage in production.

Security Considerations:

  • No security concerns identified
  • All changes are defensive and internal to the indexing system
  • No user input is processed directly in the new code

Code Quality:

  • Consistent with the codebase's existing patterns and style
  • Good separation of concerns
  • Clear variable naming and comments
  • Proper type annotations

🎯 Minor Suggestions

  1. Consider adding a configuration flag for automatic reconciliation:

    # In config/settings
    auto_reconcile_embeddings: bool = True

    This would allow users to disable it if needed for performance reasons.

  2. Metric tracking: Consider adding metrics for reconciliation operations:

    • Number of chunks reconciled per session
    • Time spent in reconciliation
    • This would help monitor the performance impact
  3. Batch size for reconciliation: For very large files with thousands of chunks, consider processing chunks in batches rather than all at once to reduce memory pressure (see the sketch after this list).
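
A rough sketch of suggestion 3, using a generic helper; the retrieval call in the usage comment is illustrative, not the project's actual API:

from collections.abc import Iterable, Iterator
from itertools import islice

def batched(items: Iterable, size: int = 256) -> Iterator[list]:
    # Yield fixed-size batches so reconciliation never materializes
    # every chunk of a very large file at once.
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical usage inside reconciliation:
# for batch in batched(chunk_ids, size=256):
#     points = await store.retrieve(ids=batch, with_vectors=True)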

✅ Verdict

APPROVED

This PR is well-designed, thoroughly tested, and solves important problems:

  • Fixes the health endpoint uptime calculation bug
  • Implements automatic embedding reconciliation to maintain data consistency
  • Improves configuration clarity
  • Includes comprehensive test coverage

The code quality is excellent, performance implications are acceptable, and there are no security concerns. Once CI passes (tests are still pending), this is ready to merge.

CI Status: Linting passed ✓, tests pending, CLA check has a minor failure but that appears to be a separate issue.

Great work! 🎉

Contributor

@github-actions github-actions bot left a comment


Code Review for PR #188

I've completed a thorough review of the changes in this pull request. Overall, this is a well-structured and valuable contribution that adds automatic embedding reconciliation, fixes a critical health endpoint bug, and improves configuration clarity. However, there are several important issues that should be addressed before merging.


🎯 Summary

This PR introduces automatic post-indexing embedding reconciliation, fixes health endpoint uptime reporting, clarifies server configuration, and adds comprehensive tests. The main features are:

  1. Automatic Embedding Reconciliation - Detects and fixes missing dense/sparse embeddings after batch indexing
  2. Health Endpoint Fix - Corrects uptime calculation that was preventing the health endpoint from responding
  3. Configuration Improvements - Separates management vs MCP server configuration for clarity
  4. Comprehensive Testing - Adds 472 lines of new test coverage for reconciliation logic

⚠️ Critical Issues

1. Health Endpoint Uptime Calculation (Bug)

Location: src/codeweaver/server/health/models.py:227-234

Issue: The new uptime_seconds computed field measures time since the HealthResponse instance was created, not actual server uptime. The stopwatch_start field defaults to time.monotonic_ns() at model instantiation, which means every health check response will show a near-zero uptime.

Evidence: The create_with_current_timestamp method at line 240 still accepts an uptime_seconds parameter but no longer uses it. The health service at health_service.py:115 calculates uptime_seconds = int(time.time() - self._startup_time) and passes it to the factory method, but it's now ignored.

Impact: This breaks the health endpoint's uptime reporting, which is critical for monitoring and observability.

Recommendation:

  • Either revert to passing uptime_seconds explicitly and remove the computed field
  • Or pass a server-level startup_time to the HealthResponse and compute uptime from that
  • Update create_with_current_timestamp to accept and use stopwatch_start as a parameter initialized from self._startup_time
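
A minimal sketch of the second option, assuming the service captures a monotonic marker at startup (attribute and method names are illustrative):

import time

class HealthService:
    def __init__(self) -> None:
        # Captured once at server startup, not per health response.
        self._stopwatch_start = time.monotonic_ns()

    def current_uptime_seconds(self) -> int:
        # Elapsed monotonic time since startup, immune to wall-clock jumps.
        return int((time.monotonic_ns() - self._stopwatch_start) / 1_000_000_000)

The response factory would then receive this server-level marker (e.g. stopwatch_start=self._stopwatch_start) instead of defaulting to the moment the response model is instantiated.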

2. Brittle Sparse Provider Detection

Location: src/codeweaver/providers/embedding/providers/base.py:1052-1056

Issue: The code detects sparse providers by checking if the class name contains "sparse":

is_sparse_provider = (
    type(self).__name__.lower().startswith("sparse")
    or "sparse" in type(self).__name__.lower()
)

Problems:

  • Relies on naming conventions rather than explicit behavior
  • Fragile - breaks if class is renamed or doesn't follow convention
  • Could false-positive on classes with "sparse" in name for other reasons

Recommendation: Add an explicit is_sparse_provider property or vector_type attribute to embedding provider classes. This makes the intent clear and prevents future naming issues.
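
Sketched below is what such an explicit flag could look like; class and attribute names are illustrative:

class EmbeddingProviderBase:
    # Dense by default; sparse subclasses flip this once, so no caller
    # ever has to inspect a class name.
    is_sparse_provider: bool = False

class SomeSparseProvider(EmbeddingProviderBase):
    is_sparse_provider = True

def batch_key_tag(provider: EmbeddingProviderBase) -> str:
    return "sparse" if provider.is_sparse_provider else "dense"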

3. Overly Broad Exception Handling

Location: src/codeweaver/engine/indexer/indexer.py:1375-1376

Issue: The automatic reconciliation is wrapped in except Exception that catches everything:

except Exception:
    logger.warning("Automatic reconciliation failed", exc_info=True)

Problems:

  • Swallows all errors including programming errors, memory issues, etc.
  • Limited context in the warning (no collection name, file count, etc.)
  • Makes debugging difficult

Recommendation:

  • Narrow the exception handling to specific expected exceptions
  • Include contextual information in the log: collection name, number of files needing reconciliation, current models, etc.
  • Consider whether reconciliation failures should stop indexing or just warn
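
A hedged sketch of the recommendation; the exception set, method name, and log fields are assumptions:

import logging

logger = logging.getLogger(__name__)

# Treat only infrastructure failures as recoverable; programming errors
# still propagate and fail fast.
RECOVERABLE = (ConnectionError, TimeoutError, OSError)

async def reconcile_safely(indexer, collection_name: str, files: list) -> None:
    try:
        await indexer.reconcile_missing_embeddings(files)
    except RECOVERABLE as exc:
        logger.warning(
            "Automatic reconciliation failed (collection=%s, files=%d): %s",
            collection_name,
            len(files),
            exc,
            exc_info=True,
        )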

🐛 Code Quality Issues

4. Missing Vector Type Normalization Handling

Location: src/codeweaver/engine/indexer/indexer.py:2459-2465

Issue: The vector normalization code assumes point.vector is either a dict or a list, but Qdrant can return other types like NamedVectors objects:

existing_vectors = point.vector if hasattr(point, "vector") else {}
if not isinstance(existing_vectors, dict):
    existing_vectors = {"":existing_vectors} if existing_vectors else {}

Problem: A NamedVectors object (or other Mapping) that isn't literally a dict would be treated as a single unnamed vector, losing the individual vector names.

Recommendation: Check for mapping-like objects first:

from collections.abc import Mapping

existing_vectors = point.vector if hasattr(point, "vector") else {}
if isinstance(existing_vectors, Mapping):
    # Already a mapping, use as-is
    existing_vector_names = set(existing_vectors.keys())
elif existing_vectors:
    # Single unnamed vector
    existing_vector_names = {""} 
else:
    existing_vector_names = set()

5. Configuration Migration Note Needed

Location: codeweaver.toml:11-12, 40-47

Issue: The configuration structure changed:

  • management_host and management_port moved from [server] to top-level
  • [server] renamed to [mcp_server]

Problem: Existing users will have the old configuration format and may experience issues.

Recommendation:

  • Add a migration note to the PR description or CHANGELOG
  • Consider adding backwards compatibility or a config migration helper
  • Document the new structure clearly

✅ Strengths

  1. Excellent Test Coverage - The new test_indexer_reconciliation.py is comprehensive and well-structured
  2. Smart Reconciliation Design - Checking existing vectors before generating new ones prevents waste
  3. Clear Code Documentation - Good inline comments explaining the reconciliation logic
  4. Proper Async Handling - Correctly uses async/await throughout
  5. Logging - Good informational logging during reconciliation process

🔍 Testing Gaps

Missing Test Scenarios:

  1. Non-dict vector types: Tests only use dict representations ({}, {"":...}), not lists or other types
  2. Empty retrieve results: No test for when vector store returns no points
  3. Missing payload/text: No test for points without payload["text"]
  4. Exception scenarios: No test verifying that prime_index handles reconciliation exceptions
  5. Single provider configs: No test for when only dense OR only sparse provider is configured

These are already highlighted in Sourcery's review and should be added.


📋 Minor Issues

  1. Unused parameter: uptime_seconds parameter in create_with_current_timestamp is unused (line 240)
  2. Import order: Minor - could group anthropic imports with other third-party imports
  3. Type hints: the _telemetry_keys method is annotated as returning None even though it never returns a meaningful value; the annotation and the method body could be made consistent

🚦 Recommendation

REQUEST CHANGES ⚠️

While this is a solid contribution with valuable features, the critical issues (especially #1 health endpoint bug and #2 brittle detection) should be addressed before merging. The health endpoint bug is particularly concerning as it replaces working functionality with broken functionality.

Priority fixes:

  1. Fix the health endpoint uptime calculation (Critical)
  2. Replace sparse provider name-based detection with explicit flag (High)
  3. Improve exception handling context (Medium)
  4. Add test coverage for edge cases (Medium)

Once these are addressed, this will be an excellent addition to the codebase. The reconciliation feature is well-designed and solves a real problem.


💬 Questions for Author

  1. How did you test the health endpoint changes? The uptime calculation appears to be broken.
  2. Have you tested with non-standard embedding provider class names?
  3. What should happen if reconciliation fails - should indexing continue or fail?
  4. Have you tested the config changes with existing configurations?

CI Status Note: Tests are still running, but lint and security checks have passed.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 3, 2025 15:37
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/codeweaver/server/health/models.py:255

  • The create_with_current_timestamp method signature still accepts an uptime_seconds parameter, but this parameter is no longer used in the method body. This creates a misleading API where callers pass an uptime value that is completely ignored.

The computed uptime_seconds field will always calculate elapsed time from when the HealthResponse instance was created (using stopwatch_start), not from the actual server startup time as intended. This means every health response will report an uptime of approximately 0 seconds, regardless of how long the server has been running.

Solution: Either:

  1. Remove the uptime_seconds parameter from the method signature (breaking change), or
  2. Accept and use it by passing stopwatch_start=time.monotonic_ns() - (uptime_seconds * 1_000_000_000) when constructing the instance, or
  3. Revert this refactoring and keep uptime_seconds as a regular field that's passed in by the caller (the original design was correct).

    def create_with_current_timestamp(
        cls,
        status: Literal["healthy", "degraded", "unhealthy"],
        uptime_seconds: int,
        indexing: IndexingInfo,
        services: ServicesInfo,
        statistics: StatisticsInfo,
        failover: FailoverInfo | None = None,
        resources: ResourceInfo | None = None,
    ) -> HealthResponse:
        """Create health response with current timestamp."""
        return cls(
            status=status,
            indexing=indexing,
            services=services,
            statistics=statistics,
            failover=failover,
            resources=resources,
        )


Comment on lines 1052 to 1056
# Detect if this is a sparse embedding provider
is_sparse_provider = (
    type(self).__name__.lower().startswith("sparse")
    or "sparse" in type(self).__name__.lower()
)

Copilot AI Dec 3, 2025


The sparse provider detection relies on checking if the class name contains "sparse" (case-insensitive). While this works for the current implementations (FastEmbedSparseProvider, SentenceTransformersSparseProvider), it's fragile and could break if:

  1. A future sparse provider doesn't follow this naming convention
  2. A dense provider happens to have "sparse" in its name for some other reason

Recommendation: Use isinstance(self, SparseEmbeddingProvider) instead of string matching on the class name. This is more robust and correctly identifies sparse providers based on their actual inheritance hierarchy.

@bashandbone
Contributor Author

@copilot please address the identified issues and recommendations in PR #188 comments.

Specifically these points:

⚠️ Critical Issues

  1. Health Endpoint Uptime Calculation (Bug)

Location: src/codeweaver/server/health/models.py:227-234

Issue: The new uptime_seconds computed field measures time since the HealthResponse instance was created, not actual server uptime. The stopwatch_start field defaults to time.monotonic_ns() at model instantiation, which means every health check response will show a near-zero uptime.

Evidence: The create_with_current_timestamp method at line 240 still accepts an uptime_seconds parameter but no longer uses it. The health service at health_service.py:115 calculates uptime_seconds = int(time.time() - self._startup_time) and passes it to the factory method, but it's now ignored.

Impact: This breaks the health endpoint's uptime reporting, which is critical for monitoring and observability.

Recommendation:

Either revert to passing uptime_seconds explicitly and remove the computed field
Or pass a server-level startup_time to the HealthResponse and compute uptime from that
Update create_with_current_timestamp to accept and use stopwatch_start as a parameter initialized from self._startup_time

USER_NOTE: This was an oversight. We were correcting an issue where the calculation returned negative time. Ensure the calculation is against server uptime rather than model/instance uptime. If it doesn't already have one, we may need to introduce a stopwatch_time attribute to the server to capture monotonic time -- in general we want to move elapsed time calculations to use monotonic time to prevent these sorts of issues.

  2. Brittle Sparse Provider Detection

Location: src/codeweaver/providers/embedding/providers/base.py:1052-1056

Issue: The code detects sparse providers by checking if the class name contains "sparse":

is_sparse_provider = (
    type(self).__name__.lower().startswith("sparse")
    or "sparse" in type(self).__name__.lower()
)

Problems:

Relies on naming conventions rather than explicit behavior
Fragile - breaks if class is renamed or doesn't follow convention
Could false-positive on classes with "sparse" in name for other reasons

Recommendation: Add an explicit is_sparse_provider property or vector_type attribute to embedding provider classes. This makes the intent clear and prevents future naming issues.

User note: There is an is_sparse_provider property on the codeweaver.providers.provider.Provider enum, but it doesn't tell you whether a class is sparse or dense, because all current sparse providers are also embedding providers (the similar is_embedding_provider would return True for them too). The simplest fix here is an issubclass(type(self), SparseEmbeddingProvider) check -- just beware that could easily create circular import issues; I believe we used the name-based check to avoid exactly that. A sketch follows.
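
A sketch of that approach, deferring the import into the method body to sidestep the circular import; the module path is an assumption:

class EmbeddingProviderBase:
    def _is_sparse(self) -> bool:
        # Importing at call time rather than module import time breaks
        # the base -> sparse -> base cycle.
        from codeweaver.providers.embedding.providers.sparse import (  # assumed path
            SparseEmbeddingProvider,
        )
        return isinstance(self, SparseEmbeddingProvider)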

  3. Overly Broad Exception Handling

Location: src/codeweaver/engine/indexer/indexer.py:1375-1376

Issue: The automatic reconciliation is wrapped in except Exception that catches everything:

except Exception:
    logger.warning("Automatic reconciliation failed", exc_info=True)

Problems:

Swallows all errors including programming errors, memory issues, etc.
Limited context in the warning (no collection name, file count, etc.)
Makes debugging difficult

Recommendation:

Narrow the exception handling to specific expected exceptions
Include contextual information in the log: collection name, number of files needing reconciliation, current models, etc.
Consider whether reconciliation failures should stop indexing or just warn

🐛 Code Quality Issues
4. Missing Vector Type Normalization Handling

Location: src/codeweaver/engine/indexer/indexer.py:2459-2465

Issue: The vector normalization code assumes point.vector is either a dict or a list, but Qdrant can return other types like NamedVectors objects:

existing_vectors = point.vector if hasattr(point, "vector") else {}
if not isinstance(existing_vectors, dict):
    existing_vectors = {"":existing_vectors} if existing_vectors else {}

Problem: A NamedVectors object (or other Mapping) that isn't literally a dict would be treated as a single unnamed vector, losing the individual vector names.

Recommendation: Check for mapping-like objects first:

from collections.abc import Mapping

existing_vectors = point.vector if hasattr(point, "vector") else {}
if isinstance(existing_vectors, Mapping):
    # Already a mapping, use as-is
    existing_vector_names = set(existing_vectors.keys())
elif existing_vectors:
    # Single unnamed vector
    existing_vector_names = {""}
else:
    existing_vector_names = set()

✅ Strengths

Excellent Test Coverage - The new test_indexer_reconciliation.py is comprehensive and well-structured
Smart Reconciliation Design - Checking existing vectors before generating new ones prevents waste
Clear Code Documentation - Good inline comments explaining the reconciliation logic
Proper Async Handling - Correctly uses async/await throughout
Logging - Good informational logging during reconciliation process

🔍 Testing Gaps
Missing Test Scenarios:

Non-dict vector types: Tests only use dict representations ({}, {"":...}), not lists or other types
Empty retrieve results: No test for when vector store returns no points
Missing payload/text: No test for points without payload["text"]
Exception scenarios: No test verifying that prime_index handles reconciliation exceptions
Single provider configs: No test for when only dense OR only sparse provider is configured

Carefully review sourcery-ai's test improvement recommendations and implement them. Ensure tests pass before submitting a PR. Commit incrementally as you work.
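
As one example of closing the first gap, the Mapping-aware normalization recommended above can be factored into a helper and exercised directly; the names here are illustrative, not the project's actual API:

from collections.abc import Mapping

def vector_names(vector) -> set[str]:
    # Mirrors the normalization recommended in the review above.
    if isinstance(vector, Mapping):
        return set(vector.keys())
    return {""} if vector else set()

def test_vector_names_handles_non_dict_types():
    assert vector_names([0.1, 0.2]) == {""}  # bare list -> single unnamed vector
    assert vector_names({}) == set()  # no vectors at all
    assert vector_names({"dense": [0.1], "sparse": [0.2]}) == {"dense", "sparse"}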

Contributor

Copilot AI commented Dec 3, 2025

@bashandbone I've opened a new pull request, #189, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 4 commits December 3, 2025 16:25
…, vector normalization, and tests

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
… comments

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
fix: health endpoint uptime, sparse provider detection, exception handling, vector normalization
@bashandbone
Contributor Author

@copilot multiple tests are failing; please correct the issue. Test output:

ERROR tests/integration/test_health_monitoring.py::test_health_endpoint_responds - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_response_schema_validation - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_during_indexing - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_status_healthy - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_status_degraded - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_status_unhealthy - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_indexing_progress - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_service_states - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_circuit_breaker_exposure - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_statistics - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_performance - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_parallel_component_collection - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
ERROR tests/integration/test_health_monitoring.py::test_health_json_serialization - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
FAILED tests/integration/test_error_recovery.py::test_health_shows_degraded_status - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
FAILED tests/integration/test_phase4_status_flow.py::TestPhase4StatusFlow::test_health_endpoint_includes_failover_info - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
FAILED tests/integration/test_phase4_status_flow.py::TestPhase4StatusFlow::test_health_endpoint_during_failover - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
FAILED tests/integration/test_phase4_status_flow.py::TestPhase4StatusFlow::test_end_to_end_failover_flow - TypeError: HealthService.__init__() got an unexpected keyword argument 'startup_time'
FAILED tests/unit/cli/test_start_command.py::TestStartDaemonBackground::test_start_daemon_background_finds_executable - AssertionError: assert '--foreground' in ['/usr/local/bin/cw', 'start', '--project', '/tmp/pytest-of-runner/pytest-0/test_start_daemon_background_f0/project', '--mcp-host', '127.0.0.1', ...]
FAILED tests/unit/cli/test_start_command.py::TestStartDaemonBackground::test_start_daemon_background_uses_python_fallback - AssertionError: assert '-m' in ['/home/runner/work/codeweaver/codeweaver/.venv/bin/python3', '/home/runner/work/codeweaver/codeweaver/src/codeweaver/cli/main.py', 'start', '--project', '/tmp/pytest-of-runner/pytest-0/test_start_daemon_background_u0/project', '--mcp-host', ...]
FAILED tests/unit/test_indexer_reconciliation.py::TestAutomaticReconciliation::test_reconciliation_called_during_prime_index - AssertionError: Expected 'mock' to have been called once. Called 0 times.
FAILED tests/unit/test_indexer_reconciliation.py::TestReconciliationExceptionHandling::test_prime_index_handles_provider_error - AssertionError: Expected 'mock' to have been called once. Called 0 times.
FAILED tests/unit/test_indexer_reconciliation.py::TestReconciliationExceptionHandling::test_prime_index_handles_indexing_error - AssertionError: Expected 'mock' to have been called once. Called 0 times.
FAILED tests/unit/test_indexer_reconciliation.py::TestReconciliationExceptionHandling::test_prime_index_handles_connection_error - AssertionError: Expected 'mock' to have been called once. Called 0 times.
= 10 failed, 641 passed, 31 skipped, 125 deselected, 13 errors in 273.07s (0:04:33) =
[test-cov] ERROR task failed

Contributor

Copilot AI commented Dec 3, 2025

@bashandbone I've opened a new pull request, #190, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits December 3, 2025 17:41
… tests

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
…tions

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI and others added 2 commits December 3, 2025 17:49
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Fix test failures: HealthService parameter name and test assertion corrections
Copilot AI review requested due to automatic review settings December 3, 2025 17:58
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.



@bashandbone bashandbone merged commit 7755eb1 into main Dec 3, 2025
25 of 27 checks passed
@bashandbone bashandbone deleted the fix-sparse-retention branch December 3, 2025 18:13
@github-actions github-actions bot locked and limited conversation to collaborators Dec 3, 2025