cpdata · cpdata · Oct 14, 2025
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,48 @@
+name: CI
+
+on:
+  push:
+    branches: ["main", "review", "review-1"]
+  pull_request:
+
+env:
+  PIP_DISABLE_PIP_VERSION_CHECK: "1"
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out
+        uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install toolchain
+        run: |
+          pip install uv
+          uv pip install --system -e .
+          uv pip install --system ruff pyright typeguard toml-sort yamllint
+      - name: Lint and format checks
+        run: make fmt-check
+      - name: Docs guard
+        env:
+          BASE_REF: ${{ github.event.pull_request.base.sha || 'HEAD~1' }}
+        run: make docs-guard
+
+  tests:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out
+        uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install dependencies
+        run: |
+          pip install uv
+          uv pip install --system -e .
+          uv pip install --system pytest
+      - name: Run pytest
+        run: make test
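
Review note on the `Docs guard` step: the workflow passes `BASE_REF` to `make docs-guard`, but the guard script (`scripts/check_docs_sync.py`, per the changelog) is not part of the hunks shown here. The sketch below is only an illustration of the kind of check such a guard might run. `DOC_MAP`, its entries, and `missing_doc_updates` are hypothetical names, not the project's actual implementation.

```python
from __future__ import annotations

# Hypothetical mapping: code path prefix -> docs expected to change alongside it.
DOC_MAP: dict[str, tuple[str, ...]] = {
    "meshmind/retrieval/": ("docs/retrieval.md",),
    "meshmind/db/": ("docs/persistence.md",),
}


def missing_doc_updates(
    changed: list[str],
    doc_map: dict[str, tuple[str, ...]] = DOC_MAP,
) -> dict[str, list[str]]:
    """Return {code_prefix: [docs not touched]} for every prefix with code changes."""
    changed_set = set(changed)
    missing: dict[str, list[str]] = {}
    for prefix, docs in doc_map.items():
        # Skip prefixes whose code was not modified in this change set.
        if not any(path.startswith(prefix) for path in changed):
            continue
        stale = [doc for doc in docs if doc not in changed_set]
        if stale:
            missing[prefix] = stale
    return missing
```

In CI the `changed` list would presumably come from something like `git diff --name-only "$BASE_REF" HEAD`. Note that `actions/checkout@v4` fetches only a single commit by default (`fetch-depth: 1`), so `HEAD~1` or the pull request base SHA may not resolve unless the checkout depth is raised.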
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,18 @@
+# Agent Instructions
+
+## Documentation Workflow
+- After each batch of changes, add a `CHANGELOG.md` entry with an ISO 8601 date/time stamp and developer-facing detail (files, modules, functions, variables, and rationale). Every commit should correspond to a fresh entry.
+- Maintain `README.md` as the canonical description of the project; update it whenever behaviour or workflows change. Archive older versions separately when requested.
+- Keep the `docs/` wiki and provisioning guides (`SETUP.md`, `ENVIRONMENT_NEEDS.md`) in sync with code updates; add or revise the
+  relevant page whenever features, modules, or workflows change.
+- After each iteration, refresh `ISSUES.md`, `SOT.md`, `PLAN.md`, `RECOMMENDATIONS.md`, `TODO.md`, and related documentation to stay in sync with the codebase.
+- Ensure `TODO.md` retains the `Completed`, `Priority Tasks`, and `Recommended Waiting for Approval Tasks` sections, moving finished items under `Completed` at the end of every turn.
+- Update `RESUME_NOTES.md` at the end of every turn so the next session starts with accurate context.
+- When beginning a turn, review `README.md`, `PROJECT.md`, `PLAN.md`, `RECOMMENDATIONS.md`, `ISSUES.md`, and `SOT.md` to harvest new actionable work. Maintain at least ten quantifiable, prioritised items in the `Priority Tasks` section of `TODO.md`, adding context or links when needed.
+- After completing any task, immediately update `TODO.md`, check for the next actionable item, and continue iterating until all unblocked `Priority Tasks` are exhausted for the session.
+- Continuously loop through planning and execution: finish a task, document it, surface new follow-ups, and resume implementation so long as environment blockers allow. If extra guidance would improve throughput, extend these instructions proactively.
+
+## Style Guidelines
+- Use descriptive Markdown headings starting at level 1 for top-level documents.
+- Keep lines to 120 characters or fewer when practical.
+- Prefer bullet lists for enumerations instead of inline commas.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,126 @@
+# Changelog
+
+## [2025-10-17T18:45:00Z]
+### Added
+- Created a Dockerfile for integration workloads and introduced targeted Compose stacks
+  under `meshmind/tests/docker/` (Memgraph, Neo4j, Redis, full-stack) alongside a
+  developer-facing provisioning guide in `SETUP.md` to document service bootstrapping
+  commands and environment requirements.
+
+### Changed
+- Expanded `pyproject.toml` to install optional dependencies (`fastapi`,
+  `uvicorn[standard]`, `neo4j`, `mgclient`, `redis`) by default and defined extras
+  (`dev`, `docs`, `testing`); updated the `Makefile` `install` target accordingly and
+  regenerated setup documentation across `README.md`, `docs/`, `PROJECT.md`, `PLAN.md`,
+  `SOT.md`, `NEEDED_FOR_TESTING.md`, `ENVIRONMENT_NEEDS.md`, `FINDINGS.md`,
+  `RECOMMENDATIONS.md`, and `RESUME_NOTES.md` to reference the new workflow and
+  credentials.
+- Reworked the root `docker-compose.yml` to provision Memgraph, Neo4j, and Redis with
+  health checks and volumes, added Compose variants in `meshmind/tests/docker/`, and
+  refreshed onboarding materials (`SETUP.md`, `README.md`, `docs/configuration.md`,
+  `docs/operations.md`, `docs/testing.md`) to call out the new ports, credentials, and
+  teardown guidance.
+- Replaced references to `pymgclient` with `mgclient` throughout dependency notes and
+  environment files to match the updated driver import.
+
+### Fixed
+- Patched `meshmind/cli/admin.py` to import `argparse`, restoring CLI admin command
+  registration after the module refactor.
+- Updated `.github/workflows/ci.yml` to pass `--system` to `uv pip install`, resolving
+  the "No virtual environment found" failure during lint/test setup.
+
+## [2025-10-16T18:30:00Z]
+### Fixed
+- Adjusted `meshmind/tests/test_service_interfaces.py::test_memory_service_ingest_and_search` to return a hydrated `Memory`
+  instance from the monkey-patched `list_memories` stub, ensuring pagination-aware search paths remain asserted while avoiding
+  empty result sets during verification.
+
+## [2025-10-16T12:00:00Z]
+### Added
+- Introduced pagination-aware graph access by adding `search_entities` and `count_entities` to every `GraphDriver` implementation, wiring a new `meshmind admin counts` CLI subcommand and REST `/memories/counts` route through `MemoryManager`, `MemoryService`, and the MeshMind client.
+- Added `scripts/check_docs_sync.py` plus a Makefile target, CI step, and pytest coverage to guard documentation updates whenever code under mapped modules changes.
+
+### Changed
+- Extended `MemoryManager.list_memories`, MeshMind client helpers, retrieval graph wrappers, and service adapters to forward `offset`, `limit`, and `query` hints, delegating filtering to the active driver before in-memory scoring.
+- Updated examples and tests (`meshmind/tests/test_db_drivers.py`, `test_service_interfaces.py`, `test_graph_retrieval.py`, `test_cli_admin.py`, `test_client.py`, `test_docs_guard.py`) to cover pagination, counts, and driver-side search semantics.
+
+### Documentation
+- Refreshed `README.md`, `PROJECT.md`, `PLAN.md`, `RECOMMENDATIONS.md`, `ISSUES.md`, `SOT.md`, `FINDINGS.md`, `AGENTS.md`, `TODO.md`, and the developer wiki (`docs/api.md`, `docs/development.md`, `docs/operations.md`, `docs/persistence.md`, `docs/retrieval.md`, `docs/troubleshooting.md`) to describe pagination, counts, docs-guard workflows, and updated service interfaces.
+## [2025-10-15T15:30:00Z]
+### Added
+- Created a developer wiki under `docs/` covering architecture, pipelines, persistence, retrieval, configuration, testing, operations, telemetry, and development workflows so code changes stay synchronized with reference material.
+- Authored `ENVIRONMENT_NEEDS.md` to request optional dependency installs and external services, plus `RESUME_NOTES.md` for session-to-session continuity.
+
+### Changed
+- Expanded the `GraphDriver` contract to accept namespace and entity-label filters when listing entities, updating the in-memory, SQLite, Neo4j, and Memgraph drivers to push filtering into their native query layers.
+- Propagated the new filtering through `MemoryManager`, `MeshMind.list_memories`, graph-backed retrieval wrappers, and service interfaces (REST/gRPC), ensuring hybrid searches hydrate only the required entity types.
+- Updated tests (`meshmind/tests/test_graph_retrieval.py`, `test_pipeline_preprocess_store.py`, `test_service_interfaces.py`) to cover entity-label filtering across client, REST, and gRPC paths.
+
+### Documentation
+- Refreshed `README.md`, `PROJECT.md`, `PLAN.md`, `RECOMMENDATIONS.md`, `ISSUES.md`, `SOT.md`, `DISCREPANCIES.md`, `FINDINGS.md`, `TODO.md`, and `AGENTS.md` to describe the new driver filtering, documentation workflow, environment checklist, and wiki requirements.
+
+## [2025-02-15T00:45:00Z]
+### Added
+- Introduced `meshmind/retrieval/graph.py` with hybrid/vector/regex/exact/BM25/fuzzy wrappers that hydrate candidates from the active `GraphDriver` before delegating to existing scorers, plus `meshmind/tests/test_graph_retrieval.py` to verify namespace filtering and hybrid integration.
+- Added `meshmind/cli/admin.py` and wired `meshmind/cli/__main__.py` to expose `admin` subcommands for predicate management, maintenance telemetry, and graph connectivity checks; created `meshmind/tests/test_cli_admin.py` to cover the new flows.
+- Created `meshmind/tests/test_neo4j_driver.py` and a `Neo4jGraphDriver.verify_connectivity` helper to exercise driver-level sanity checks without a live cluster.
+- Logged importance score distributions via `meshmind/pipeline/preprocess.summarize_importance` so telemetry captures mean/stddev/recency metrics after scoring.
+
+### Changed
+- Updated `MeshMind` search helpers (`meshmind/client.py`) to auto-load memories from the configured driver when `memories` is `None`, reusing the new graph-backed wrappers.
+- Reworked `meshmind/pipeline/consolidate.py` to return a `ConsolidationPlan` with batch/backoff thresholds and skipped-group tracking; `meshmind/tasks/scheduled.consolidate_task` now emits skip counts and returns a structured summary.
+- Tuned Python compatibility metadata to `>=3.11,<3.13` in `pyproject.toml` and refreshed docs (`README.md`, `NEEDED_FOR_TESTING.md`, `SOT.md`) accordingly.
+- Enhanced `meshmind/pipeline/preprocess.py` to emit telemetry gauges for importance scoring and added `meshmind/tests/test_pipeline_preprocess_store.py::test_score_importance_records_metrics`.
+- Expanded retrieval, CLI, and driver test coverage (`meshmind/tests/test_retrieval.py`, `meshmind/tests/test_tasks_scheduled.py`) to account for graph-backed defaults and new return types.
+
+### Documentation
+- Updated `README.md`, `PROJECT.md`, `PLAN.md`, `SOT.md`, `FINDINGS.md`, `DISCREPANCIES.md`, `RECOMMENDATIONS.md`, `NEEDED_FOR_TESTING.md`, `ISSUES.md`, and `TODO.md` to describe graph-backed retrieval wrappers, CLI admin tooling, consolidation backoff behaviour, telemetry metrics, and revised Python support.
+- Copied the refreshed README guidance into `README_OLD.md` as an archival reference while keeping `README.md` as the primary source.
+
+## [2025-10-14T14:57:47Z]
+### Added
+- Introduced `meshmind/_compat/pydantic.py` to emulate `BaseModel`, `Field`, and `ValidationError` when Pydantic is unavailable, enabling tests to run in constrained environments.
+- Added `meshmind/testing/fakes.py` with `FakeMemgraphDriver`, `FakeRedisBroker`, and `FakeEmbeddingEncoder`, plus a package export and dedicated pytest coverage (`meshmind/tests/test_db_drivers.py`, `meshmind/tests/test_tasks_scheduled.py`).
+- Created heuristics-focused test cases for consolidation outcomes, maintenance tasks, and the revised retrieval dispatcher to guarantee behaviour without external services.
+
+### Changed
+- Replaced the constant importance assignment in `meshmind/pipeline/preprocess.score_importance` with a heuristic that factors token diversity, recency, metadata richness, and embedding magnitude.
+- Rebuilt `meshmind/pipeline/consolidate` around a `ConsolidationOutcome` dataclass that merges metadata, averages embeddings, and surfaces removal IDs; `meshmind/tasks/scheduled.consolidate_task` now applies updates and deletes duplicates lazily via `_get_manager`/`_reset_manager` helpers.
+- Hardened Celery maintenance tasks by logging driver initialization failures, tracking update counts, and returning deterministic totals; compression counts now reflect the number of persisted updates.
+- Updated `meshmind/core/similarity`, `meshmind/retrieval/bm25`, and `meshmind/retrieval/fuzzy` with pure-Python fallbacks so numpy, scikit-learn, and rapidfuzz remain optional.
+- Adjusted `meshmind/pipeline/extract.extract_memories` to defer `openai` imports until a default client is required, unblocking DummyLLM-driven tests.
+- Reworked `meshmind/retrieval/search.search` to rerank the original (filtered) candidate ordering, prepend reranked results, and append hybrid-sorted fallbacks, preventing index drift when rerankers return relative positions.
+- Normalised SQLite entity hydration in `meshmind/db/sqlite_driver._row_to_dict` so JSON metadata is decoded only when stored as strings.
+- Refreshed pytest fixtures (`meshmind/tests/conftest.py`, `meshmind/tests/test_pipeline_preprocess_store.py`) to use deterministic encoders and driver doubles, ensuring CRUD and retrieval suites run without live services.
+
+### Documentation
+- Promoted `README.md` as the single source of truth (archiving the previous copy in `README_OLD.md`) and documented the new heuristics, compatibility shims, and test doubles.
+- Updated `NEEDED_FOR_TESTING.md` with notes about the compatibility layer, optional dependencies, and fake drivers.
+- Reconciled `PROJECT.md`, `ISSUES.md`, `PLAN.md`, `SOT.md`, `RECOMMENDATIONS.md`, `DISCREPANCIES.md`, `FINDINGS.md`, `TODO.md`, and `CHANGELOG.md` to capture the new persistence behaviour, heuristics, fallbacks, and remaining roadmap items.
+
+## [Unreleased] - 2025-02-14
+### Added
+- Configurable graph driver factory with in-memory, SQLite, Memgraph, and optional Neo4j implementations plus supporting tests.
+- REST and gRPC service layers (with FastAPI stub fallback) for ingestion and retrieval, including coverage in the test suite.
+- Observability utilities that collect metrics and structured logs across pipelines and scheduled Celery tasks.
+- Docker Compose definition provisioning Memgraph, Redis, and a Celery worker for local development.
+- Vector-only, regex, exact-match, and optional LLM rerank retrieval helpers with reranker utilities and exports.
+- MeshMind client wrappers for hybrid, vector, regex, and exact searches plus driver accessors.
+- Example script demonstrating triplet storage and diverse retrieval flows.
+- Pytest fixtures for encoder and memory factories alongside new retrieval tests that avoid external services.
+- Makefile targets for linting, formatting, type checks, and tests, plus a GitHub Actions workflow running lint and pytest.
+- README_LATEST.md capturing the current implementation and CHANGELOG.md for release notes.
+
+### Changed
+- Settings now surface `GRAPH_BACKEND`, Neo4j, and SQLite options while README/NEEDED_FOR_TESTING document the expanded setup.
+- README, README_LATEST, and NEW_README were consolidated so the promoted README reflects current behaviour.
+- PROJECT, PLAN, SOT, FINDINGS, DISCREPANCIES, ISSUES, RECOMMENDATIONS, and TODO were refreshed to capture new capabilities and
+  re-homed backlog items under a "Later" section.
+- Updated `SearchConfig` to support rerank models and refreshed MeshMind documentation across PROJECT, PLAN, SOT, FINDINGS,
+  DISCREPANCIES, RECOMMENDATIONS, ISSUES, TODO, and NEEDED_FOR_TESTING files.
+- Revised `meshmind.retrieval.search` to apply filters centrally, expose new search helpers, and integrate reranking.
+- Exposed graph driver access on MeshMind and refreshed retrieval-facing examples and docs.
+
+### Fixed
+- Example ingestion script now uses MeshMind APIs correctly and illustrates relationship persistence.
+- Tests rely on fixtures rather than deprecated hooks, improving portability across environments without Memgraph/OpenAI.
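
Review note on the `meshmind/retrieval/search.search` entry above: the changelog says reranked results are prepended and hybrid-sorted fallbacks appended so that relative positions returned by a reranker cannot drift. The function below is a minimal sketch of that merge under stated assumptions: the reranker returns integer positions into the filtered candidate list, and `merge_reranked` is a hypothetical helper name, not the project's code.

```python
from __future__ import annotations

from typing import Sequence, TypeVar

T = TypeVar("T")


def merge_reranked(
    candidates: Sequence[T],
    reranked_indices: Sequence[int],
    fallback: Sequence[T],
) -> list[T]:
    """Prepend reranked picks, then append the remaining items in fallback order.

    candidates: the filtered list, in the exact order sent to the reranker.
    reranked_indices: positions into ``candidates`` returned by the reranker.
    fallback: the same items in hybrid-sorted order.
    """
    picked: list[T] = []
    seen: set[int] = set()
    for idx in reranked_indices:
        # Ignore out-of-range or repeated positions instead of failing.
        if 0 <= idx < len(candidates) and idx not in seen:
            seen.add(idx)
            picked.append(candidates[idx])
    # Items the reranker did not pick keep their hybrid-sorted order.
    remainder = [item for item in fallback if item not in picked]
    return picked + remainder
```

Resolving indices against the list that was actually sent to the reranker, rather than against a re-sorted copy, is what keeps the positions meaningful.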
diff --git a/DISCREPANCIES.md b/DISCREPANCIES.md
@@ -0,0 +1,53 @@
+# README vs Implementation Discrepancies
+
+## Overview
+- The legacy README has been superseded by `README.md`, which now reflects the implemented feature set.
+- The current codebase delivers extraction, preprocessing, triplet persistence, CRUD helpers, and expanded retrieval strategies
+  that were missing when the README was written.
+- Remaining gaps primarily involve pushing retrieval workloads into the graph backend, exporting observability to external sinks, and automated infrastructure provisioning.
+
+## API Surface
+- ✅ `MeshMind` now exposes CRUD helpers (`create_memory`, `update_memory`, `delete_memory`, `list_memories`, triplet helpers)
+  that the README referenced implicitly.
+- ✅ Triplet storage routes through `store_triplets` and `MemoryManager.add_triplet`, calling `GraphDriver.upsert_edge`.
+- ⚠️ The README still references `register_entity`, `register_allowed_predicates`, and `add_predicate`; predicate management is
+  handled automatically but there is no public API matching those method names.
+- ⚠️ README snippets showing `mesh_mind.store_memory(memory)` should be updated to call `store_memories([memory])` or the new
+  CRUD helpers.
+
+## Retrieval Capabilities
+- ✅ Vector-only, regex, exact-match, hybrid, BM25, fuzzy, and optional LLM rerank searches exist in `meshmind.retrieval.search`
+  and are surfaced through `MeshMind` helpers.
+- ⚠️ README implies retrieval queries the graph directly. Search helpers now fetch candidates from the configured driver when no
+  list is supplied but still score results in Python; Memgraph/Neo4j-native search remains future work.
+- ⚠️ Named helpers like `search_facts` or `search_procedures` never existed; the README should reference the dispatcher plus
+  specialized helpers now available.
+
+## Data & Relationship Modeling
+- ✅ Predicates are persisted automatically when storing triplets and tracked in `PredicateRegistry`.
+- ⚠️ README examples that look up subjects/objects by name still do not match the implementation, which expects UUIDs. Add
+  documentation explaining how to resolve names to UUIDs before storing edges.
+- ⚠️ Consolidation and expiry run via Celery jobs; README narratives should highlight that heuristics require further validation even though persistence is now wired up.
+
+## Configuration & Dependencies
+- ✅ `README.md` and `ENVIRONMENT_NEEDS.md` document required environment variables, dependency guards, and setup steps.
+- ⚠️ README still omits optional tooling now required by the Makefile/CI (ruff, pyright, typeguard, toml-sort, yamllint);
+  highlight these prerequisites more prominently.
+- ✅ Python version support in `pyproject.toml` is now constrained to `>=3.11,<3.13`, matching the dependency landscape documented in the README.
+
+## Example Code Paths
+- ✅ Updated example scripts demonstrate extraction, triplet creation, and multiple retrieval strategies.
+- ⚠️ Legacy README code that instantiates custom Pydantic entities remains inaccurate; extraction returns `Memory` objects and
+  validates `entity_label` names only.
+- ⚠️ Search examples should be updated to show the new helper functions and optional rerank usage instead of nonexistent
+  `search_facts`/`search_procedures` calls.
+
+## Tooling & Operations
+- ✅ Makefile and CI workflows now exist, so the README's automation promises will hold once the README is refreshed.
+- ✅ Docker Compose now provisions Memgraph, Redis, and a Celery worker; README sections should highlight the workflow and
+  caveats for environments lacking container tooling.
+- ⚠️ Celery tasks still depend on optional infrastructure; README should clarify that heuristics and scheduling need production hardening even though persistence now works.
+
+## Documentation State
+- Continue promoting `README.md` as the authoritative guide and propagate updates to supporting docs
+  (`SOT.md`, `PLAN.md`, `ENVIRONMENT_NEEDS.md`, `docs/`).
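
Review note on the "Data & Relationship Modeling" item above: the document asks for guidance on resolving names to UUIDs before storing edges. One possible approach is sketched below. The record shape (`name` and `uuid` keys) and both helper names are assumptions for illustration and are not taken from the MeshMind API.

```python
from __future__ import annotations

from uuid import UUID


def build_name_index(records: list[dict[str, str]]) -> dict[str, UUID]:
    """Map each memory name to its UUID, rejecting duplicate names."""
    index: dict[str, UUID] = {}
    for record in records:
        name = record["name"]
        if name in index:
            # A name shared by two memories cannot be resolved safely.
            raise ValueError(f"ambiguous name: {name!r}")
        index[name] = UUID(record["uuid"])
    return index


def resolve_edge(subject: str, obj: str, index: dict[str, UUID]) -> tuple[UUID, UUID]:
    """Translate a (subject, object) name pair into the UUID pair an edge needs."""
    try:
        return index[subject], index[obj]
    except KeyError as exc:
        raise LookupError(f"no memory named {exc.args[0]!r}") from exc
```

Failing loudly on duplicate or unknown names keeps an edge from silently attaching to the wrong node.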
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,24 @@
+FROM python:3.11-slim
+
+ENV PIP_NO_CACHE_DIR=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1
+
+WORKDIR /app
+
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends \
+        build-essential \
+        cmake \
+        libssl-dev \
+        libkrb5-dev \
+        curl \
+        git \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY . /app
+
+RUN pip install uv \
+    && uv pip install --system -e ".[dev,docs,testing]"
+
+CMD ["bash"]