cpdata · cpdata · Oct 14, 2025
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,44 @@
+name: CI
+
+on:
+  push:
+    branches: ["main", "review", "review-1"]
+  pull_request:
+
+env:
+  PIP_DISABLE_PIP_VERSION_CHECK: "1"
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out
+        uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install toolchain
+        run: |
+          pip install uv
+          uv pip install --system -e .
+          uv pip install --system ruff pyright typeguard toml-sort yamllint
+      - name: Lint and format checks
+        run: make fmt-check
+
+  tests:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out
+        uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install dependencies
+        run: |
+          pip install uv
+          uv pip install --system -e .
+          uv pip install --system pytest
+      - name: Run pytest
+        run: make test
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,11 @@
+# Agent Instructions
+
+## Documentation Workflow
+- Update `CHANGELOG.md` with a new entry every time code changes are committed.
+- Maintain `README_LATEST.md` so it always reflects the current implementation; refresh it alongside major feature updates.
+- After each iteration, revise `ISSUES.md`, `SOT.md`, `PLAN.md`, `RECOMMENDATIONS.md`, and `TODO.md` to stay in sync with the codebase.
+
+## Style Guidelines
+- Use descriptive Markdown headings starting at level 1 for top-level documents.
+- Keep lines to 120 characters or fewer when practical.
+- Prefer bullet lists over inline comma-separated lists for enumerations.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,20 @@
+# Changelog
+
+## [Unreleased] - 2025-02-14
+### Added
+- Vector-only, regex, exact-match, and optional LLM rerank retrieval helpers with reranker utilities and exports.
+- MeshMind client wrappers for hybrid, vector, regex, and exact searches plus driver accessors.
+- Example script demonstrating triplet storage and diverse retrieval flows.
+- Pytest fixtures for encoder and memory factories alongside new retrieval tests that avoid external services.
+- Makefile targets for linting, formatting, type checks, and tests, plus a GitHub Actions workflow running lint and pytest.
+- README_LATEST.md capturing the current implementation and CHANGELOG.md for release notes.
+
+### Changed
+- Updated `SearchConfig` to support rerank models and refreshed MeshMind documentation across PROJECT, PLAN, SOT, FINDINGS,
+  DISCREPANCIES, RECOMMENDATIONS, ISSUES, TODO, and NEEDED_FOR_TESTING files.
+- Revised `meshmind.retrieval.search` to apply filters centrally, expose new search helpers, and integrate reranking.
+- Exposed graph driver access on MeshMind and refreshed retrieval-facing examples and docs.
+
+### Fixed
+- Example ingestion script now uses MeshMind APIs correctly and illustrates relationship persistence.
+- Tests rely on fixtures rather than deprecated hooks, improving portability across environments without Memgraph/OpenAI.
diff --git a/DISCREPANCIES.md b/DISCREPANCIES.md
@@ -0,0 +1,55 @@
+# README vs Implementation Discrepancies
+
+## Overview
+- The legacy README still promises a fully featured memory graph with multi-level APIs, relationship storage, and diverse
+  retrieval methods. Many of those features now exist, but the document remains outdated and should be replaced by
+  `README_LATEST.md`.
+- The current codebase delivers extraction, preprocessing, triplet persistence, CRUD helpers, and expanded retrieval strategies
+  that were missing when the README was written.
+- Remaining gaps primarily involve graph-backed retrieval, observability, and automated infrastructure provisioning.
+
+## API Surface
+- ✅ `MeshMind` now exposes CRUD helpers (`create_memory`, `update_memory`, `delete_memory`, `list_memories`, triplet helpers)
+  that the README referenced implicitly.
+- ✅ Triplet storage routes through `store_triplets` and `MemoryManager.add_triplet`, calling `GraphDriver.upsert_edge`.
+- ⚠️ The README still references `register_entity`, `register_allowed_predicates`, and `add_predicate`; predicate management is
+  handled automatically but there is no public API matching those method names.
+- ⚠️ README snippets showing `mesh_mind.store_memory(memory)` should be updated to call `store_memories([memory])` or the new
+  CRUD helpers.
+
+## Retrieval Capabilities
+- ✅ Vector-only, regex, exact-match, hybrid, BM25, fuzzy, and optional LLM rerank searches exist in `meshmind.retrieval.search`
+  and are surfaced through `MeshMind` helpers.
+- ⚠️ README implies retrieval queries the graph directly. Current search helpers operate on in-memory lists supplied by the
+  caller; Memgraph-backed retrieval remains future work.
+- ⚠️ Named helpers like `search_facts` or `search_procedures` never existed; the README should reference the dispatcher plus
+  specialized helpers now available.
+
+## Data & Relationship Modeling
+- ✅ Predicates are persisted automatically when storing triplets and tracked in `PredicateRegistry`.
+- ⚠️ README examples that look up subjects/objects by name still do not match the implementation, which expects UUIDs. Add
+  documentation explaining how to resolve names to UUIDs before storing edges.
+- ⚠️ Consolidation and expiry remain limited to Celery jobs; README narratives about integrated maintenance still overstate the
+  current persistence story.
+
+## Configuration & Dependencies
+- ✅ `README_LATEST.md` and `NEEDED_FOR_TESTING.md` document required environment variables, dependency guards, and setup steps.
+- ⚠️ The legacy README omits optional tooling now required by the Makefile/CI (ruff, pyright, typeguard, toml-sort, yamllint).
+- ⚠️ Python version support in `pyproject.toml` (3.13) still diverges from what many dependencies officially support; update the
+  documentation or relax the requirement.
+
+## Example Code Paths
+- ✅ Updated example scripts demonstrate extraction, triplet creation, and multiple retrieval strategies.
+- ⚠️ Legacy README code that instantiates custom Pydantic entities remains inaccurate; extraction returns `Memory` objects and
+  validates `entity_label` names only.
+- ⚠️ Search examples should be updated to show the new helper functions and optional rerank usage instead of nonexistent
+  `search_facts`/`search_procedures` calls.
+
+## Tooling & Operations
+- ✅ Makefile and CI workflows now exist, so the README's automation promises will hold once the README is refreshed.
+- ⚠️ Docker Compose still lacks service definitions for Memgraph/Redis; README setup sections should call this out explicitly.
+- ⚠️ Celery tasks remain best-effort shims; README should clarify that maintenance requires the optional infrastructure.
+
+## Documentation State
+- Promote `README_LATEST.md` as the authoritative guide, archive the legacy README, and ensure future updates propagate to
+  supporting docs (`SOT.md`, `PLAN.md`, `NEEDED_FOR_TESTING.md`).
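Review note on the Retrieval Capabilities and Data & Relationship Modeling items above: because the search helpers operate on in-memory lists supplied by the caller, and edges expect UUIDs rather than names, the behaviour could be sketched roughly as follows. This is an illustration only. `MemoryStub`, `regex_search`, `exact_search`, and `resolve_uuid` are hypothetical names chosen for this note and are not the signatures in `meshmind.retrieval.search`.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Sequence
from uuid import UUID, uuid4


@dataclass
class MemoryStub:
    """Hypothetical stand-in for the project's Memory model (assumption)."""

    name: str
    namespace: str = "default"
    uuid: UUID = field(default_factory=uuid4)


def _in_namespace(memory: MemoryStub, namespace: Optional[str]) -> bool:
    # A namespace of None means "do not filter".
    return namespace is None or memory.namespace == namespace


def regex_search(
    pattern: str, memories: Sequence[MemoryStub], namespace: Optional[str] = None
) -> List[MemoryStub]:
    """Return memories whose name matches the regex (case-insensitive)."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [m for m in memories if _in_namespace(m, namespace) and compiled.search(m.name)]


def exact_search(
    query: str, memories: Sequence[MemoryStub], namespace: Optional[str] = None
) -> List[MemoryStub]:
    """Return memories whose name equals the query exactly."""
    return [m for m in memories if _in_namespace(m, namespace) and m.name == query]


def resolve_uuid(
    name: str, memories: Sequence[MemoryStub], namespace: Optional[str] = None
) -> UUID:
    """Map a human-readable name to a UUID before storing an edge."""
    matches = exact_search(name, memories, namespace)
    if len(matches) != 1:
        raise LookupError(f"expected exactly one memory named {name!r}, found {len(matches)}")
    return matches[0].uuid
```

A caller would resolve both endpoints with `resolve_uuid` first and only then build the triplet, failing loudly when a name is missing or ambiguous instead of storing an edge against the wrong node.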
diff --git a/FINDINGS.md b/FINDINGS.md
@@ -0,0 +1,39 @@
+# Findings
+
+## General Observations
+- Core modules are now wired through the `MeshMind` client, including CRUD, triplet storage, and retrieval helpers. Remaining
+  integration work centers on graph-backed retrieval and maintenance persistence.
+- Optional dependencies are largely guarded behind lazy imports or factory functions, improving portability. Environments still
+  need to install tooling referenced by the Makefile and CI (ruff, pyright, typeguard, toml-sort, yamllint).
+- Documentation artifacts (`README_LATEST.md`, `SOT.md`, `NEEDED_FOR_TESTING.md`) stay current only if they are updated
+  with each iteration; the legacy README should be archived.
+
+## Dependency & Environment Notes
+- `MeshMind` defers Memgraph driver creation until persistence is required, enabling limited workflows without `pymgclient`.
+- Encoder registration occurs during bootstrap, but custom deployments must ensure compatible models are registered before
+  extraction or hybrid search.
+- The OpenAI embedding adapter still expects dictionary-like responses; adapting to SDK objects remains on the backlog.
+- Celery tasks initialize lazily, yet Redis/Memgraph services are still required at runtime. Docker Compose lacks concrete
+  service definitions.
+
+## Data Flow & Persistence
+- Triplet storage now persists relationships and tracks predicates automatically, closing an earlier data-loss gap.
+- Consolidation and compression utilities operate in memory; persistence of maintenance results is still pending.
+- Importance scoring remains a constant fallback; better heuristics should raise retrieval quality once implemented.
+
+## CLI & Tooling
+- CLI ingestion bootstraps encoders and entities automatically but still assumes Memgraph and OpenAI credentials are configured.
+- The Makefile introduces lint, format, type-check, and test targets, plus a Docker helper. The targets succeed only after
+  the external tooling is installed.
+- GitHub Actions now run formatting checks and pytest on push/PR, providing basic CI coverage.
+
+## Testing & Quality
+- Pytest suites rely on fixtures (`memory_factory`, `dummy_encoder`) to run without external services. Additional coverage is
+  needed for Celery workflows and graph-backed retrieval when implemented.
+- Type checking via `pyright` and runtime checks via `typeguard` are exposed in the Makefile; both tools must be installed
+  before `make typecheck` can run.
+
+## Documentation
+- `README_LATEST.md` supersedes the legacy README and documents setup, pipelines, retrieval, and tooling.
+- Supporting docs (`ISSUES.md`, `PLAN.md`, `RECOMMENDATIONS.md`, `SOT.md`) reflect the latest capabilities and highlight remaining
+  gaps, aiding onboarding and future planning.
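Review note on the deferred driver creation described under Dependency & Environment Notes: the pattern is presumably a lazy accessor that builds the driver on first use. A minimal sketch, assuming a caller-supplied factory, is shown below. `LazyDriver` and its members are hypothetical names for illustration and do not reflect the actual `MeshMind` internals.

```python
from typing import Any, Callable


class LazyDriver:
    """Build an expensive driver on first access instead of at construction time."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._driver: Any = None
        self._ready = False

    def get(self) -> Any:
        # The factory runs at most once; later calls reuse the cached driver.
        if not self._ready:
            self._driver = self._factory()
            self._ready = True
        return self._driver

    @property
    def initialized(self) -> bool:
        return self._ready
```

With this shape, the import of the graph client can live inside the factory, so workflows that never persist anything do not need `pymgclient` installed.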
diff --git a/ISSUES.md b/ISSUES.md
@@ -0,0 +1,30 @@
+# Issues Checklist
+
+## Blockers
+- [ ] MeshMind client fails without `mgclient`; introduce lazy driver initialization or documented in-memory fallback.
+- [ ] Register a default embedding encoder (OpenAI or sentence-transformers) during startup so extraction and hybrid search can run.
+- [ ] Update OpenAI integration to match the current SDK (Responses API payload, embeddings API response structure).
+- [ ] Replace eager `tiktoken` imports in `meshmind.core.utils` and `meshmind.pipeline.compress` with guarded, optional imports.
+- [ ] Align declared Python requirement with supported dependencies (currently set to Python 3.13 despite ecosystem gaps).
+
+## High Priority
+- [x] Implement relationship persistence (`GraphDriver.upsert_edge`) within the storage pipeline and expose triplet APIs.
+- [x] Restore high-level API methods promised in README (`register_entity`, predicate management, `add_memory`, `update_memory`, `delete_memory`).
+- [x] Ensure CLI ingestion registers entity models and embedding encoders or fails fast with actionable messaging.
+- [x] Provide configuration documentation and examples for Memgraph, Redis, and OpenAI environment variables.
+- [x] Add automated tests or smoke checks that run without external services (mock OpenAI, stub Memgraph driver).
+- [ ] Create real docker-compose services for Memgraph and Redis or remove the placeholder file.
+
+## Medium Priority
+- [ ] Persist results from consolidation and compression tasks back to the database (currently in-memory only).
+- [ ] Refine `Memory.importance` scoring to reflect actual ranking heuristics instead of a constant.
+- [x] Add vector, regex, and exact-match search helpers to match stated feature set or update documentation to demote them.
+- [ ] Harden Celery tasks to initialize dependencies lazily and log failures when the driver is unavailable. (In progress: lazy driver initialization added, persistence pending)
+- [ ] Reconcile tests that depend on `Memory.pre_init` and outdated OpenAI interfaces with the current implementation.
+- [x] Add linting, formatting, and type-checking tooling to improve code quality.
+
+## Low Priority / Nice to Have
+- [ ] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development.
+- [ ] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics.
+- [ ] Publish onboarding guides and troubleshooting FAQs for contributors.
+- [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring.
diff --git a/Makefile b/Makefile
@@ -1,17 +1,31 @@
-.PHONY: install lint fmt test docker
+.PHONY: install lint fmt fmt-check typecheck test check docker clean
 
 install:
 	pip install -e .
 
 lint:
-	ruff .
+	ruff check .
 
 fmt:
-	isort .
-	black .
+	ruff format .
+
+fmt-check:
+	ruff format --check .
+	ruff check .
+	toml-sort --check pyproject.toml
+	yamllint .github/workflows
+
+typecheck:
+	pyright
+	pytest --typeguard-packages=meshmind
 
 test:
 	pytest
 
+check: fmt-check lint typecheck test
+
+clean:
+	rm -rf .pytest_cache .ruff_cache
+
 docker:
-	docker-compose up
+	docker compose up
diff --git a/NEEDED_FOR_TESTING.md b/NEEDED_FOR_TESTING.md
@@ -0,0 +1,45 @@
+# Needed for Testing MeshMind
+
+## Python Runtime
+- Python 3.11 or 3.12 is recommended; project metadata currently states 3.13+, but several dependencies (`pymgclient`,
+  `sentence-transformers`) do not yet publish wheels for 3.13.
+- Use a virtual environment (`uv`, `venv`, or `conda`) to isolate dependencies.
+
+## Python Dependencies
+- Install the project editable: `pip install -e .` from the repository root.
+- Required packages declared in `pyproject.toml` include `openai`, `pydantic`, `pydantic-settings`, `numpy`, `scikit-learn`,
+  `rapidfuzz`, `python-dotenv`, `celery[redis]`, `sentence-transformers`, `tiktoken`, and `pymgclient`.
+- Development tooling referenced by the Makefile and CI:
+  - `ruff` for linting and formatting.
+  - `pyright` for static type checks.
+  - `typeguard` for runtime type enforcement (`pytest --typeguard-packages=meshmind`).
+  - `toml-sort` and `yamllint` for configuration validation.
+- Optional helpers for local workflows: `pytest-cov`, `pre-commit`, `httpx` (for future service interfaces).
+
+## External Services and Infrastructure
+- **Memgraph** (or compatible Bolt graph database) reachable via `MEMGRAPH_URI` with credentials exported in
+  `MEMGRAPH_USERNAME`/`MEMGRAPH_PASSWORD`.
+- **Redis** for Celery task queues, referenced through `REDIS_URL`.
+- **OpenAI API access** for extraction, embeddings, and LLM reranking (`OPENAI_API_KEY`).
+- Recommended: Docker Compose or equivalent orchestration to run Memgraph and Redis together when developing locally.
+
+## Environment Variables
+- `OPENAI_API_KEY` — required for extraction, embeddings, and reranking.
+- `MEMGRAPH_URI` — e.g., `bolt://localhost:7687`.
+- `MEMGRAPH_USERNAME` and `MEMGRAPH_PASSWORD` — credentials for the graph database.
+- `REDIS_URL` — optional Redis connection URI (defaults to `redis://localhost:6379/0`).
+- `EMBEDDING_MODEL` — encoder key registered with `EncoderRegistry` (defaults to `text-embedding-3-small`).
+- Optional overrides for Celery broker/backend if using hosted services.
+
+## Local Configuration Steps
+- Ensure an embedding encoder is registered before extraction or hybrid search. The bootstrap utilities invoked by the CLI and
+  `MeshMind` constructor handle this, but custom scripts must call `bootstrap_encoders()`.
+- Seed demo data as needed using the `examples/extract_preprocess_store_example.py` script after configuring environment
+  variables.
+- Create a `.env` file storing the environment variables above for consistent local configuration.
+
+## Current Blockers in This Environment
+- Neo4j/Memgraph binaries and Docker are unavailable in this workspace, preventing local graph provisioning.
+- Redis cannot be installed without container or host-level access; Celery tasks remain untestable locally until a remote
+  instance is provisioned.
+- External network restrictions may limit installation of proprietary packages or access to OpenAI endpoints.
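Review note on the Environment Variables section above: the documented variables and defaults can be gathered in one place, for example as sketched below. The variable names and the two defaults come from the document. The `Settings` dataclass, `load_settings`, and `missing_required` are hypothetical, and treating `OPENAI_API_KEY` and `MEMGRAPH_URI` as the required pair is an assumption made for this sketch.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    memgraph_uri: Optional[str]
    memgraph_username: Optional[str]
    memgraph_password: Optional[str]
    redis_url: str
    embedding_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY"),
        memgraph_uri=env.get("MEMGRAPH_URI"),
        memgraph_username=env.get("MEMGRAPH_USERNAME"),
        memgraph_password=env.get("MEMGRAPH_PASSWORD"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
    )


def missing_required(settings: Settings) -> List[str]:
    """List variables assumed necessary for full workflows that are unset."""
    required = {
        "OPENAI_API_KEY": settings.openai_api_key,
        "MEMGRAPH_URI": settings.memgraph_uri,
    }
    return [name for name, value in required.items() if not value]
```

Calling `missing_required(load_settings())` at startup gives a single actionable message listing what to export, which fits the fail-fast behaviour the CLI notes ask for.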