20 changes: 20 additions & 0 deletions DISCREPANCIES.md
@@ -0,0 +1,20 @@
# README vs. Codebase Discrepancies

## API Surface
- **Missing registration APIs** – The README walkthrough relies on `MeshMind.register_entity`, `register_allowed_predicates`, `add_predicate`, `store_memory`, `add_memory`, and `add_triplet`. None of these methods exist on `meshmind.client.MeshMind`, nor do they exist elsewhere in the package.
- **No triplet storage path** – README describes storing graph triplets (node–edge–node). The shipped pipeline never touches `Triplet` and only calls `GraphDriver.upsert_entity`, so edges/predicates are never persisted.
- **Entity modelling mismatch** – README expects custom Pydantic models (e.g., `Person`) to be registered and enforced. Extraction currently hardcodes the `Memory` schema and only validates `entity_label` names against the provided classes’ `__name__`, without instantiating those models.

## Feature Claims
- **CRUD breadth** – README lists add/search/update/delete as supported capabilities. Only `meshmind.api.memory_manager.MemoryManager` exposes update/delete helpers, and they are not surfaced via the documented high-level API.
- **Retrieval methods** – README promises embedding vector search, BM25, LLM reranking, fuzzy search, exact comparison, regex search, filters, and hybrid search. The code offers BM25, fuzzy, filters, and a simple hybrid scorer; it lacks standalone vector search, regex search, exact match utilities, and any LLM-based reranking.
- **Memory preprocessing** – README references importance ranking, deduplication, consolidation, compression, and expiry. Implementations exist, but importance ranking is a fixed default of `1.0`, consolidation only keeps the highest importance duplicate, and expiry/compression are isolated Celery tasks that require additional wiring.

## Operational Expectations
- **Dependency assumptions** – README does not mention that a Memgraph instance, `mgclient`, `tiktoken`, and manual encoder registration are required. In practice, `MeshMind()` raises immediately when `mgclient` is absent, and `extract_memories` fails if the default embedding encoder is not manually registered.
- **Configuration** – README’s quickstart omits mandatory environment variables (OpenAI API key, Memgraph credentials) that `meshmind.core.config.settings` expects.
- **Testing/setup instructions** – README suggests features (e.g., graph relationships, rich retrieval) that the tests do not cover, and the declared Python requirement (`>=3.13` in `pyproject.toml`) conflicts with README/Contributing guidance (Python 3.10+).

## Example Code
- The README’s example code would fail: `MeshMind` lacks the invoked methods, extraction rejects the `Person` label unless the `Person` class is passed in `entity_types` (and even then the model is never instantiated), and storing custom metadata or edges is unsupported.
- The low-level `add_triplet` example assumes the driver can create relationship edges given subject/object names; there is no such helper in the codebase, and `MemgraphDriver.upsert_edge` expects UUIDs, not arbitrary entity names.
31 changes: 31 additions & 0 deletions FINDINGS.md
@@ -0,0 +1,31 @@
# Additional Findings

## Dependency Handling
- `meshmind.client.MeshMind` instantiates `MemgraphDriver` on every construction. When `mgclient` is missing (default in many development environments), the import raises immediately, preventing any other functionality (including retrieval helpers) from being used.
- `meshmind.pipeline.compress` sets `tiktoken = None` when the import fails, but still calls `tiktoken.get_encoding`, raising `AttributeError`. The helper in `meshmind.pipeline.preprocess.compress` catches `ImportError`, not `AttributeError`, so the failure propagates (see the guarded-import sketch after this list).
- `meshmind.core.utils` imports `tiktoken` unconditionally at import time, so simply importing `meshmind.core` without the package installed triggers a crash.
- The OpenAI SDK usage is inconsistent: `meshmind.core.embeddings.OpenAIEmbeddingEncoder.encode` expects dictionary-style responses, while the modern SDK returns typed objects.
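
A minimal guarded-import sketch for the compression bug above — the helper name and the skip-on-missing behaviour are assumptions about a possible fix, not the current `meshmind.pipeline.compress` implementation:

```python
# Guarded optional import: keep tiktoken as None instead of crashing later.
try:
    import tiktoken
except ImportError:  # tiktoken is optional in many development environments
    tiktoken = None


def get_token_encoding(name: str = "cl100k_base"):
    """Return a tiktoken encoding, or None so callers can skip compression."""
    if tiktoken is None:
        return None
    return tiktoken.get_encoding(name)
```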

## Encoder Registry
- No encoders are registered by default. `meshmind.pipeline.extract.extract_memories` calls `EncoderRegistry.get(settings.EMBEDDING_MODEL)`; unless the caller has registered a matching encoder beforehand, the call raises `KeyError`.
- `MeshMind` does not register an encoder automatically when instantiated with the default OpenAI client, so the quickstart path fails without additional setup.

## Graph Persistence
- `meshmind.pipeline.store.store_memories` only calls `graph_driver.upsert_entity`. There is no mechanism to create edges, update predicates, or link memories. `Triplet` from `meshmind.core.types` is unused throughout the codebase.
- `meshmind.db.memgraph_driver.MemgraphDriver.upsert_edge` expects UUIDs for subject/object, but no public API surfaces those identifiers or manages the relationship lifecycle.

## Retrieval & Search
- `meshmind.retrieval.hybrid.hybrid_search` assumes that every memory already has an embedding and that an encoder is registered under `config.encoder`. It silently assigns a zero score to any memory missing an embedding, which may bury relevant results.
- There is no helper to pull memories out of Memgraph for retrieval; all search functions expect in-memory lists supplied by the caller.

## CLI & Tasks
- `meshmind.cli.ingest.ingest_command` registers `entity_types=[Memory]`. Supplying a custom Pydantic model has no effect because extraction only validates `entity_label` names; it never instantiates the provided models.
- Scheduled tasks import `MemgraphDriver` at module load and swallow any exception, leaving `manager = None`. Task invocations then return early without logging, making failures hard to diagnose.

## Testing & Tooling
- Test modules exercise the production code through interfaces that do not exist in the current library versions (`openai.responses.create`, `Memory.pre_init`). Running `pytest` without heavy monkeypatching will fail.
- `pyproject.toml` pins `requires-python = ">=3.13"`, yet the documentation still references Python 3.10 and Poetry. The tools invoked by the `Makefile` (ruff, isort, black) are not declared as project dependencies.

## Documentation Gaps
- Runtime prerequisites (Memgraph, Redis, mgclient, encoder registration) are not described in the existing README.
- There is no architecture guide explaining how pipeline modules, the CLI, Celery tasks, and retrieval helpers relate, making it hard for new contributors to navigate the codebase.
16 changes: 16 additions & 0 deletions ISSUES.md
@@ -0,0 +1,16 @@
# Issue Backlog

- [ ] Restore the high-level client surface promised in the README (entity/predicate registration, `add_memory`, `store_memory`, `add_triplet`).
- [ ] Persist graph relationships: extend the storage pipeline to write edges/triplets instead of only upserting nodes.
- [ ] Provide a reliable vector search entrypoint and implement regex/exact-match/LLM rerank retrieval options as documented.
- [ ] Register a default embedding encoder (e.g., OpenAI) on startup so extraction works out of the box.
- [ ] Make `MeshMind` initialization resilient when `mgclient` or Memgraph are unavailable (lazy driver creation or optional in-memory driver).
- [ ] Fix `meshmind.pipeline.compress` to handle missing `tiktoken` gracefully (skip compression or supply a fallback encoder).
- [ ] Guard `meshmind.core.utils` against importing `tiktoken` at module import time to avoid `ModuleNotFoundError`.
- [ ] Update `OpenAIEmbeddingEncoder.encode` to use the modern OpenAI SDK response objects (access `.data`, not dictionary keys) and add error handling/tests.
- [ ] Rework Celery task initialization so `MemgraphDriver` is created lazily within tasks instead of module import time side effects.
- [ ] Ensure tests are executable: remove assumptions about non-existent hooks (`Memory.pre_init`), patch the OpenAI client correctly, and supply dependency fakes.
- [ ] Align Python version and tooling guidance across `pyproject.toml`, README, and CONTRIBUTING (e.g., Python >=3.13 vs. 3.10+, missing `ruff/isort/black` dependencies).
- [ ] Document runtime dependencies explicitly (Memgraph, Redis, OpenAI API key, mgclient) and provide setup scripts or docker-compose services.
- [ ] Build relationship/query abstractions on top of `MemgraphDriver` (e.g., `get_memories`, `search` APIs) instead of expecting consumers to craft Cypher manually.
- [ ] Add integration/e2e tests covering the ingestion pipeline end-to-end with a test Memgraph instance or an in-memory driver substitute.
126 changes: 126 additions & 0 deletions NEW_README.md
@@ -0,0 +1,126 @@
# MeshMind (Current State)

MeshMind is an experimental memory toolkit that combines LLM-assisted extraction with a property-graph backend. The present implementation focuses on turning raw text into `Memory` records, applying lightweight preprocessing, and storing them through a Memgraph-compatible driver. Retrieval helpers operate on in-memory collections of `Memory` objects and support lexical, fuzzy, and hybrid scoring.

> **Note**
> The original README described a much richer API (entity/predicate registration, triplet storage, advanced retrieval). Those features are **not** implemented yet. This document reflects the functionality that exists today.

## Features

- `Memory` data model (`meshmind.core.types.Memory`) captures namespace, entity label, metadata, optional embeddings, timestamps, and TTL/importance metadata.
- `MeshMind` client (`meshmind.client.MeshMind`) wires together an OpenAI client, an embedding model name, and a Memgraph driver, exposing helpers for:
- `extract_memories` – LLM-based extraction that validates entity labels and populates embeddings through the encoder registry.
- `deduplicate`, `score_importance`, `compress` – preprocessing helpers.
- `store_memories` – persists each memory by calling `GraphDriver.upsert_entity`.
- Pipeline modules under `meshmind.pipeline` implement the extraction, preprocessing, compression, expiry, consolidation, and storage steps.
- Retrieval utilities under `meshmind.retrieval` provide TF-IDF (BM25-style) search, RapidFuzz fuzzy matching, metadata/namespace/entity-label filters, and a hybrid scorer that blends cosine similarity with lexical scores.
- `meshmind.api.memory_manager.MemoryManager` offers CRUD-style helpers over an injected graph driver (add/update/delete/list).
- `meshmind.tasks.scheduled` defines Celery tasks (optional) that invoke expiry, consolidation, and compression routines on a schedule.
- A CLI entry point (`python -m meshmind` or `meshmind` after installation) exposes an `ingest` subcommand for end-to-end extraction from local files/directories.

## Requirements

- Python 3.10+ per the documentation, although `pyproject.toml` currently pins `requires-python = ">=3.13"`; a 3.13 interpreter satisfies both until the metadata is reconciled.
- A running [Memgraph](https://memgraph.com/) instance accessible via Bolt, plus the `mgclient` Python driver.
- An OpenAI API key for both extraction and embedding (set `OPENAI_API_KEY`).
- Optional services: Redis (if Celery tasks are used).
- Python dependencies listed in `pyproject.toml` (`openai`, `pydantic`, `rapidfuzz`, `scikit-learn`, `numpy`, `celery[redis]`, `sentence-transformers`, `tiktoken`, `pymgclient`, etc.).

## Installation

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

Ensure Memgraph is running and environment variables are set:

```bash
export OPENAI_API_KEY=... # required by openai.OpenAI()
export MEMGRAPH_URI=bolt://localhost:7687
export MEMGRAPH_USERNAME=...
export MEMGRAPH_PASSWORD=...
```

Before calling extraction, register an embedding encoder that matches `settings.EMBEDDING_MODEL` (default `text-embedding-3-small`). For example:

```python
from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder
EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder())
```

## Quickstart

```python
from meshmind.client import MeshMind
from meshmind.core.types import Memory
from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder

# Register the embedding encoder expected by the pipeline
EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder())

mm = MeshMind()
texts = [
    "Jane Doe is a senior software engineer based in Berlin.",
    "John Doe manages the infrastructure team in Berlin.",
]
memories = mm.extract_memories(
    instructions="Extract each distinct person as a Memory.",
    namespace="Company Directory",
    entity_types=[Memory],  # entity labels are validated against class names
    content=texts,
)
memories = mm.deduplicate(memories)
memories = mm.score_importance(memories)
memories = mm.compress(memories)
mm.store_memories(memories)
```

## Retrieval Helpers

The retrieval utilities operate on lists of `Memory` objects that are already loaded into Python (for example via `MemoryManager.list_memories`).

```python
from meshmind.retrieval.search import search, search_bm25, search_fuzzy
from meshmind.core.types import Memory, SearchConfig
from meshmind.core.embeddings import EncoderRegistry

# Register an encoder for hybrid/vector scoring
class DummyEncoder:
    def encode(self, texts):
        return [[len(t)] for t in texts]

EncoderRegistry.register("dummy", DummyEncoder())

memories = [
    Memory(namespace="Company", name="Jane Doe", entity_label="Memory", embedding=[7.0]),
    Memory(namespace="Company", name="John Doe", entity_label="Memory", embedding=[7.0]),
]
config = SearchConfig(encoder="dummy", top_k=5, hybrid_weights=(0.5, 0.5))
results = search("Jane", memories, namespace="Company", config=config)
```
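
When a graph driver is available, the memory list can also be loaded rather than constructed by hand. A sketch continuing the snippet above — `list_memories` is the helper mentioned earlier, while the `MemoryManager` constructor signature is an assumption:

```python
from meshmind.api.memory_manager import MemoryManager
from meshmind.db.memgraph_driver import MemgraphDriver

# Assumption: MemoryManager accepts the graph driver it should operate on.
manager = MemoryManager(MemgraphDriver())
memories = manager.list_memories()
results = search("Jane", memories, namespace="Company", config=config)
```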

## CLI Ingest

```bash
meshmind ingest \
--namespace "Company" \
--instructions "Extract personnel facts as Memory objects." \
path/to/files
```

The command reads all text files, extracts memories, runs the preprocessing pipeline, and stores the results using the configured Memgraph connection.

## Maintenance Tasks (Optional)

If Celery and Redis are available, import `meshmind.tasks.scheduled` to register three periodic tasks: `expire_task`, `consolidate_task`, and `compress_task`. Each task fetches memories through `MemoryManager` and applies the corresponding pipeline helper.
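
A possible beat wiring is sketched below; it assumes `meshmind.tasks.scheduled` exposes a module-level Celery app named `app` and that the tasks are registered under their dotted module paths — neither is guaranteed by the current code:

```python
from celery.schedules import crontab

from meshmind.tasks.scheduled import app  # assumption: module-level Celery app

app.conf.beat_schedule = {
    "expire-memories": {
        "task": "meshmind.tasks.scheduled.expire_task",  # assumed task name
        "schedule": crontab(hour=3, minute=0),
    },
    "consolidate-memories": {
        "task": "meshmind.tasks.scheduled.consolidate_task",
        "schedule": crontab(hour=3, minute=30),
    },
    "compress-memories": {
        "task": "meshmind.tasks.scheduled.compress_task",
        "schedule": crontab(hour=4, minute=0),
    },
}
```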

## Limitations & Next Steps

- Relationship/edge storage is not implemented; only nodes are persisted.
- Custom entity models are not instantiated during extraction—only `entity_label` names are validated.
- Advanced retrieval techniques (regex search, LLM reranking, external vector databases) are not implemented yet.
- Several modules assume optional dependencies (`mgclient`, `tiktoken`, OpenAI SDK) are installed and configured.

Refer to `PROJECT.md`, `ISSUES.md`, and `PLAN.md` for the detailed roadmap.
27 changes: 27 additions & 0 deletions PLAN.md
@@ -0,0 +1,27 @@
# Next Steps Plan

## Phase 1 – Stabilize the Existing Surface
1. **Adopt `NEW_README.md`** as the public README and archive the legacy version so expectations match reality.
2. **Fix critical dependency traps** (a sketch of the SDK `encode` fix follows this list):
- Wrap `tiktoken` imports in `core/utils.py` and `pipeline/compress.py` with guards.
- Update `OpenAIEmbeddingEncoder` to use `.data` from the modern SDK and add tests.
- Lazily instantiate `MemgraphDriver` (e.g., in `store_memories` or via a factory) to avoid hard crashes when mgclient is missing.
3. **Auto-register embeddings**: when MeshMind starts, register an `OpenAIEmbeddingEncoder` using `settings.EMBEDDING_MODEL` so extraction works immediately.
4. **Repair the test suite**: modernize mocks for the OpenAI client, remove references to `Memory.pre_init`, and add coverage for the dependency fallbacks above.
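
For item 2, a minimal sketch of the `encode` fix: only the `client.embeddings.create(...)` call and the `.data[...].embedding` access reflect the real OpenAI SDK, while the class shape and constructor are assumptions (the real class lives in `meshmind.core.embeddings`).

```python
from openai import OpenAI


class OpenAIEmbeddingEncoder:
    """Sketch only — not the shipped implementation."""

    def __init__(self, model: str = "text-embedding-3-small", client: OpenAI | None = None):
        self.model = model
        self.client = client or OpenAI()

    def encode(self, texts: list[str]) -> list[list[float]]:
        # The modern SDK returns typed objects: response.data is a list of Embedding objects.
        response = self.client.embeddings.create(model=self.model, input=texts)
        return [item.embedding for item in response.data]
```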

## Phase 2 – Deliver Promised Graph Features
5. **Implement entity/predicate registration APIs** on `MeshMind`, backed by `meshmind.models.registry`. Persist this metadata in Memgraph or an internal registry.
6. **Add triplet storage**: extend the pipeline to transform memory metadata into node/edge upserts (subject–predicate–object) and expose `add_triplet` / `add_memory` helpers (a rough shape follows this list).
7. **Provide graph retrieval helpers**: implement functions to fetch memories by namespace, list predicates, and query neighbors directly via the driver.
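
A rough shape for item 6. Everything below is an assumption about a future API — the `Triplet` field names, the `upsert_entity` return value, the argument order of `upsert_edge`, and the `add_triplet` signature are illustrative; only the fact that `upsert_edge` expects UUIDs comes from the current code:

```python
from uuid import UUID

from meshmind.core.types import Triplet  # defined today but currently unused
from meshmind.db.memgraph_driver import MemgraphDriver


def add_triplet(driver: MemgraphDriver, triplet: Triplet) -> None:
    """Hypothetical helper: persist subject -[predicate]-> object as two nodes plus an edge."""
    # Assumption: upsert_entity returns the UUID of the created/updated node.
    subject_id: UUID = driver.upsert_entity(triplet.subject)
    object_id: UUID = driver.upsert_entity(triplet.object)
    # upsert_edge expects UUIDs for subject/object (see FINDINGS.md); argument order assumed.
    driver.upsert_edge(subject_id, triplet.predicate, object_id)
```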

## Phase 3 – Enhance Retrieval & Automation
8. **Expand retrieval modes**: add vector-only search, regex/exact match filters, and an optional LLM reranker stage configurable through `SearchConfig`.
9. **Integrate retrieval with storage**: allow `MemoryManager` or the driver to return ranked results instead of requiring callers to preload all memories.
10. **Strengthen maintenance workflows**: initialize Celery tasks lazily with logging, ensure expiry/consolidation/compression can operate when Memgraph is down (e.g., retry/backoff); see the sketch below.
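
For item 10, a sketch of per-invocation driver creation with logging and retries; `shared_task` and `self.retry` are standard Celery, while the no-argument `MemgraphDriver()` constructor and the retry settings are assumptions:

```python
import logging

from celery import shared_task

from meshmind.db.memgraph_driver import MemgraphDriver

logger = logging.getLogger(__name__)


@shared_task(bind=True, max_retries=3, default_retry_delay=60)
def expire_task(self):
    """Create the driver per invocation so an import-time failure cannot leave it as None."""
    try:
        driver = MemgraphDriver()  # assumption: no-arg constructor reads settings
    except Exception as exc:
        logger.exception("Memgraph unavailable; retrying expire_task")
        raise self.retry(exc=exc)
    # ...fetch memories via MemoryManager(driver) and run the expiry pipeline...
```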

## Phase 4 – Developer Experience & Observability
11. **Align tooling**: reconcile Python version requirements, add dev dependencies (`ruff`, `isort`, `black`) to a `dev` extra, and wire CI to run lint/tests.
12. **Document architecture**: keep `SOT.md` updated, add diagrams or sequence charts, and document configuration/operations in the README.
13. **Add logging and metrics**: instrument CLI ingestion, pipeline stages, and graph operations for debugging and future monitoring.

Revisit this plan after Phase 1 to adjust scope based on effort estimates and stakeholder priorities.