20 changes: 20 additions & 0 deletions DISCREPANCIES.md
@@ -0,0 +1,20 @@
# README vs. Codebase Discrepancies

## API Surface
- **Missing registration APIs** – The README walkthrough relies on `MeshMind.register_entity`, `register_allowed_predicates`, `add_predicate`, `store_memory`, `add_memory`, and `add_triplet`. None of these methods exist on `meshmind.client.MeshMind`, nor do they exist elsewhere in the package.
- **No triplet storage path** – README describes storing graph triplets (node–edge–node). The shipped pipeline never touches `Triplet` and only calls `GraphDriver.upsert_entity`, so edges/predicates are never persisted.
- **Entity modelling mismatch** – README expects custom Pydantic models (e.g., `Person`) to be registered and enforced. Extraction currently hardcodes the `Memory` schema and only validates `entity_label` names against the provided classes’ `__name__`, without instantiating those models.

## Feature Claims
- **CRUD breadth** – README lists add/search/update/delete as supported capabilities. Only `meshmind.api.memory_manager.MemoryManager` exposes update/delete helpers, and they are not surfaced via the documented high-level API.
- **Retrieval methods** – README promises embedding vector search, BM25, LLM reranking, fuzzy search, exact comparison, regex search, filters, and hybrid search. The code offers BM25, fuzzy, filters, and a simple hybrid scorer; it lacks standalone vector search, regex search, exact match utilities, and any LLM-based reranking.
- **Memory preprocessing** – README references importance ranking, deduplication, consolidation, compression, and expiry. Implementations exist, but importance ranking is a fixed default of `1.0`, consolidation only keeps the highest importance duplicate, and expiry/compression are isolated Celery tasks that require additional wiring.

## Operational Expectations
- **Dependency assumptions** – README does not mention that a Memgraph instance, `mgclient`, `tiktoken`, and manual encoder registration are required. In practice, `MeshMind()` raises immediately when `mgclient` is absent, and `extract_memories` fails if the default embedding encoder is not manually registered.
- **Configuration** – README’s quickstart omits mandatory environment variables (OpenAI API key, Memgraph credentials) that `meshmind.core.config.settings` expects.
- **Testing/setup instructions** – README suggests features (e.g., graph relationships, rich retrieval) that the tests do not cover, and the declared Python requirement (`>=3.13` in `pyproject.toml`) conflicts with README/Contributing guidance (Python 3.10+).

## Example Code
- The README’s example code would fail: `MeshMind` lacks the invoked methods, extraction rejects the `Person` label unless the `Person` class is passed in `entity_types` (and even then the model is never instantiated), and storing custom metadata or edges is unsupported.
- The low-level `add_triplet` example assumes the driver can create relationship edges given subject/object names; there is no such helper in the codebase, and `MemgraphDriver.upsert_edge` expects UUIDs, not arbitrary entity names.
31 changes: 31 additions & 0 deletions FINDINGS.md
@@ -0,0 +1,31 @@
# Additional Findings

## Dependency Handling
- `meshmind.client.MeshMind` instantiates `MemgraphDriver` on every construction. When `mgclient` is missing (default in many development environments), the import raises immediately, preventing any other functionality (including retrieval helpers) from being used.
- `meshmind.pipeline.compress` sets `tiktoken = None` when the import fails, but still calls `tiktoken.get_encoding`, raising `AttributeError`. The helper in `meshmind.pipeline.preprocess.compress` catches `ImportError`, not `AttributeError`, so the failure propagates (see the guarded-import sketch after this list).
- `meshmind.core.utils` imports `tiktoken` unconditionally at import time, so simply importing `meshmind.core` without the package installed triggers a crash.
- The OpenAI SDK usage is inconsistent: `meshmind.core.embeddings.OpenAIEmbeddingEncoder.encode` expects dictionary-style responses, while the modern SDK returns typed objects.
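
A minimal guarded-import sketch for the compression bug above — the helper name and the skip-on-missing behaviour are assumptions about a possible fix, not the current `meshmind.pipeline.compress` implementation:

```python
# Guarded optional import: keep tiktoken as None instead of crashing later.
try:
    import tiktoken
except ImportError:  # tiktoken is optional in many development environments
    tiktoken = None


def get_token_encoding(name: str = "cl100k_base"):
    """Return a tiktoken encoding, or None so callers can skip compression."""
    if tiktoken is None:
        return None
    return tiktoken.get_encoding(name)
```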

## Encoder Registry
- No encoders are registered by default. `meshmind.pipeline.extract.extract_memories` calls `EncoderRegistry.get(settings.EMBEDDING_MODEL)`; unless the caller has registered a matching encoder beforehand, the call raises `KeyError`.
- `MeshMind` does not register an encoder automatically when instantiated with the default OpenAI client, so the quickstart path fails without additional setup.

## Graph Persistence
- `meshmind.pipeline.store.store_memories` only calls `graph_driver.upsert_entity`. There is no mechanism to create edges, update predicates, or link memories. `Triplet` from `meshmind.core.types` is unused throughout the codebase.
- `meshmind.db.memgraph_driver.MemgraphDriver.upsert_edge` expects UUIDs for subject/object, but no public API surfaces those identifiers or manages the relationship lifecycle.

## Retrieval & Search
- `meshmind.retrieval.hybrid.hybrid_search` assumes that every memory already has an embedding and that an encoder is registered under `config.encoder`. It silently assigns a zero score to any memory missing an embedding, which may bury relevant results.
- There is no helper to pull memories out of Memgraph for retrieval; all search functions expect in-memory lists supplied by the caller.

## CLI & Tasks
- `meshmind.cli.ingest.ingest_command` registers `entity_types=[Memory]`. Supplying a custom Pydantic model has no effect because extraction only validates `entity_label` names; it never instantiates the provided models.
- Scheduled tasks import `MemgraphDriver` at module load and swallow any exception, leaving `manager = None`. Task invocations then return early without logging, making failures hard to diagnose.

## Testing & Tooling
- Test modules exercise the production code through interfaces that do not exist in the current library versions (`openai.responses.create`, `Memory.pre_init`). Running `pytest` without heavy monkeypatching will fail.
- `pyproject.toml` pins `requires-python = ">=3.13"`, yet the documentation still references Python 3.10 and Poetry. The tools invoked by the `Makefile` (ruff, isort, black) are not declared as project dependencies.

## Documentation Gaps
- Runtime prerequisites (Memgraph, Redis, mgclient, encoder registration) are not described in the existing README.
- There is no architecture guide explaining how pipeline modules, the CLI, Celery tasks, and retrieval helpers relate, making it hard for new contributors to navigate the codebase.
16 changes: 16 additions & 0 deletions ISSUES.md
@@ -0,0 +1,16 @@
# Issue Backlog

- [ ] Restore the high-level client surface promised in the README (entity/predicate registration, `add_memory`, `store_memory`, `add_triplet`).
- [ ] Persist graph relationships: extend the storage pipeline to write edges/triplets instead of only upserting nodes.
- [ ] Provide a reliable vector search entrypoint and implement regex/exact-match/LLM rerank retrieval options as documented.
- [ ] Register a default embedding encoder (e.g., OpenAI) on startup so extraction works out of the box.
- [ ] Make `MeshMind` initialization resilient when `mgclient` or Memgraph are unavailable (lazy driver creation or optional in-memory driver).
- [ ] Fix `meshmind.pipeline.compress` to handle missing `tiktoken` gracefully (skip compression or supply a fallback encoder).
- [ ] Guard `meshmind.core.utils` against importing `tiktoken` at module import time to avoid `ModuleNotFoundError`.
- [ ] Update `OpenAIEmbeddingEncoder.encode` to use the modern OpenAI SDK response objects (access `.data`, not dictionary keys) and add error handling/tests.
- [ ] Rework Celery task initialization so `MemgraphDriver` is created lazily within tasks instead of module import time side effects.
- [ ] Ensure tests are executable: remove assumptions about non-existent hooks (`Memory.pre_init`), patch the OpenAI client correctly, and supply dependency fakes.
- [ ] Align Python version and tooling guidance across `pyproject.toml`, README, and CONTRIBUTING (e.g., Python >=3.13 vs. 3.10+, missing `ruff/isort/black` dependencies).
- [ ] Document runtime dependencies explicitly (Memgraph, Redis, OpenAI API key, mgclient) and provide setup scripts or docker-compose services.
- [ ] Build relationship/query abstractions on top of `MemgraphDriver` (e.g., `get_memories`, `search` APIs) instead of expecting consumers to craft Cypher manually.
- [ ] Add integration/e2e tests covering the ingestion pipeline end-to-end with a test Memgraph instance or an in-memory driver substitute.
126 changes: 126 additions & 0 deletions NEW_README.md
@@ -0,0 +1,126 @@
# MeshMind (Current State)

MeshMind is an experimental memory toolkit that combines LLM-assisted extraction with a property-graph backend. The present implementation focuses on turning raw text into `Memory` records, applying lightweight preprocessing, and storing them through a Memgraph-compatible driver. Retrieval helpers operate on in-memory collections of `Memory` objects and support lexical, fuzzy, and hybrid scoring.

> **Note**
> The original README described a much richer API (entity/predicate registration, triplet storage, advanced retrieval). Those features are **not** implemented yet. This document reflects the functionality that exists today.

## Features

- `Memory` data model (`meshmind.core.types.Memory`) captures namespace, entity label, metadata, optional embeddings, timestamps, and TTL/importance metadata.
- `MeshMind` client (`meshmind.client.MeshMind`) wires together an OpenAI client, an embedding model name, and a Memgraph driver, exposing helpers for:
- `extract_memories` – LLM-based extraction that validates entity labels and populates embeddings through the encoder registry.
- `deduplicate`, `score_importance`, `compress` – preprocessing helpers.
- `store_memories` – persists each memory by calling `GraphDriver.upsert_entity`.
- Pipeline modules under `meshmind.pipeline` implement the extraction, preprocessing, compression, expiry, consolidation, and storage steps.
- Retrieval utilities under `meshmind.retrieval` provide TF-IDF (BM25-style) search, RapidFuzz fuzzy matching, metadata/namespace/entity-label filters, and a hybrid scorer that blends cosine similarity with lexical scores.
- `meshmind.api.memory_manager.MemoryManager` offers CRUD-style helpers over an injected graph driver (add/update/delete/list).
- `meshmind.tasks.scheduled` defines Celery tasks (optional) that invoke expiry, consolidation, and compression routines on a schedule.
- A CLI entry point (`python -m meshmind` or `meshmind` after installation) exposes an `ingest` subcommand for end-to-end extraction from local files/directories.

## Requirements

- Python 3.10+ per the documentation, although `pyproject.toml` currently pins `requires-python = ">=3.13"`; a 3.13 interpreter satisfies both until the metadata is reconciled.
- A running [Memgraph](https://memgraph.com/) instance accessible via Bolt, plus the `mgclient` Python driver.
- An OpenAI API key for both extraction and embedding (set `OPENAI_API_KEY`).
- Optional services: Redis (if Celery tasks are used).
- Python dependencies listed in `pyproject.toml` (`openai`, `pydantic`, `rapidfuzz`, `scikit-learn`, `numpy`, `celery[redis]`, `sentence-transformers`, `tiktoken`, `pymgclient`, etc.).

## Installation

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

Ensure Memgraph is running and environment variables are set:

```bash
export OPENAI_API_KEY=... # required by openai.OpenAI()
export MEMGRAPH_URI=bolt://localhost:7687
export MEMGRAPH_USERNAME=...
export MEMGRAPH_PASSWORD=...
```

Before calling extraction, register an embedding encoder that matches `settings.EMBEDDING_MODEL` (default `text-embedding-3-small`). For example:

```python
from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder
EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder())
```

## Quickstart

```python
from meshmind.client import MeshMind
from meshmind.core.types import Memory
from meshmind.core.embeddings import EncoderRegistry, OpenAIEmbeddingEncoder

# Register the embedding encoder expected by the pipeline
EncoderRegistry.register("text-embedding-3-small", OpenAIEmbeddingEncoder())

mm = MeshMind()
texts = [
    "Jane Doe is a senior software engineer based in Berlin.",
    "John Doe manages the infrastructure team in Berlin.",
]
memories = mm.extract_memories(
    instructions="Extract each distinct person as a Memory.",
    namespace="Company Directory",
    entity_types=[Memory],  # entity labels are validated against class names
    content=texts,
)
memories = mm.deduplicate(memories)
memories = mm.score_importance(memories)
memories = mm.compress(memories)
mm.store_memories(memories)
```

## Retrieval Helpers

The retrieval utilities operate on lists of `Memory` objects that are already loaded into Python (for example via `MemoryManager.list_memories`).

```python
from meshmind.retrieval.search import search, search_bm25, search_fuzzy
from meshmind.core.types import Memory, SearchConfig
from meshmind.core.embeddings import EncoderRegistry

# Register an encoder for hybrid/vector scoring
class DummyEncoder:
    def encode(self, texts):
        return [[len(t)] for t in texts]

EncoderRegistry.register("dummy", DummyEncoder())

memories = [
    Memory(namespace="Company", name="Jane Doe", entity_label="Memory", embedding=[7.0]),
    Memory(namespace="Company", name="John Doe", entity_label="Memory", embedding=[7.0]),
]
config = SearchConfig(encoder="dummy", top_k=5, hybrid_weights=(0.5, 0.5))
results = search("Jane", memories, namespace="Company", config=config)
```
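
When a graph driver is available, the memory list can also be loaded rather than constructed by hand. A sketch continuing the snippet above — `list_memories` is the helper mentioned earlier, while the `MemoryManager` constructor signature is an assumption:

```python
from meshmind.api.memory_manager import MemoryManager
from meshmind.db.memgraph_driver import MemgraphDriver

# Assumption: MemoryManager accepts the graph driver it should operate on.
manager = MemoryManager(MemgraphDriver())
memories = manager.list_memories()
results = search("Jane", memories, namespace="Company", config=config)
```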

## CLI Ingest

```bash
meshmind ingest \
--namespace "Company" \
--instructions "Extract personnel facts as Memory objects." \
path/to/files
```

The command reads all text files, extracts memories, runs the preprocessing pipeline, and stores the results using the configured Memgraph connection.

## Maintenance Tasks (Optional)

If Celery and Redis are available, import `meshmind.tasks.scheduled` to register three periodic tasks: `expire_task`, `consolidate_task`, and `compress_task`. Each task fetches memories through `MemoryManager` and applies the corresponding pipeline helper.
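
A possible beat wiring is sketched below; it assumes `meshmind.tasks.scheduled` exposes a module-level Celery app named `app` and that the tasks are registered under their dotted module paths — neither is guaranteed by the current code:

```python
from celery.schedules import crontab

from meshmind.tasks.scheduled import app  # assumption: module-level Celery app

app.conf.beat_schedule = {
    "expire-memories": {
        "task": "meshmind.tasks.scheduled.expire_task",  # assumed task name
        "schedule": crontab(hour=3, minute=0),
    },
    "consolidate-memories": {
        "task": "meshmind.tasks.scheduled.consolidate_task",
        "schedule": crontab(hour=3, minute=30),
    },
    "compress-memories": {
        "task": "meshmind.tasks.scheduled.compress_task",
        "schedule": crontab(hour=4, minute=0),
    },
}
```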

## Limitations & Next Steps

- Relationship/edge storage is not implemented; only nodes are persisted.
- Custom entity models are not instantiated during extraction—only `entity_label` names are validated.
- Advanced retrieval techniques (regex search, LLM reranking, external vector databases) are not implemented yet.
- Several modules assume optional dependencies (`mgclient`, `tiktoken`, OpenAI SDK) are installed and configured.

Refer to `PROJECT.md`, `ISSUES.md`, and `PLAN.md` for the detailed roadmap.
27 changes: 27 additions & 0 deletions PLAN.md
@@ -0,0 +1,27 @@
# Next Steps Plan

## Phase 1 – Stabilize the Existing Surface
1. **Adopt `NEW_README.md`** as the public README and archive the legacy version so expectations match reality.
2. **Fix critical dependency traps** (a sketch of the SDK `encode` fix follows this list):
- Wrap `tiktoken` imports in `core/utils.py` and `pipeline/compress.py` with guards.
- Update `OpenAIEmbeddingEncoder` to use `.data` from the modern SDK and add tests.
- Lazily instantiate `MemgraphDriver` (e.g., in `store_memories` or via a factory) to avoid hard crashes when mgclient is missing.
3. **Auto-register embeddings**: when MeshMind starts, register an `OpenAIEmbeddingEncoder` using `settings.EMBEDDING_MODEL` so extraction works immediately.
4. **Repair the test suite**: modernize mocks for the OpenAI client, remove references to `Memory.pre_init`, and add coverage for the dependency fallbacks above.
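
For item 2, a minimal sketch of the `encode` fix: only the `client.embeddings.create(...)` call and the `.data[...].embedding` access reflect the real OpenAI SDK, while the class shape and constructor are assumptions (the real class lives in `meshmind.core.embeddings`).

```python
from openai import OpenAI


class OpenAIEmbeddingEncoder:
    """Sketch only — not the shipped implementation."""

    def __init__(self, model: str = "text-embedding-3-small", client: OpenAI | None = None):
        self.model = model
        self.client = client or OpenAI()

    def encode(self, texts: list[str]) -> list[list[float]]:
        # The modern SDK returns typed objects: response.data is a list of Embedding objects.
        response = self.client.embeddings.create(model=self.model, input=texts)
        return [item.embedding for item in response.data]
```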

## Phase 2 – Deliver Promised Graph Features
5. **Implement entity/predicate registration APIs** on `MeshMind`, backed by `meshmind.models.registry`. Persist this metadata in Memgraph or an internal registry.
6. **Add triplet storage**: extend the pipeline to transform memory metadata into node/edge upserts (subject–predicate–object) and expose `add_triplet` / `add_memory` helpers (a rough shape follows this list).
7. **Provide graph retrieval helpers**: implement functions to fetch memories by namespace, list predicates, and query neighbors directly via the driver.
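
A rough shape for item 6. Everything below is an assumption about a future API — the `Triplet` field names, the `upsert_entity` return value, the argument order of `upsert_edge`, and the `add_triplet` signature are illustrative; only the fact that `upsert_edge` expects UUIDs comes from the current code:

```python
from uuid import UUID

from meshmind.core.types import Triplet  # defined today but currently unused
from meshmind.db.memgraph_driver import MemgraphDriver


def add_triplet(driver: MemgraphDriver, triplet: Triplet) -> None:
    """Hypothetical helper: persist subject -[predicate]-> object as two nodes plus an edge."""
    # Assumption: upsert_entity returns the UUID of the created/updated node.
    subject_id: UUID = driver.upsert_entity(triplet.subject)
    object_id: UUID = driver.upsert_entity(triplet.object)
    # upsert_edge expects UUIDs for subject/object (see FINDINGS.md); argument order assumed.
    driver.upsert_edge(subject_id, triplet.predicate, object_id)
```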

## Phase 3 – Enhance Retrieval & Automation
8. **Expand retrieval modes**: add vector-only search, regex/exact match filters, and an optional LLM reranker stage configurable through `SearchConfig`.
9. **Integrate retrieval with storage**: allow `MemoryManager` or the driver to return ranked results instead of requiring callers to preload all memories.
10. **Strengthen maintenance workflows**: initialize Celery tasks lazily with logging, ensure expiry/consolidation/compression can operate when Memgraph is down (e.g., retry/backoff); see the sketch below.
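
For item 10, a sketch of per-invocation driver creation with logging and retries; `shared_task` and `self.retry` are standard Celery, while the no-argument `MemgraphDriver()` constructor and the retry settings are assumptions:

```python
import logging

from celery import shared_task

from meshmind.db.memgraph_driver import MemgraphDriver

logger = logging.getLogger(__name__)


@shared_task(bind=True, max_retries=3, default_retry_delay=60)
def expire_task(self):
    """Create the driver per invocation so an import-time failure cannot leave it as None."""
    try:
        driver = MemgraphDriver()  # assumption: no-arg constructor reads settings
    except Exception as exc:
        logger.exception("Memgraph unavailable; retrying expire_task")
        raise self.retry(exc=exc)
    # ...fetch memories via MemoryManager(driver) and run the expiry pipeline...
```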

## Phase 4 – Developer Experience & Observability
11. **Align tooling**: reconcile Python version requirements, add dev dependencies (`ruff`, `isort`, `black`) to a `dev` extra, and wire CI to run lint/tests.
12. **Document architecture**: keep `SOT.md` updated, add diagrams or sequence charts, and document configuration/operations in the README.
13. **Add logging and metrics**: instrument CLI ingestion, pipeline stages, and graph operations for debugging and future monitoring.

Revisit this plan after Phase 1 to adjust scope based on effort estimates and stakeholder priorities.