44 changes: 44 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,44 @@
name: CI

on:
  push:
    branches: ["main", "review", "review-1"]
  pull_request:

env:
  PIP_DISABLE_PIP_VERSION_CHECK: "1"

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Check out
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install toolchain
        run: |
          pip install uv
          uv pip install -e .
          uv pip install ruff pyright typeguard toml-sort yamllint
      - name: Lint and format checks
        run: make fmt-check

  tests:
    runs-on: ubuntu-latest
    steps:
      - name: Check out
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install uv
          uv pip install -e .
          uv pip install pytest
      - name: Run pytest
        run: make test
11 changes: 11 additions & 0 deletions AGENTS.md
@@ -0,0 +1,11 @@
# Agent Instructions

## Documentation Workflow
- Update `CHANGELOG.md` with a new entry every time code changes are committed.
- Maintain `README_LATEST.md` so it always reflects the current implementation; refresh it alongside major feature updates.
- After each iteration, revise `ISSUES.md`, `SOT.md`, `PLAN.md`, `RECOMMENDATIONS.md`, and `TODO.md` to stay in sync with the codebase.

## Style Guidelines
- Use descriptive Markdown headings starting at level 1 for top-level documents.
- Keep lines to 120 characters or fewer when practical.
- Prefer bullet lists for enumerations instead of inline commas.
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,20 @@
# Changelog

## [Unreleased] - 2025-02-14
### Added
- Vector-only, regex, exact-match, and optional LLM rerank retrieval helpers with reranker utilities and exports.
- MeshMind client wrappers for hybrid, vector, regex, and exact searches plus driver accessors.
- Example script demonstrating triplet storage and diverse retrieval flows.
- Pytest fixtures for encoder and memory factories alongside new retrieval tests that avoid external services.
- Makefile targets for linting, formatting, type checks, and tests, plus a GitHub Actions workflow running lint and pytest.
- README_LATEST.md capturing the current implementation and CHANGELOG.md for release notes.

### Changed
- Updated `SearchConfig` to support rerank models and refreshed MeshMind documentation across PROJECT, PLAN, SOT, FINDINGS,
DISCREPANCIES, RECOMMENDATIONS, ISSUES, TODO, and NEEDED_FOR_TESTING files.
- Revised `meshmind.retrieval.search` to apply filters centrally, expose new search helpers, and integrate reranking.
- Exposed graph driver access on MeshMind and refreshed retrieval-facing examples and docs.

### Fixed
- Example ingestion script now uses MeshMind APIs correctly and illustrates relationship persistence.
- Tests rely on fixtures rather than deprecated hooks, improving portability across environments without Memgraph/OpenAI.
55 changes: 55 additions & 0 deletions DISCREPANCIES.md
@@ -0,0 +1,55 @@
# README vs Implementation Discrepancies

## Overview
- The legacy README still promises a fully featured memory graph with multi-level APIs, relationship storage, and diverse
retrieval methods. Many of those features now exist, but the document remains outdated and should be replaced by
`README_LATEST.md`.
- The current codebase delivers extraction, preprocessing, triplet persistence, CRUD helpers, and expanded retrieval strategies
that were missing when the README was written.
- Remaining gaps primarily involve graph-backed retrieval, observability, and automated infrastructure provisioning.

## API Surface
- ✅ `MeshMind` now exposes CRUD helpers (`create_memory`, `update_memory`, `delete_memory`, `list_memories`, triplet helpers)
that the README referenced implicitly.
- ✅ Triplet storage routes through `store_triplets` and `MemoryManager.add_triplet`, calling `GraphDriver.upsert_edge`.
- ⚠️ The README still references `register_entity`, `register_allowed_predicates`, and `add_predicate`; predicate management is
handled automatically but there is no public API matching those method names.
- ⚠️ README snippets showing `mesh_mind.store_memory(memory)` should be updated to call `store_memories([memory])` or the new
CRUD helpers.

## Retrieval Capabilities
- ✅ Vector-only, regex, exact-match, hybrid, BM25, fuzzy, and optional LLM rerank searches exist in `meshmind.retrieval.search`
and are surfaced through `MeshMind` helpers.
- ⚠️ README implies retrieval queries the graph directly. Current search helpers operate on in-memory lists supplied by the
caller; Memgraph-backed retrieval remains future work.
- ⚠️ Named helpers like `search_facts` or `search_procedures` never existed; the README should reference the dispatcher plus
specialized helpers now available.
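
The in-memory behaviour described above can be illustrated with a minimal sketch. The `Memory` dataclass and the
`search_regex`/`search_exact` names here are hypothetical stand-ins, not the actual `meshmind.retrieval.search` signatures;
the point is only that the caller supplies the list and nothing queries the graph.

```python
import re
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Memory:
    """Hypothetical stand-in; not the real meshmind Memory model."""

    name: str
    entity_label: str = "Fact"
    uuid: str = field(default_factory=lambda: str(uuid4()))


def search_regex(memories, pattern, flags=re.IGNORECASE):
    """Return the memories whose name matches the regular expression."""
    compiled = re.compile(pattern, flags)
    return [m for m in memories if compiled.search(m.name)]


def search_exact(memories, query, case_sensitive=False):
    """Return the memories whose name equals the query string."""
    if case_sensitive:
        return [m for m in memories if m.name == query]
    needle = query.casefold()
    return [m for m in memories if m.name.casefold() == needle]


# The caller owns the candidate list; no database is involved.
memories = [
    Memory("Python 3.11 release"),
    Memory("Memgraph setup"),
    Memory("python tips"),
]
regex_hits = search_regex(memories, r"^python")
exact_hits = search_exact(memories, "memgraph setup")
```

Because both helpers only filter a list, they can run in tests without Memgraph or OpenAI.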

## Data & Relationship Modeling
- ✅ Predicates are persisted automatically when storing triplets and tracked in `PredicateRegistry`.
- ⚠️ README examples that look up subjects/objects by name still do not match the implementation, which expects UUIDs. Add
documentation explaining how to resolve names to UUIDs before storing edges.
- ⚠️ Consolidation and expiry remain limited to Celery jobs; README narratives about integrated maintenance still overstate the
current persistence story.
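
One way to resolve names to UUIDs before storing edges is sketched below. It assumes memories are available as plain
dictionaries with `name` and `uuid` keys; `build_name_index` and `resolve_triplet` are hypothetical helpers, not part of
the MeshMind API.

```python
from uuid import uuid4


def build_name_index(memories):
    """Map each memory name to its UUID, rejecting ambiguous names."""
    index = {}
    for mem in memories:
        if mem["name"] in index:
            raise ValueError(f"ambiguous name: {mem['name']!r}")
        index[mem["name"]] = mem["uuid"]
    return index


def resolve_triplet(index, subject, predicate, obj):
    """Translate a (name, predicate, name) triplet into UUID form."""
    try:
        return (index[subject], predicate, index[obj])
    except KeyError as exc:
        raise LookupError(f"no memory named {exc.args[0]!r}") from exc


# Illustrative data only; real memories would come from the extraction step.
memories = [
    {"name": "Alice", "uuid": uuid4()},
    {"name": "Acme", "uuid": uuid4()},
]
index = build_name_index(memories)
edge = resolve_triplet(index, "Alice", "works_at", "Acme")
```

Failing loudly on duplicate or unknown names keeps a bad lookup from silently creating an edge between the wrong nodes.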

## Configuration & Dependencies
- ✅ `README_LATEST.md` and `NEEDED_FOR_TESTING.md` document required environment variables, dependency guards, and setup steps.
- ⚠️ The legacy README omits optional tooling now required by the Makefile/CI (ruff, pyright, typeguard, toml-sort, yamllint).
- ⚠️ Python version support in `pyproject.toml` (3.13) still diverges from what many dependencies officially support; update the
documentation or relax the requirement.

## Example Code Paths
- ✅ Updated example scripts demonstrate extraction, triplet creation, and multiple retrieval strategies.
- ⚠️ Legacy README code that instantiates custom Pydantic entities remains inaccurate; extraction returns `Memory` objects and
validates `entity_label` names only.
- ⚠️ Search examples should be updated to show the new helper functions and optional rerank usage instead of nonexistent
`search_facts`/`search_procedures` calls.

## Tooling & Operations
- ✅ A Makefile and CI workflow now exist, so the README's automation promises will hold once the README is refreshed.
- ⚠️ Docker Compose still lacks service definitions for Memgraph/Redis; README setup sections should call this out explicitly.
- ⚠️ Celery tasks remain best-effort shims; README should clarify that maintenance requires the optional infrastructure.

## Documentation State
- Promote `README_LATEST.md` as the authoritative guide, archive the legacy README, and ensure future updates propagate to
supporting docs (`SOT.md`, `PLAN.md`, `NEEDED_FOR_TESTING.md`).
39 changes: 39 additions & 0 deletions FINDINGS.md
@@ -0,0 +1,39 @@
# Findings

## General Observations
- Core modules are now wired through the `MeshMind` client, including CRUD, triplet storage, and retrieval helpers. Remaining
integration work centers on graph-backed retrieval and maintenance persistence.
- Optional dependencies are largely guarded behind lazy imports or factory functions, improving portability. Environments still
need to install tooling referenced by the Makefile and CI (ruff, pyright, typeguard, toml-sort, yamllint).
- Documentation artifacts (`README_LATEST.md`, `SOT.md`, `NEEDED_FOR_TESTING.md`) stay current only if they are updated
with each iteration; the legacy README should be archived.

## Dependency & Environment Notes
- `MeshMind` defers Memgraph driver creation until persistence is required, enabling limited workflows without `pymgclient`.
- Encoder registration occurs during bootstrap, but custom deployments must ensure compatible models are registered before
extraction or hybrid search.
- The OpenAI embedding adapter still expects dictionary-like responses; adapting to SDK objects remains on the backlog.
- Celery tasks initialize lazily, yet Redis/Memgraph services are still required at runtime. Docker Compose lacks concrete
service definitions.
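
The adapter gap noted above (dictionary-like responses versus SDK objects) could be bridged with a small accessor that
accepts both shapes. This is a sketch, not the existing adapter: `extract_embeddings` is a hypothetical name, and
`SimpleNamespace` stands in for an SDK response object that exposes `.data` and `.embedding` attributes.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def extract_embeddings(response):
    """Return embedding vectors from a mapping-style or attribute-style response."""
    data = response["data"] if isinstance(response, Mapping) else response.data
    vectors = []
    for item in data:
        if isinstance(item, Mapping):
            vectors.append(item["embedding"])
        else:
            vectors.append(item.embedding)
    return vectors


# Stand-ins for the two response shapes; no network call is made.
legacy_response = {"data": [{"embedding": [0.1, 0.2]}]}
sdk_response = SimpleNamespace(data=[SimpleNamespace(embedding=[0.1, 0.2])])

legacy_vectors = extract_embeddings(legacy_response)
sdk_vectors = extract_embeddings(sdk_response)
```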

## Data Flow & Persistence
- Triplet storage now persists relationships and tracks predicates automatically, closing an earlier data-loss gap.
- Consolidation and compression utilities operate in memory; persistence of maintenance results is still pending.
- Importance scoring remains a constant fallback; improved heuristics should raise retrieval quality once implemented.

## CLI & Tooling
- CLI ingestion bootstraps encoders and entities automatically but still assumes Memgraph and OpenAI credentials are configured.
- The Makefile introduces lint, format, type-check, and test targets, plus a Docker helper. External tooling installation is
required before targets succeed.
- GitHub Actions now run formatting checks and pytest on push/PR, providing basic CI coverage.

## Testing & Quality
- Pytest suites rely on fixtures (`memory_factory`, `dummy_encoder`) to run without external services. Additional coverage is
needed for Celery workflows and graph-backed retrieval when implemented.
- Type checking via `pyright` and runtime checks via `typeguard` are exposed in the Makefile; dependency installation is
necessary for full validation.
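
To show how service-free test doubles of this kind can work, here is a minimal sketch. `DummyEncoder` and `make_memory`
are hypothetical illustrations, not the project's actual `dummy_encoder` and `memory_factory` fixtures; in a real suite
they could be wrapped with `@pytest.fixture`.

```python
import hashlib


class DummyEncoder:
    """Deterministic stand-in encoder that needs no model download or API key."""

    def __init__(self, dim=8):
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def encode(self, texts):
        """Map each text to a fixed-length vector derived from its SHA-256 digest."""
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


def make_memory(name, **overrides):
    """Build a memory-like dict with test defaults; keyword arguments override them."""
    memory = {"name": name, "entity_label": "Fact", "namespace": "test"}
    memory.update(overrides)
    return memory


encoder = DummyEncoder(dim=8)
vectors = encoder.encode(["alpha", "alpha", "beta"])
memory = make_memory("alpha", namespace="demo")
```

Hash-derived vectors carry no semantic meaning, so they suit plumbing tests rather than ranking-quality tests.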

## Documentation
- `README_LATEST.md` supersedes the legacy README and documents setup, pipelines, retrieval, and tooling.
- Supporting docs (`ISSUES.md`, `PLAN.md`, `RECOMMENDATIONS.md`, `SOT.md`) reflect the latest capabilities and highlight remaining
gaps, aiding onboarding and future planning.
30 changes: 30 additions & 0 deletions ISSUES.md
@@ -0,0 +1,30 @@
# Issues Checklist

## Blockers
- [ ] MeshMind client fails without `mgclient`; introduce lazy driver initialization or documented in-memory fallback.
- [ ] Register a default embedding encoder (OpenAI or sentence-transformers) during startup so extraction and hybrid search can run.
- [ ] Update OpenAI integration to match the current SDK (Responses API payload, embeddings API response structure).
- [ ] Replace eager `tiktoken` imports in `meshmind.core.utils` and `meshmind.pipeline.compress` with guarded, optional imports.
- [ ] Align declared Python requirement with supported dependencies (currently set to Python 3.13 despite ecosystem gaps).
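
A guarded, optional import of the kind the `tiktoken` item calls for might look like the sketch below. The helper names
and the whitespace fallback are assumptions for illustration, not the current contents of `meshmind.core.utils`.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def count_tokens(text, tokenizer_module="tiktoken", encoding_name="cl100k_base"):
    """Count tokens with the tokenizer if available, else fall back to whitespace splitting."""
    module = optional_import(tokenizer_module)
    if module is None:
        # Assumed fallback: a rough word count keeps callers working without the dependency.
        return len(text.split())
    return len(module.get_encoding(encoding_name).encode(text))
```

Deferring the import to call time means that importing the package itself never fails when `tiktoken` is absent.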

## High Priority
- [x] Implement relationship persistence (`GraphDriver.upsert_edge`) within the storage pipeline and expose triplet APIs.
- [x] Restore high-level API methods promised in README (`register_entity`, predicate management, `add_memory`, `update_memory`, `delete_memory`).
- [x] Ensure CLI ingestion registers entity models and embedding encoders or fails fast with actionable messaging.
- [x] Provide configuration documentation and examples for Memgraph, Redis, and OpenAI environment variables.
- [x] Add automated tests or smoke checks that run without external services (mock OpenAI, stub Memgraph driver).
- [ ] Create real docker-compose services for Memgraph and Redis or remove the placeholder file.

## Medium Priority
- [ ] Persist results from consolidation and compression tasks back to the database (currently in-memory only).
- [ ] Refine `Memory.importance` scoring to reflect actual ranking heuristics instead of a constant.
- [x] Add vector, regex, and exact-match search helpers to match stated feature set or update documentation to demote them.
- [ ] Harden Celery tasks to initialize dependencies lazily and log failures when the driver is unavailable. (In progress: lazy driver initialization added, persistence pending)
- [ ] Reconcile tests that depend on `Memory.pre_init` and outdated OpenAI interfaces with the current implementation.
- [x] Add linting, formatting, and type-checking tooling to improve code quality.

## Low Priority / Nice to Have
- [ ] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development.
- [ ] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics.
- [ ] Publish onboarding guides and troubleshooting FAQs for contributors.
- [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring.
24 changes: 19 additions & 5 deletions Makefile
@@ -1,17 +1,31 @@
-.PHONY: install lint fmt test docker
+.PHONY: install lint fmt fmt-check typecheck test check docker clean
 
 install:
 	pip install -e .
 
 lint:
-	ruff .
+	ruff check .
 
 fmt:
-	isort .
-	black .
+	ruff format .
 
+fmt-check:
+	ruff format --check .
+	ruff check .
+	toml-sort --check pyproject.toml
+	yamllint .github/workflows
+
+typecheck:
+	pyright
+	python -m typeguard --check meshmind
+
 test:
 	pytest
 
+check: fmt-check lint typecheck test
+
+clean:
+	rm -rf .pytest_cache .ruff_cache
+
 docker:
-	docker-compose up
+	docker compose up
45 changes: 45 additions & 0 deletions NEEDED_FOR_TESTING.md
@@ -0,0 +1,45 @@
# Needed for Testing MeshMind

## Python Runtime
- Python 3.11 or 3.12 is recommended; project metadata currently states 3.13+, but several dependencies (`pymgclient`,
`sentence-transformers`) do not yet publish wheels for 3.13.
- Use a virtual environment (`uv`, `venv`, or `conda`) to isolate dependencies.

## Python Dependencies
- Install the project editable: `pip install -e .` from the repository root.
- Required packages declared in `pyproject.toml` include `openai`, `pydantic`, `pydantic-settings`, `numpy`, `scikit-learn`,
`rapidfuzz`, `python-dotenv`, `celery[redis]`, `sentence-transformers`, `tiktoken`, and `pymgclient`.
- Development tooling referenced by the Makefile and CI:
- `ruff` for linting and formatting.
- `pyright` for static type checks.
- `typeguard` for runtime type enforcement (`python -m typeguard --check meshmind`).
- `toml-sort` and `yamllint` for configuration validation.
- Optional helpers for local workflows: `pytest-cov`, `pre-commit`, `httpx` (for future service interfaces).

## External Services and Infrastructure
- **Memgraph** (or compatible Bolt graph database) reachable via `MEMGRAPH_URI` with credentials exported in
`MEMGRAPH_USERNAME`/`MEMGRAPH_PASSWORD`.
- **Redis** for Celery task queues, referenced through `REDIS_URL`.
- **OpenAI API access** for extraction, embeddings, and LLM reranking (`OPENAI_API_KEY`).
- Recommended: Docker Compose or equivalent orchestration to run Memgraph and Redis together when developing locally.

## Environment Variables
- `OPENAI_API_KEY` — required for extraction, embeddings, and reranking.
- `MEMGRAPH_URI` — e.g., `bolt://localhost:7687`.
- `MEMGRAPH_USERNAME` and `MEMGRAPH_PASSWORD` — credentials for the graph database.
- `REDIS_URL` — optional Redis connection URI (defaults to `redis://localhost:6379/0`).
- `EMBEDDING_MODEL` — encoder key registered with `EncoderRegistry` (defaults to `text-embedding-3-small`).
- Optional overrides for Celery broker/backend if using hosted services.
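
The variables above could be loaded and validated at startup roughly as follows. This is a sketch: `load_settings` is a
hypothetical helper, and treating the three `MEMGRAPH_*` variables as mandatory is an assumption; only the two default
values are taken from the list above.

```python
import os

# Defaults as documented above.
DEFAULTS = {
    "REDIS_URL": "redis://localhost:6379/0",
    "EMBEDDING_MODEL": "text-embedding-3-small",
}
# Assumption: all of these must be set before persistence or extraction runs.
REQUIRED = ("OPENAI_API_KEY", "MEMGRAPH_URI", "MEMGRAPH_USERNAME", "MEMGRAPH_PASSWORD")


def load_settings(environ=None):
    """Collect settings from the environment, failing fast on missing required keys."""
    env = os.environ if environ is None else environ
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED}
    for key, default in DEFAULTS.items():
        settings[key] = env.get(key, default)
    return settings


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "OPENAI_API_KEY": "sk-placeholder",
    "MEMGRAPH_URI": "bolt://localhost:7687",
    "MEMGRAPH_USERNAME": "user",
    "MEMGRAPH_PASSWORD": "secret",
}
settings = load_settings(example_env)
```

Passing a mapping explicitly keeps the helper testable without mutating `os.environ`.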

## Local Configuration Steps
- Ensure an embedding encoder is registered before extraction or hybrid search. The bootstrap utilities invoked by the CLI and
`MeshMind` constructor handle this, but custom scripts must call `bootstrap_encoders()`.
- Seed demo data as needed using the `examples/extract_preprocess_store_example.py` script after configuring environment
variables.
- Create a `.env` file storing the environment variables above for consistent local configuration.

## Current Blockers in This Environment
- Neo4j/Memgraph binaries and Docker are unavailable in this workspace, preventing local graph provisioning.
- Redis cannot be installed without container or host-level access; Celery tasks remain untestable locally until a remote
instance is provisioned.
- External network restrictions may limit installation of proprietary packages or access to OpenAI endpoints.