-
Notifications
You must be signed in to change notification settings - Fork 5
(draft/poc) Refactoring to improve llm usability #79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
planetf1
wants to merge
49
commits into
AI4quantum:main
Choose a base branch
from
planetf1:phase1-remove-input-wrapper
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
BREAKING CHANGE: All MCP tool calls now use flat parameter structure Changes: - Removed 21 Input model classes (QueryInput, WriteDocumentsInput, etc.) - Refactored 22 tool functions to use flat Field() parameters - Renamed parameters for clarity (db_name→database, db_type→database_type, etc.) - Updated all test files to use new flat parameter structure - Fixed E2E tests (9/10 passing with Milvus) - Updated start.sh to use 'uv run python' - Created comprehensive migration guide and documentation - Added AGENTS.md with development tips for AI agents Test Results: - Unit/Integration: 239 tests passing - E2E Milvus: 9 passed, 1 skipped - Schema validation: 4 new tests passing Files Modified: - src/maestro_mcp/server.py (major refactor, -195 lines Input models) - tests/test_*.py (7 test files updated) - tests/e2e/test_functions.py (55 tool calls updated) - start.sh (updated for uv) - docs/MIGRATION_GUIDE.md (new, 476 lines, with status tracker) - docs/AGENTS.md (new, 250 lines, agent development guide) - tests/test_phase1_schema_validation.py (new, 4 validation tests) Migration Status: - Phase 1: ✅ COMPLETE - Phase 2-10: 📋 PLANNED (see docs/REFACTORING_PLAN.md) This implements Phase 1 of the refactoring plan to fix LLM agent compatibility by removing nested parameter structures that caused 100% failure rate. For AI agents: See docs/AGENTS.md for development tips and common pitfalls.
204d3e0 to
afe2807
Compare
…ate database setup Phase 2.5: Enhanced embedding parameter documentation - Added comprehensive docs for all embedding options - Clarified default vs custom_local usage Phase 2.6: Separated database setup from collection creation - Renamed create_vector_database_tool → register_database - Split setup_database (connection only) from create_collection - Added validation enforcing correct workflow order - New 3-step workflow: register → setup → create_collection - Updated migration guide with before/after examples - 10/10 tests passing Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
afe2807 to
fa66dda
Compare
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
6a9374e to
d81d1f0
Compare
Fixed critical bug where overlapping text chunks were duplicated during document reassembly. Implements intelligent overlap detection using chunk metadata with text-based fallback. Changes: - Fixed _reassemble_chunks_into_document() to handle overlaps correctly - Added _find_text_overlap() helper for text-based overlap detection - Added 14 comprehensive unit tests in tests/test_reassembly.py - Updated E2E test to use base class method instead of manual workaround - Updated migration guide with Phase 3 documentation - Fixed phase numbering in migration guide to match refactoring plan - Fixed ruff linting errors (added return type annotations) Technical Details: - Primary: Offset-based detection using offset_start/offset_end metadata - Fallback: Text-based detection when metadata unavailable - Handles edge cases: empty chunks, out-of-order, missing metadata - Works with all chunking strategies (Fixed, Sentence, Semantic) Impact: - No breaking changes - transparent improvement - No user code changes required - Automatic fix for all documents with overlapping chunks - 16/16 reassembly tests passing - All ruff checks passing Closes: Phase 3 of refactoring plan Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…rmat
Phase 4: Search Quality Controls
- Add min_score parameter to filter results by similarity threshold (0-1)
- Add metadata_filters parameter for AND-based metadata filtering
- Implement filtering in both Milvus and Weaviate backends
- Update MCP server search tool with new optional parameters
Phase 5: Improved Citation Format
- Add top-level url field to search results for easy LLM access
- Add source_citation field with formatted string: "Source: {doc_name} ({url})"
- Add canonical score field (normalized 0-1) across all backends
- Maintain backward compatibility with existing metadata structure
Testing:
- Add comprehensive integration test suite (7 tests)
- All tests passing (48 total in standard test suite)
- Tests use proper async mocking patterns
Documentation:
- Update MIGRATION_GUIDE.md with Phase 4 & 5 sections
- Update REFACTORING_PLAN.md status to COMPLETE
- Include usage examples and migration notes
All changes are backward compatible with no breaking changes.
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Implemented comprehensive error handling improvements to make the MCP server more LLM-agent friendly with actionable error messages and clear guidance. ### Changes **New Module:** - src/maestro_mcp/error_messages.py: Centralized error message templates with actionable recovery steps for 10+ common error scenarios **Enhanced Error Handling:** - register_database: Database type validation, existence checks - setup_database: Database validation, embedding error detection - create_collection: Comprehensive error handling with specific guidance - query: Parameter validation, collection existence checks - search: Parameter validation, result filtering, helpful suggestions - delete_document_from_collection: Clear error messages for missing resources **Improved Tool Documentation:** - Added Prerequisites section to all tool functions - Added Next Steps guidance after successful operations - Added Common Errors section with solutions - Enhanced parameter descriptions with validation rules **Parameter Validation:** - limit: Must be 1-100 - min_score: Must be 0.0-1.0 - database_type: Must be milvus or weaviate - embedding: Validated against supported models **Test Updates:** - Updated test_phase26_workflow.py to match new error message format - Fixed 4 tests expecting old "Error:" prefix - All tests passing (24 passed) **Documentation:** - Added Phase 6 section to MIGRATION_GUIDE.md - Documented all error types and improvements - Included before/after examples ### Benefits - LLM agents receive actionable error messages with clear recovery steps - Better parameter validation catches errors early - Improved tool documentation helps agents understand requirements - Backward compatible - no breaking changes to API ### Example Before: `Error: Database mydb not found` After: Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…hema
Remove deprecated input wrapper and update parameter names in E2E test files
to align with Phase 1 flat parameter schema implementation.
Changes:
- tests/e2e/test_functions_simple.py: Remove {"input": {...}} wrapper, update parameter names
- tests/e2e/test_mcp_weaviate_simple.py: Remove {"input": {...}} wrapper, update parameter names
- docs/REFACTORING_PLAN.md: Mark Phase 7 as complete
Parameter updates:
- db_name → database
- db_type → database_type
- collection_name → collection
All Phase 1 schema validation tests pass. E2E tests now use direct parameter
passing to MCP tools, consistent with the flat schema design.
Refs: Phase 7 of REFACTORING_PLAN.md
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Add Phase 4 parameter tests to test_mcp_query.py (min_score, metadata_filters) - Create test_search_quality_e2e.py with comprehensive E2E tests for Phase 4/5 - Delete DOCUMENTATION_REVIEW_FINDINGS.md (all actions complete) Closes all test gaps identified in documentation review: - Phase 4: min_score and metadata_filters parameters - Phase 5: url, source_citation, and score fields in results - E2E coverage for search quality controls and citation format All tests pass. Test coverage now complete for Phases 1-8. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
… optional URL, better defaults
Phase 8.5 completes the LLM-friendly refactoring with critical usability improvements:
**Auto-Detect Embeddings:**
- Changed default from "default" to "auto" in register_database(), setup_database(), create_collection()
- System now auto-detects custom embeddings from CUSTOM_EMBEDDING_URL, CUSTOM_EMBEDDING_MODEL, CUSTOM_EMBEDDING_VECTORSIZE
- Falls back to OpenAI only when custom embeddings not configured
- Eliminates "OPENAI_API_KEY required" errors when using local embeddings (Ollama)
**Simplified Workflow:**
- Merged register_database() and setup_database() into single step
- Reduced from 3-step to 2-step workflow: register (with setup) → create_collection
- setup_database() still works but marked deprecated
**Optional URL Parameter:**
- Made url parameter optional in write_document() with default=""
- Auto-generates document IDs from text hash when url is empty (format: doc_{hash})
- LLMs no longer need to provide URLs for simple text documents
**Better Chunking Defaults:**
- Changed default chunking strategy from "None" to "Sentence" (512 chars, respects sentence boundaries)
- Prevents failures on large documents without manual chunking configuration
**Type Checking Fixes:**
- Fixed dict[str, object] type annotation issue in src/chunking/common.py
- Resolved all type checking errors
**Test Updates:**
- Fixed Weaviate workflow test to mock setup() call
- Updated test assertions for new success messages
- All 267 tests passing
**Documentation:**
- Updated REFACTORING_SUMMARY.md with Phase 8.5 section
- Updated MIGRATION_GUIDE.md with 2-step workflow and auto-detection
- Updated AGENTS.md with latest API changes
- Updated README.md with embedding auto-detection and chunking defaults
**Impact:**
- LLMs can now use the system without specifying embedding models or URLs
- System automatically uses custom embeddings when environment variables are set
- No more confusing errors about missing OpenAI keys when using local embeddings
- Large documents work by default without manual chunking configuration
**Breaking Changes:** None - all changes are backward compatible
Files changed:
- src/maestro_mcp/server.py (auto-detect logic, optional URL, merged workflow)
- src/chunking/common.py (default strategy, type fixes)
- tests/test_phase26_workflow.py (test updates)
- docs/REFACTORING_SUMMARY.md (Phase 8.5 documentation)
- docs/MIGRATION_GUIDE.md (updated API reference)
- AGENTS.md (updated status and examples)
- README.md (auto-detection and chunking info)
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- register_database → create_database - cleanup → delete_database - resync_databases_tool → refresh_databases - Updated server.py, error messages, and all test files - All unit and E2E tests passing Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Step 1: Renamed core tools - register_database → create_database - cleanup → delete_database - resync_databases_tool → refresh_databases Step 2: Removed deprecated setup_database - Deleted setup_database tool (functionality in create_database) - Deleted obsolete test file test_phase26_workflow.py - Updated all test references Tools reduced: 22 → 21 All unit and E2E tests passing Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…info Step 3 of Phase 9.1 tool consolidation (22→21 tools) Changes: - Added include_embeddings parameter to get_database_info - Removed get_supported_embeddings tool - Updated all tests (unit, integration, E2E) Breaking Changes: - get_supported_embeddings() removed - Use get_database_info(database="x", include_embeddings=True) Files Modified: - src/maestro_mcp/server.py - tests/test_mcp_server.py - tests/test_phase1_schema_validation.py - tests/test_integration_mcp_server.py - tests/e2e/test_functions.py - tests/e2e/test_functions_simple.py - tests/e2e/test_mcp_weaviate_simple.py Tests: All passing (257 tests) Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Removed standalone count_documents tool - Added include_count parameter to get_collection_info - Updated E2E tests to use get_collection_info with include_count=True - Updated tool list validation tests Tool count: 22 → 21 Breaking change: count_documents tool removed Migration: Use get_collection_info(database="db", include_count=True) Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…(Step 5) - Remove write_document tool (single document wrapper) - Remove write_document_to_collection tool (collection-specific) - Update all test references to use write_documents - Tool count reduced: 22 → 20 tools Changes: - src/maestro_mcp/server.py: Removed both write_document functions - tests/test_phase1_schema_validation.py: Updated expected tool list - tests/test_mcp_server.py: Removed write_document from expected tools - tests/e2e/test_functions.py: Replaced calls with write_documents All tests passing (test_phase1_schema_validation.py: 4/4, test_mcp_server.py: 3/3) Part of Phase 9.1: Tool Consolidation & Naming Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…s (Step 6) - Remove delete_document tool (single document wrapper) - Remove delete_document_from_collection tool (collection-specific) - Update all test references to use delete_documents with document_ids array - Update example files to use delete_documents - Tool count reduced: 20 → 18 tools Changes: - src/maestro_mcp/server.py: Removed both delete_document functions - tests/test_phase1_schema_validation.py: Updated expected tool list - tests/test_mcp_server.py: Removed delete_document from expected tools - tests/test_integration_examples.py: Updated to check for delete_documents - tests/e2e/test_functions.py: Replaced calls with delete_documents, added logic to find document IDs - examples/milvus_example.py: Updated to use delete_documents - examples/weaviate_example.py: Updated to use delete_documents All tests passing: - test_phase1_schema_validation.py: 4/4 passed - test_mcp_server.py: 3/3 passed - test_integration_examples.py: 1/1 passed Part of Phase 9.1: Tool Consolidation & Naming Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Remove redundant document listing tools as part of Phase 9.1 tool consolidation. Users should now use search with query="*" to list documents. Changes: - Remove list_documents tool (replaced by search) - Remove list_documents_in_collection tool (replaced by search with collection param) - Remove count_documents tool (merged into get_collection_info in Step 4) - Update test files to remove tool references - Fix E2E tests: add required query parameter to all search calls Tool count: 22 → 15 tools Breaking Changes: - list_documents() removed - use search(query="*") instead - list_documents_in_collection() removed - use search(query="*", collection=name) - count_documents() removed - use get_collection_info(include_count=True) Files modified: - src/maestro_mcp/server.py (removed 3 tool functions) - tests/test_phase1_schema_validation.py (updated expected tools) - tests/test_mcp_server.py (updated expected tools) - tests/test_integration_examples.py (removed tool assertions) - tests/e2e/test_functions.py (replaced list_documents with search + query param) Related: Phase 9.1 LLM Usability Refactoring txt Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…t_database_info Consolidate chunking strategy discovery into get_database_info tool. Users can now get chunking info via include_chunking parameter. Changes: - Add include_chunking parameter to get_database_info - Remove get_supported_chunking_strategies tool - Update E2E tests to use new API Tool count: 15 → 14 tools (Phase 9.1 target achieved!) Breaking Changes: - get_supported_chunking_strategies() removed Use get_database_info(include_chunking=True) instead Files modified: - src/maestro_mcp/server.py (added include_chunking param, removed tool) - tests/test_phase1_schema_validation.py (updated expected tools) - tests/e2e/test_functions.py (use include_chunking) - tests/e2e/test_functions_simple.py (use include_chunking) - tests/e2e/test_mcp_weaviate_simple.py (use include_chunking) Related: Phase 9.1 LLM Usability Refactoring - Step 8/8 Complete Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
BREAKING CHANGE: create_database() no longer accepts collection parameter Changes: - Remove collection parameter from create_database() tool - Update factory to make collection_name optional (uses placeholder) - Update success message to clarify no collections created by default - Fix all E2E tests (8 occurrences in test_functions.py) - Update test schema validation - Update examples Users must now explicitly call create_collection() after create_database(). Before: create_database(database="mydb", database_type="milvus", collection="docs") After: create_database(database="mydb", database_type="milvus") create_collection(database="mydb", collection="docs") Files modified: - src/maestro_mcp/server.py - src/db/vector_db_factory.py - tests/test_phase1_schema_validation.py - tests/e2e/test_functions_simple.py - tests/e2e/test_functions.py (8 fixes) - tests/e2e/test_mcp_weaviate_simple.py - examples/mcp_example.py Fixes: Milvus E2E test failures (test_configuration_discovery, test_collection_specific_operations, test_full_flow) Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
BREAKING CHANGES: - Renamed `document_name` → `document_id` in get_document tool - Added required `collection` parameter to delete_documents tool Changes: - src/maestro_mcp/server.py: - delete_documents: Added collection parameter, updated to set db.collection_name - get_document: Renamed document_name parameter to document_id - tests/test_phase1_schema_validation.py: - Added parameter validation for delete_documents and get_document - tests/e2e/test_functions.py: - Updated all delete_documents calls to include collection parameter - Updated get_document call to use document_id and include collection This improves API consistency by using standard parameter names across all tools: - database (not db_name) - collection (not collection_name) - document_id (not document_name) Part of Phase 9.4: Improve Parameter Consistency Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- delete_database: Added force parameter with collection count check before deletion - All deletion operations return helpful error messages with statistics when force=False - tests/test_phase1_schema_validation.py: - Updated parameter validation to include force parameter for all deletion tools - tests/e2e/test_functions.py: - Added force=True to all delete_database calls (9 locations) - Added force=True to all delete_documents calls (3 locations) - Added force=True to delete_collection call (1 location) - tests/e2e/test_functions_simple.py: - Added force=True to delete_database calls (2 locations) - tests/e2e/test_mcp_weaviate_simple.py: - Added force=True to delete_database calls (2 locations) Safety behavior: - force=False (default): Checks if resource is empty, returns error with stats if not - force=True: Proceeds with deletion regardless of contents, includes warning in response Part of Phase 9.5: Add Safety Features Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
BREAKING CHANGE: write_documents now requires explicit collection parameter - Add required collection parameter to write_documents MCP tool - Pass collection_name directly to db.write_documents() (stateless) - Remove stateful behavior (no db.collection_name mutation) - Update all E2E tests to pass collection parameter - Fix Phase 9.3 JSON response format for list_databases - Add JSON parsing to E2E test infrastructure This change makes the API truly stateless and thread-safe: - Each request includes all necessary parameters - No shared state mutation between requests - Concurrent writes to different collections work correctly - Consistent with other document operations (delete_documents, get_document) Files modified: - src/maestro_mcp/server.py: Add collection param, pass through to method - tests/e2e/test_functions.py: Update 7 write_documents calls - tests/e2e/common.py: Add timestamp to database names for uniqueness - docs/PHASE9_COLLECTION_PARAMETER_FIX.md: Technical implementation - docs/PHASE9_API_STATELESS_AUDIT.md: Complete API audit - docs/PHASE9.3_CHANGES_LOG.md: Change tracking All E2E tests now pass (10/10). Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Phase 9.9 & 9.10 complete - Documentation and examples updated Changes: - Add Phase 9 section to README with JSON response examples - Add comprehensive Phase 9 migration guide to MIGRATION_GUIDE.md - Add Phase 9 summary to REFACTORING_SUMMARY.md - Fix deprecated embedding parameter in document_ingestion_example.py - Fix linting error in test_functions.py (Any → object) - Create PHASE9_DOCUMENTATION_UPDATES.md summary Breaking changes documented: - JSON response format (all 14 tools) - Collection parameter required in write_documents - Force parameter for destructive operations - Parameter naming consistency (database, collection, document_name) Phase 9 Status: - Core implementation: COMPLETE ✅ - Test infrastructure: COMPLETE ✅ - Documentation: COMPLETE ✅ - Examples: COMPLETE ✅ - Test file updates: PENDING (9 files) See docs/PHASE9_HANDOVER_COMPLETE.md for full status Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Rewrite MIGRATION_GUIDE.md to focus on developer needs - Remove phase-by-phase history, focus on what changed - Remove REFACTORING_SUMMARY.md (duplicated content) - Simplify README.md to reference v2.0 instead of phases - Final doc structure: 6 essential files Documentation now answers: "What changed and how do I migrate?" instead of "How did we get here?" Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Add @pytest.mark.unit to 23 unmarked tests (converters, schema validation) - Fix event loop warnings in vector_db_milvus.py with graceful shutdown check - Suppress known gRPC cleanup warnings in pytest config (grpcio issue #37535) All 267 tests now properly categorized with clean output (0 warnings). Changes: - tests/test_converters.py: Add pytestmark = pytest.mark.unit - tests/test_phase1_schema_validation.py: Add pytestmark = pytest.mark.unit - src/db/vector_db_milvus.py: Check event loop before async operations - pyproject.toml: Add filterwarnings for gRPC cleanup race condition Test results: 215 unit + 42 integration + 10 E2E = 267 tests passing Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Fix delete_collection caching: remove entries from vector_databases dict after deletion - Rename get_database_info() → get_config() for clarity - Rename get_collection_info() → get_collection() for consistency - Fix duplicate @app.tool() decorators on delete_documents and get_document - Update all tests to use new tool names (7 tests passing) - Create comprehensive API reference (docs/MCP_API_REFERENCE.md) - Document architecture issues (docs/DATABASE_COLLECTION_ARCHITECTURE.md) - Clean up temporary documentation (removed 6 redundant files) Final API: 11 tools (3 document, 4 collection, 2 query, 2 system) Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Updated MIGRATION_GUIDE.md with Phase 8.5 improvements: * Auto-bootstrap database connections (3-step → 1-step workflow) * Auto-detection of embeddings from environment variables * Optional URL parameter with auto-generation * Improved default chunking (None → Sentence-based) * Updated all workflow examples to use current API * Removed references to disabled tools - Enhanced create_collection docstring in server.py: * Documented auto-bootstrap behavior * Detailed embedding auto-detection logic * Added environment variable requirements * Updated error codes (DB_BOOTSTRAP_FAILED) - Removed temporary documentation files: * MCP_TOOLS_FIXES_SUMMARY.md (fix tracking - obsolete) * E2E_TESTS_NEED_UPDATE.md (test updates - completed) * TOOL5_UNTRACKED_COLLECTIONS.md (consolidated into audit) - Preserved permanent documentation: * MCP_TOOLS_AUDIT.md (complete audit record) * MIGRATION_GUIDE.md (updated with Phase 8.5) * MCP_API_REFERENCE.md (already current) All documentation now accurate, consistent, and consolidated. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Eliminates LLM confusion by removing "database" from response data and messages. Collections are the primary entity; database is now internal-only for connection management. Changes: - Remove "database" field from get_collection_info response data - Remove "in database" phrasing from list_collections and errors - Mark database parameter as "Internal use only" in docstrings - Keep database in metadata for internal debugging - Fix _placeholder_ bug in write_documents (pass collection to get_collection_info) - Update MIGRATION_GUIDE.md with simplified response structure docs - Remove phase-specific references from documentation All changes maintain backward compatibility while making API clearer for LLMs. Collections are now the primary user-facing abstraction. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Change default chunking from "None" to "Sentence/512/0" (Phase 8.5) - Improve COLL_ALREADY_EXISTS error to suggest write_documents first - Remove confusing collection_total_documents from response - Update docstrings to clarify collection creation is optional - Prepare response formatter for future document ID support Addresses agent confusion with: - Wrong chunk count (0 instead of actual count) - No chunking applied by default - Unhelpful error messages for existing collections - Stale document counts causing confusion Files changed: - src/maestro_mcp/server.py: Error messages, chunk count fix, docstrings - src/db/vector_db_milvus.py: Default chunking strategy - src/db/vector_db_weaviate.py: Default chunking strategy - src/maestro_mcp/response_formatter.py: Remove stale fields - docs/: Added comprehensive analysis and future feature design Improves LLM agent experience significantly. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Implements deterministic document IDs as primary identifiers for all document operations, replacing reliance on optional URL/name metadata. BREAKING CHANGES: - delete_documents: now uses document_id (not doc_name) - get_document: parameter renamed from document_name to document_id - query tool: removed (use search instead) Core Implementation: - Add document_id generation module (16-char SHA-256 hash) - URL-based if URL provided (prevents duplicates) - Content-based if no URL (idempotent writes) - Update Milvus backend: write/search/list/delete/get operations - Update Weaviate backend: write/search/list/delete/get operations - Add list_documents MCP tool with metadata_filters support - Remove query method/tool (redundant with search) Testing: - Add core document_id generation tests (3 tests) - Remove obsolete query test files (3 files) Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Critical Fixes: - Fixed race condition in write_documents where flush() is non-blocking * Added polling loop to wait for get_collection_stats() to confirm data is sealed * Polls every 0.5s up to 10s until row_count >= expected_rows * Prevents query() failures on newly written documents - Fixed missing collection_name assignment in list_documents MCP tool * Added db.collection_name = collection before calling db.list_documents() * Resolves "_placeholder_" collection error in query operations Technical Details: - search() works immediately (reads "growing" segments in memory) - query() requires sealed segments (written to disk after flush completes) - Must poll stats after flush() to confirm data is queryable Docstring Improvements: - Enhanced list_documents: Clarified case-insensitive partial matching for name/url filters - Enhanced search: Explained metadata_filters uses exact matches with ALL conditions - Enhanced get_document: Added explanation of how to obtain document_ids - Enhanced delete_documents: Listed sources for finding document_ids - Enhanced write_documents: Emphasized custom metadata fields for filtering Testing: - All operations now work immediately after write_documents - Filters tested: name_filter, url_filter, metadata_filters, min_score - Confirmed metadata filtering works with ad-hoc custom fields Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Test Updates: - Updated test_get_document_success to expect LIKE filter syntax * Changed from metadata["document_id"] == "abc123" * To metadata LIKE '%"document_id": "abc123"%' * Matches implementation's VARCHAR metadata field querying Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…N metadata ## Root Cause Manual ID assignment started at 0 on each write_documents() call, causing subsequent writes to overwrite previous documents with duplicate primary keys. ## Key Changes 1. **Schema-based collections with auto-increment IDs** - Use DataType.INT64 with auto_id=True for primary key - Remove manual id_counter and ID assignment in write operations - Add vector index creation and collection loading after schema creation 2. **JSON metadata field** - Change from VARCHAR to DataType.JSON for metadata storage - Update filters from LIKE queries to JSON path: metadata["field"] == "value" - Remove json.dumps()/json.loads() conversions 3. **Reliable queries** - Replace unreliable text != "" filter with id > 0 for list_documents - Fix metadata debug logging for dict type 4. **Default chunking fallback** - Apply Sentence(512, 0) chunking when config missing after restart - Prevents no-chunking scenario ## Migration Impact Existing collections use old schema and may have PK collision issues. New collections created after this fix will use auto-increment IDs. ## Testing All CRUD operations verified with single-chunk and multi-chunk documents, including metadata filtering, search, retrieval, and deletion. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Changed count_documents(), count_documents_in_collection(), and get_collection_info() to query unique document_ids instead of using get_collection_stats() which returned total chunk count. Added support for legacy chunks without document_id by generating synthetic IDs from URLs for backwards compatibility. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is part of a refactoring proposal for maestro-knowledge to improve the ability of LLMs to sensibly interact with the document library
Full details will be added when complete.
Tests will be run at each stage including end-to-end for milvus only
Manual checks with LLM usage (granite 4 tiny-h Q4_K_M)
High level overview
NOT READY FOR MERGE. EVALUATING NEXT STEPS