
Commit 859a56c

Complete Phase 9: Polish & Cross-Cutting Concerns
- Add graceful error handling for missing entity fields (T077)
- Enhance malformed JSON handling with encoding fallback (T078)
- Handle incomplete relationship data gracefully (T079)
- Implement context-based entity disambiguation (T080)
- Add performance monitoring with timing logs (T081)
- Optimize entity normalization with caching (T082)
- Add comprehensive logging with traceability (T083)
- Update quickstart documentation with usage examples (T084)
1 parent 5578490 commit 859a56c

34 files changed

Lines changed: 115006 additions & 81 deletions

.cursor/rules/specify-rules.mdc

Lines changed: 2 additions & 1 deletion
```diff
@@ -5,6 +5,7 @@ Auto-generated from all feature plans. Last updated: 2025-11-02
 ## Active Technologies
 - Python 3.11+ (tested with Python 3.11, 3.12, 3.13) (001-entity-data-model)
 - Python 3.11+ (aligned with existing project requirements) (003-discord-bot-access)
+- Python 3.11+ (aligned with existing project requirements, tested with Python 3.11, 3.12, 3.13) (004-refine-entity-extraction)

 - Python 3.11 (locked) (001-archive-meeting-rag)

@@ -24,9 +25,9 @@ cd src [ONLY COMMANDS FOR ACTIVE TECHNOLOGIES][ONLY COMMANDS FOR ACTIVE TECHNOLO
 Python 3.11 (locked): Follow standard conventions

 ## Recent Changes
+- 004-refine-entity-extraction: Added Python 3.11+ (aligned with existing project requirements, tested with Python 3.11, 3.12, 3.13)
 - 003-discord-bot-access: Added Python 3.11+ (aligned with existing project requirements)
 - 002-constitution-compliance: Added Python 3.11+ (tested with Python 3.11, 3.12, 3.13)
-- 001-entity-data-model: Added Python 3.11+ (tested with Python 3.11, 3.12, 3.13)


 <!-- MANUAL ADDITIONS START -->
```

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -18,6 +18,7 @@ dependencies = [
     "pytest-cov>=4.1.0",
     "gensim>=4.3.0",
     "spacy>=3.6.0",
+    "rapidfuzz>=3.0.0",
     "pydantic>=2.0.0",
     "python-dateutil>=2.8.0",
     "requests>=2.31.0",
```

requirements.txt

Lines changed: 3 additions & 0 deletions
```diff
@@ -23,6 +23,9 @@ gensim>=4.3.0
 # Entity Extraction
 spacy>=3.6.0

+# Entity Normalization
+rapidfuzz>=3.0.0
+
 # Utilities
 pydantic>=2.0.0
 python-dateutil>=2.8.0
```
Lines changed: 144 additions & 0 deletions
# Testing Entity Extraction Implementation Phases

This document describes how to test each phase of the entity extraction implementation using the terminal test command.

## Quick Start

Test all phases with a single meeting from the GitHub source:

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases all
```

Test specific phases:

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases US1,US2,US3
```

Test a different meeting (by index):

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases all \
  --meeting-index 5
```

Save results to a JSON file:

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases all \
  --output test-results.json
```

## Available Phases

### US1: Extract Entities from JSON Structure

- Tests entity extraction from JSON objects
- Verifies entity filtering criteria
- Shows extracted entities (workgroup, meeting, people, documents, decisions, actions)

### US2: Capture Entity Relationships

- Tests relationship triple generation
- Verifies relationship types (Workgroup → Meeting, Meeting → People, etc.)
- Shows sample relationship triples
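The relationship triples US2 reports can be sketched as plain (subject, predicate, object) records. This is an illustrative sketch only: the field names (`workgroup`, `meetingInfo`, `peoplePresent`) and the predicates (`held`, `attended_by`) are assumptions, not necessarily the extractor's actual schema or vocabulary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """One (subject, predicate, object) relationship triple."""
    subject: str
    predicate: str
    object: str

def triples_from_meeting(meeting: dict) -> list[Triple]:
    """Derive relationship triples from a single meeting record."""
    triples = []
    workgroup = meeting.get("workgroup")
    info = meeting.get("meetingInfo", {})
    date = info.get("date")
    meeting_label = f"Meeting {date}" if date else "Meeting"
    if workgroup:
        triples.append(Triple(workgroup, "held", meeting_label))
    # assuming peoplePresent is a comma-separated string
    for person in info.get("peoplePresent", "").split(","):
        if person.strip():
            triples.append(Triple(meeting_label, "attended_by", person.strip()))
    return triples

example = {
    "workgroup": "Archives Workgroup",
    "meetingInfo": {"date": "2025-01-08", "peoplePresent": "Stephen, CallyFromAuron"},
}
for t in triples_from_meeting(example):
    print(f"{t.subject} --[{t.predicate}]--> {t.object}")
```

The printed lines mirror the `--[held]-->` arrow notation used in the example output further down.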
### US3: Normalize Entity References

- Tests entity name normalization
- Verifies fuzzy similarity matching (>95%)
- Shows normalization examples (e.g., "Stephen [QADAO]" → "Stephen")
### US4: Apply Named Entity Recognition to Text Fields

- Tests NER extraction from unstructured text
- Verifies entity extraction from meeting purpose, decision text, etc.
- Shows extracted NER entities with confidence scores

### US5: Chunk Text by Semantic Unit Before Embedding

- Tests semantic chunking (meeting summary, action items, decisions, etc.)
- Verifies entity context preservation in chunks
- Shows chunk types and counts
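Chunking by semantic unit, as US5 tests it, can be sketched as one chunk per meaningful field rather than per token window. The field names (`purpose`, `decisionItems`, `actionItems`) and chunk-type labels here are illustrative stand-ins for the real schema.

```python
def semantic_chunks(meeting: dict) -> list[dict]:
    """Emit one chunk per semantic unit, tagging each with its
    chunk type and the source field it came from."""
    chunks = []
    if meeting.get("purpose"):
        chunks.append({"type": "meeting_summary", "source": "purpose",
                       "text": meeting["purpose"]})
    for i, text in enumerate(meeting.get("decisionItems", [])):
        chunks.append({"type": "decision_record",
                       "source": f"decisionItems[{i}]", "text": text})
    for i, text in enumerate(meeting.get("actionItems", [])):
        chunks.append({"type": "action_item",
                       "source": f"actionItems[{i}]", "text": text})
    return chunks

meeting = {
    "purpose": "Weekly archive sync",
    "decisionItems": ["Adopt the new summary template"],
    "actionItems": ["Update the ingestion docs"],
}
for c in semantic_chunks(meeting):
    print(c["type"], "-", c["source"], "-", c["text"])
```

Keeping the source field on each chunk is what lets the test report lines like `Source Field: agendaItems[0].decisionItems[0]`.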
### US6: Generate Structured Entity Output

- Tests complete structured output generation
- Verifies all components (entities, normalized labels, relationship triples, chunks)
- Shows summary statistics

## Command Options

- `source_url` (required): URL to the source JSON file containing meetings
- `--phases`: Comma-separated list of phases to test (e.g., "US1,US2,US3") or "all" (default: all)
- `--meeting-index`: Index of the meeting to test (0-based, default: 0)
- `--output`: Path to a JSON file in which to save the test results
- `--verify-hash`: Optional SHA-256 hash to verify source file integrity
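An integrity check in the style of `--verify-hash` amounts to hashing the fetched bytes and comparing hex digests. A minimal sketch; the command's actual implementation may differ:

```python
import hashlib

def verify_sha256(content: bytes, expected_hex: str) -> bool:
    """True when the SHA-256 digest of the fetched bytes matches
    the expected hex string (compared case-insensitively)."""
    return hashlib.sha256(content).hexdigest() == expected_hex.lower()

data = b'[{"workgroup": "Archives Workgroup"}]'
expected = hashlib.sha256(data).hexdigest()
print(verify_sha256(data, expected))                # True
print(verify_sha256(data + b"tampered", expected))  # False
```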
## Example Output

```
============================================================
Entity Extraction Implementation Test
============================================================

Source URL: https://raw.githubusercontent.com/...

Fetching meeting data...
Using meeting at index 0

============================================================
Phase: US1
Extract Entities from JSON Structure
============================================================
✓ Entity extraction completed
  Extracted 2 entities
  Workgroup: Archives Workgroup (05ddaaf0-1dde-4d84-a722-f82c8479a8e9)
  Meeting: Meeting 2025-01-08 (880e8400-e29b-41d4-a716-446655440000)

============================================================
Phase: US2
Capture Entity Relationships
============================================================
✓ Relationship extraction completed
  Generated 3 relationship triples

  Sample relationships:
    Archives Workgroup (Workgroup) --[held]--> Meeting 2025-01-08 (Meeting)
    Meeting (Meeting) --[has]--> Action Item (ActionItem)
    Decision (Decision) --[has_effect]--> mayAffectOtherPeople (Effect)

...
```

## Troubleshooting

### spaCy Model Not Found

If you see an error about a missing spaCy model, download it:

```bash
python -m spacy download en_core_web_sm
```

### Entity Storage Not Found

If entities aren't found, make sure you've ingested meetings first:

```bash
archive-rag ingest-entities "https://raw.githubusercontent.com/..."
```

### No Relationships Found

If US2 shows no relationships, this may be expected when:

- The meeting doesn't have action items or decisions
- Relationship triple generation is not yet fully implemented
- Entities need to be loaded from storage first

## Next Steps

After testing, you can:

1. Review the test results to verify that each phase is working
2. Check the entity storage directories to see the created entities
3. Use the `query-entity` commands to query extracted entities
4. Continue with the implementation of the remaining phases
Lines changed: 166 additions & 0 deletions
# Testing Semantic Chunking vs Token-Based Chunking

This document describes how to test the effect of semantic chunking on query results.

## Overview

The `test-semantic-chunking` command allows you to:

1. Index meetings using both semantic chunking and token-based chunking
2. Run test queries against both indices
3. Compare results showing entity metadata, relationships, and retrieval scores

## Usage

### Basic Usage with the Official Source

```bash
# Use the official SingularityNET Archive source (120+ meetings)
archive-rag test-semantic-chunking \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json"
```

### Basic Usage with a Custom Source

```bash
archive-rag test-semantic-chunking <source_url>
```

### With Custom Queries

```bash
archive-rag test-semantic-chunking <source_url> \
  --queries "What decisions were made?,Who attended?,What action items?"
```

### With Options

```bash
archive-rag test-semantic-chunking <source_url> \
  --queries "What decisions were made?" \
  --top-k 10 \
  --meeting-limit 5 \
  --output-dir ./my_test_indices \
  --chunk-size 512 \
  --chunk-overlap 50
```

## Command Options

- `source_url` (required): URL to the source JSON file containing meetings
- `--queries`: Comma-separated list of queries to test (default: built-in test queries)
- `--top-k`: Number of chunks to retrieve per query (default: 5)
- `--embedding-model`: Embedding model name (default: "sentence-transformers/all-MiniLM-L6-v2")
- `--chunk-size`: Chunk size for token-based chunking (default: 512)
- `--chunk-overlap`: Overlap for token-based chunking (default: 50)
- `--meeting-limit`: Limit the number of meetings to process (for faster testing)
- `--output-dir`: Directory in which to save index files (default: ./test_indices)

## Default Test Queries

If no custom queries are provided, the following queries are used:

1. "What decisions were made?"
2. "Who attended the meetings?"
3. "What action items were assigned?"
4. "What workgroups are involved?"
5. "What documents were discussed?"

## Output

The command provides:

1. **Indexing Results**: Shows how many chunks were created by each method
2. **Query Results**: For each query, shows:
   - Retrieved chunks from semantic chunking
   - Retrieved chunks from token-based chunking
   - A comparison of average scores
   - Entity metadata in semantic chunks
   - Relationship metadata in semantic chunks
3. **Summary**: Total statistics and index file locations
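The per-query score comparison can be sketched as a small helper (hypothetical, not the command's actual code): the mean retrieval score for each method plus their gap, where a positive difference favors semantic chunking.

```python
def compare_scores(semantic_scores: list[float], token_scores: list[float]) -> dict:
    """Summarize one query's retrieval scores: mean per method plus the gap."""
    def avg(xs: list[float]) -> float:
        return sum(xs) / len(xs) if xs else 0.0

    semantic_avg = avg(semantic_scores)
    token_avg = avg(token_scores)
    return {
        "semantic_avg": round(semantic_avg, 4),
        "token_avg": round(token_avg, 4),
        "difference": round(semantic_avg - token_avg, 4),
    }

print(compare_scores([0.82, 0.79, 0.75], [0.78, 0.74, 0.70]))
```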
## Example Output

```
======================================================================
Semantic Chunking Query Test
======================================================================

Step 1: Indexing with Semantic Chunking
  ✓ Meeting 1/5: 12 semantic chunks
  ✓ Meeting 2/5: 8 semantic chunks
  ...
✓ Created 45 semantic chunks total

Step 2: Indexing with Token-Based Chunking
  ✓ Meeting 1/5: 15 token chunks
  ✓ Meeting 2/5: 10 token chunks
  ...
✓ Created 60 token chunks total

Step 3: Running Test Queries
----------------------------------------------------------------------
Query: What decisions were made?
----------------------------------------------------------------------

[Semantic Chunking Results]
Retrieved 5 chunks:
  [1] Score: 0.8234
      Text: The team decided to implement...
      Meeting ID: abc-123-def
      Chunk Type: decision_record
      Source Field: agendaItems[0].decisionItems[0]
      Entities: 3 entity(ies) mentioned
        - Archives Workgroup (Workgroup)
        - Stephen (Person)
        - Meeting Decision (DecisionItem)
      Relationships: 2 relationship(s)
        - Workgroup -> made -> Decision
        - Person -> attended -> Meeting

[Token-Based Chunking Results]
Retrieved 5 chunks:
  [1] Score: 0.7812
      Text: The team decided to implement...
      Meeting ID: abc-123-def

[Comparison]
  Average semantic chunk score: 0.8234
  Average token chunk score: 0.7812
  Difference: +0.0422
  Semantic chunks with entities: 5/5
```

## Understanding the Results

### Semantic Chunking Advantages

1. **Entity Metadata**: Semantic chunks include embedded entity information, making it easier to understand context
2. **Relationship Information**: Chunks include relationship triples showing how entities relate
3. **Better Retrieval**: Semantic chunks may have higher retrieval scores for entity-focused queries
4. **Structured Context**: Each chunk is aligned with a semantic unit (a decision, an action item, etc.) rather than arbitrary token boundaries

### Token-Based Chunking Characteristics

1. **Simple Splitting**: Chunks are split at fixed token boundaries
2. **No Entity Metadata**: Chunks don't include structured entity information
3. **Consistent Sizing**: Chunks are more uniform in size
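Fixed-boundary splitting of this kind can be sketched as a sliding window; whitespace tokens stand in for the real tokenizer, and the tiny sizes are only for illustration (the command defaults to `--chunk-size 512` and `--chunk-overlap 50`).

```python
def token_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows of whitespace tokens; each
    window starts (chunk_size - overlap) tokens after the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

text = " ".join(f"w{i}" for i in range(10))
print(token_chunks(text, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9', 'w9']
```

Note the short tail chunk in the output; production splitters often merge or drop it.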
## Index Files

The command creates two index files:

- `{output_dir}/semantic_index.faiss` - Semantic chunking index
- `{output_dir}/token_index.faiss` - Token-based chunking index

These can be reused for further testing or analysis.

## Tips

1. **Start Small**: Use `--meeting-limit` to test with a few meetings first
2. **Custom Queries**: Provide queries relevant to your use case
3. **Compare Scores**: Look at the difference in average scores; a positive difference indicates semantic chunking performed better for that query
4. **Entity Coverage**: Check how many semantic chunks have entities embedded; this shows the effectiveness of entity extraction

## Troubleshooting

- **No chunks created**: Ensure the meetings have content (purpose, action items, decisions, etc.)
- **Empty results**: Check that meetings have been properly ingested and entities extracted
- **Import errors**: Ensure all dependencies are installed (`pip install -r requirements.txt`)
Lines changed: 39 additions & 0 deletions
# Specification Quality Checklist: Refine Entity Extraction

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2025-01-21
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

- All checklist items pass validation
- Specification is ready for `/speckit.clarify` or `/speckit.plan`
- No [NEEDS CLARIFICATION] markers found in the specification
- Success criteria are technology-agnostic and focus on user-facing outcomes (extraction accuracy, relationship capture, normalization quality)
- The Assumptions section appropriately documents domain knowledge (JSON structure, NER availability) without prescribing implementation
