
Commit 859a56c

Complete Phase 9: Polish & Cross-Cutting Concerns
- Add graceful error handling for missing entity fields (T077)
- Enhance malformed JSON handling with encoding fallback (T078)
- Handle incomplete relationship data gracefully (T079)
- Implement context-based entity disambiguation (T080)
- Add performance monitoring with timing logs (T081)
- Optimize entity normalization with caching (T082)
- Add comprehensive logging with traceability (T083)
- Update quickstart documentation with usage examples (T084)
1 parent 5578490 commit 859a56c

34 files changed

Lines changed: 115006 additions & 81 deletions

.cursor/rules/specify-rules.mdc

Lines changed: 2 additions & 1 deletion
```diff
@@ -5,6 +5,7 @@ Auto-generated from all feature plans. Last updated: 2025-11-02
 ## Active Technologies
 - Python 3.11+ (tested with Python 3.11, 3.12, 3.13) (001-entity-data-model)
 - Python 3.11+ (aligned with existing project requirements) (003-discord-bot-access)
+- Python 3.11+ (aligned with existing project requirements, tested with Python 3.11, 3.12, 3.13) (004-refine-entity-extraction)

 - Python 3.11 (locked) (001-archive-meeting-rag)

@@ -24,9 +25,9 @@ cd src [ONLY COMMANDS FOR ACTIVE TECHNOLOGIES][ONLY COMMANDS FOR ACTIVE TECHNOLO
 Python 3.11 (locked): Follow standard conventions

 ## Recent Changes
+- 004-refine-entity-extraction: Added Python 3.11+ (aligned with existing project requirements, tested with Python 3.11, 3.12, 3.13)
 - 003-discord-bot-access: Added Python 3.11+ (aligned with existing project requirements)
 - 002-constitution-compliance: Added Python 3.11+ (tested with Python 3.11, 3.12, 3.13)
-- 001-entity-data-model: Added Python 3.11+ (tested with Python 3.11, 3.12, 3.13)


 <!-- MANUAL ADDITIONS START -->
```

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -18,6 +18,7 @@ dependencies = [
     "pytest-cov>=4.1.0",
     "gensim>=4.3.0",
     "spacy>=3.6.0",
+    "rapidfuzz>=3.0.0",
     "pydantic>=2.0.0",
     "python-dateutil>=2.8.0",
     "requests>=2.31.0",
```

requirements.txt

Lines changed: 3 additions & 0 deletions
```diff
@@ -23,6 +23,9 @@ gensim>=4.3.0
 # Entity Extraction
 spacy>=3.6.0

+# Entity Normalization
+rapidfuzz>=3.0.0
+
 # Utilities
 pydantic>=2.0.0
 python-dateutil>=2.8.0
```
Lines changed: 144 additions & 0 deletions
# Testing Entity Extraction Implementation Phases

This document describes how to test each phase of the entity extraction implementation using the terminal test command.

## Quick Start

Test all phases with a single meeting from the GitHub source:

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases all
```

Test specific phases:

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases US1,US2,US3
```

Test a different meeting (by index):

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases all \
  --meeting-index 5
```

Save results to a JSON file:

```bash
archive-rag test-entity-extraction \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json" \
  --phases all \
  --output test-results.json
```

## Available Phases

### US1: Extract Entities from JSON Structure

- Tests entity extraction from JSON objects
- Verifies entity filtering criteria
- Shows extracted entities (workgroup, meeting, people, documents, decisions, actions)

### US2: Capture Entity Relationships

- Tests relationship triple generation
- Verifies relationship types (Workgroup → Meeting, Meeting → People, etc.)
- Shows sample relationship triples
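The relationship triples US2 reports can be sketched as plain (subject, predicate, object) records. This is an illustrative sketch only: the field names (`workgroup`, `meetingInfo`, `peoplePresent`) and the predicates (`held`, `attended_by`) are assumptions, not necessarily the extractor's actual schema or vocabulary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """One (subject, predicate, object) relationship triple."""
    subject: str
    predicate: str
    object: str

def triples_from_meeting(meeting: dict) -> list[Triple]:
    """Derive relationship triples from a single meeting record."""
    triples = []
    workgroup = meeting.get("workgroup")
    info = meeting.get("meetingInfo", {})
    date = info.get("date")
    meeting_label = f"Meeting {date}" if date else "Meeting"
    if workgroup:
        triples.append(Triple(workgroup, "held", meeting_label))
    # assuming peoplePresent is a comma-separated string
    for person in info.get("peoplePresent", "").split(","):
        if person.strip():
            triples.append(Triple(meeting_label, "attended_by", person.strip()))
    return triples

example = {
    "workgroup": "Archives Workgroup",
    "meetingInfo": {"date": "2025-01-08", "peoplePresent": "Stephen, CallyFromAuron"},
}
for t in triples_from_meeting(example):
    print(f"{t.subject} --[{t.predicate}]--> {t.object}")
```

The printed lines mirror the `--[held]-->` arrow notation used in the example output further down.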
### US3: Normalize Entity References

- Tests entity name normalization
- Verifies fuzzy similarity matching (>95%)
- Shows normalization examples (e.g., "Stephen [QADAO]" → "Stephen")
### US4: Apply Named Entity Recognition to Text Fields

- Tests NER extraction from unstructured text
- Verifies entity extraction from meeting purpose, decision text, etc.
- Shows extracted NER entities with confidence scores

### US5: Chunk Text by Semantic Unit Before Embedding

- Tests semantic chunking (meeting summary, action items, decisions, etc.)
- Verifies entity context preservation in chunks
- Shows chunk types and counts
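Chunking by semantic unit, as US5 tests it, can be sketched as one chunk per meaningful field rather than per token window. The field names (`purpose`, `decisionItems`, `actionItems`) and chunk-type labels here are illustrative stand-ins for the real schema.

```python
def semantic_chunks(meeting: dict) -> list[dict]:
    """Emit one chunk per semantic unit, tagging each with its
    chunk type and the source field it came from."""
    chunks = []
    if meeting.get("purpose"):
        chunks.append({"type": "meeting_summary", "source": "purpose",
                       "text": meeting["purpose"]})
    for i, text in enumerate(meeting.get("decisionItems", [])):
        chunks.append({"type": "decision_record",
                       "source": f"decisionItems[{i}]", "text": text})
    for i, text in enumerate(meeting.get("actionItems", [])):
        chunks.append({"type": "action_item",
                       "source": f"actionItems[{i}]", "text": text})
    return chunks

meeting = {
    "purpose": "Weekly archive sync",
    "decisionItems": ["Adopt the new summary template"],
    "actionItems": ["Update the ingestion docs"],
}
for c in semantic_chunks(meeting):
    print(c["type"], "-", c["source"], "-", c["text"])
```

Keeping the source field on each chunk is what lets the test report lines like `Source Field: agendaItems[0].decisionItems[0]`.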
### US6: Generate Structured Entity Output

- Tests complete structured output generation
- Verifies all components (entities, normalized labels, relationship triples, chunks)
- Shows summary statistics

## Command Options

- `source_url` (required): URL to the source JSON file containing meetings
- `--phases`: Comma-separated list of phases to test (e.g., "US1,US2,US3") or "all" (default: all)
- `--meeting-index`: Index of the meeting to test (0-based, default: 0)
- `--output`: Path to a JSON file in which to save the test results
- `--verify-hash`: Optional SHA-256 hash to verify source file integrity
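An integrity check in the style of `--verify-hash` amounts to hashing the fetched bytes and comparing hex digests. A minimal sketch; the command's actual implementation may differ:

```python
import hashlib

def verify_sha256(content: bytes, expected_hex: str) -> bool:
    """True when the SHA-256 digest of the fetched bytes matches
    the expected hex string (compared case-insensitively)."""
    return hashlib.sha256(content).hexdigest() == expected_hex.lower()

data = b'[{"workgroup": "Archives Workgroup"}]'
expected = hashlib.sha256(data).hexdigest()
print(verify_sha256(data, expected))                # True
print(verify_sha256(data + b"tampered", expected))  # False
```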
## Example Output

```
============================================================
Entity Extraction Implementation Test
============================================================

Source URL: https://raw.githubusercontent.com/...

Fetching meeting data...
Using meeting at index 0

============================================================
Phase: US1
Extract Entities from JSON Structure
============================================================
✓ Entity extraction completed
  Extracted 2 entities
  Workgroup: Archives Workgroup (05ddaaf0-1dde-4d84-a722-f82c8479a8e9)
  Meeting: Meeting 2025-01-08 (880e8400-e29b-41d4-a716-446655440000)

============================================================
Phase: US2
Capture Entity Relationships
============================================================
✓ Relationship extraction completed
  Generated 3 relationship triples

  Sample relationships:
    Archives Workgroup (Workgroup) --[held]--> Meeting 2025-01-08 (Meeting)
    Meeting (Meeting) --[has]--> Action Item (ActionItem)
    Decision (Decision) --[has_effect]--> mayAffectOtherPeople (Effect)

...
```

## Troubleshooting

### spaCy Model Not Found

If you see an error about a missing spaCy model, download it:

```bash
python -m spacy download en_core_web_sm
```

### Entity Storage Not Found

If entities aren't found, make sure you've ingested meetings first:

```bash
archive-rag ingest-entities "https://raw.githubusercontent.com/..."
```

### No Relationships Found

If US2 shows no relationships, this may be expected when:

- The meeting doesn't have action items or decisions
- Relationship triple generation is not yet fully implemented
- Entities need to be loaded from storage first

## Next Steps

After testing, you can:

1. Review the test results to verify that each phase is working
2. Check the entity storage directories to see the created entities
3. Use the `query-entity` commands to query extracted entities
4. Continue with the implementation of the remaining phases
Lines changed: 166 additions & 0 deletions
# Testing Semantic Chunking vs Token-Based Chunking

This document describes how to test the effect of semantic chunking on query results.

## Overview

The `test-semantic-chunking` command allows you to:

1. Index meetings using both semantic chunking and token-based chunking
2. Run test queries against both indices
3. Compare results showing entity metadata, relationships, and retrieval scores

## Usage

### Basic Usage with the Official Source

```bash
# Use the official SingularityNET Archive source (120+ meetings)
archive-rag test-semantic-chunking \
  "https://raw.githubusercontent.com/SingularityNET-Archive/SingularityNET-Archive/refs/heads/main/Data/Snet-Ambassador-Program/Meeting-Summaries/2025/meeting-summaries-array.json"
```

### Basic Usage with a Custom Source

```bash
archive-rag test-semantic-chunking <source_url>
```

### With Custom Queries

```bash
archive-rag test-semantic-chunking <source_url> \
  --queries "What decisions were made?,Who attended?,What action items?"
```

### With Options

```bash
archive-rag test-semantic-chunking <source_url> \
  --queries "What decisions were made?" \
  --top-k 10 \
  --meeting-limit 5 \
  --output-dir ./my_test_indices \
  --chunk-size 512 \
  --chunk-overlap 50
```

## Command Options

- `source_url` (required): URL to the source JSON file containing meetings
- `--queries`: Comma-separated list of queries to test (default: built-in test queries)
- `--top-k`: Number of chunks to retrieve per query (default: 5)
- `--embedding-model`: Embedding model name (default: "sentence-transformers/all-MiniLM-L6-v2")
- `--chunk-size`: Chunk size for token-based chunking (default: 512)
- `--chunk-overlap`: Overlap for token-based chunking (default: 50)
- `--meeting-limit`: Limit the number of meetings to process (for faster testing)
- `--output-dir`: Directory in which to save index files (default: ./test_indices)

## Default Test Queries

If no custom queries are provided, the following queries are used:

1. "What decisions were made?"
2. "Who attended the meetings?"
3. "What action items were assigned?"
4. "What workgroups are involved?"
5. "What documents were discussed?"

## Output

The command provides:

1. **Indexing Results**: Shows how many chunks were created by each method
2. **Query Results**: For each query, shows:
   - Retrieved chunks from semantic chunking
   - Retrieved chunks from token-based chunking
   - A comparison of average scores
   - Entity metadata in semantic chunks
   - Relationship metadata in semantic chunks
3. **Summary**: Total statistics and index file locations
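The per-query score comparison can be sketched as a small helper (hypothetical, not the command's actual code): the mean retrieval score for each method plus their gap, where a positive difference favors semantic chunking.

```python
def compare_scores(semantic_scores: list[float], token_scores: list[float]) -> dict:
    """Summarize one query's retrieval scores: mean per method plus the gap."""
    def avg(xs: list[float]) -> float:
        return sum(xs) / len(xs) if xs else 0.0

    semantic_avg = avg(semantic_scores)
    token_avg = avg(token_scores)
    return {
        "semantic_avg": round(semantic_avg, 4),
        "token_avg": round(token_avg, 4),
        "difference": round(semantic_avg - token_avg, 4),
    }

print(compare_scores([0.82, 0.79, 0.75], [0.78, 0.74, 0.70]))
```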
## Example Output

```
======================================================================
Semantic Chunking Query Test
======================================================================

Step 1: Indexing with Semantic Chunking
  ✓ Meeting 1/5: 12 semantic chunks
  ✓ Meeting 2/5: 8 semantic chunks
  ...
✓ Created 45 semantic chunks total

Step 2: Indexing with Token-Based Chunking
  ✓ Meeting 1/5: 15 token chunks
  ✓ Meeting 2/5: 10 token chunks
  ...
✓ Created 60 token chunks total

Step 3: Running Test Queries
----------------------------------------------------------------------
Query: What decisions were made?
----------------------------------------------------------------------

[Semantic Chunking Results]
Retrieved 5 chunks:
  [1] Score: 0.8234
      Text: The team decided to implement...
      Meeting ID: abc-123-def
      Chunk Type: decision_record
      Source Field: agendaItems[0].decisionItems[0]
      Entities: 3 entity(ies) mentioned
        - Archives Workgroup (Workgroup)
        - Stephen (Person)
        - Meeting Decision (DecisionItem)
      Relationships: 2 relationship(s)
        - Workgroup -> made -> Decision
        - Person -> attended -> Meeting

[Token-Based Chunking Results]
Retrieved 5 chunks:
  [1] Score: 0.7812
      Text: The team decided to implement...
      Meeting ID: abc-123-def

[Comparison]
  Average semantic chunk score: 0.8234
  Average token chunk score: 0.7812
  Difference: +0.0422
  Semantic chunks with entities: 5/5
```

## Understanding the Results

### Semantic Chunking Advantages

1. **Entity Metadata**: Semantic chunks include embedded entity information, making it easier to understand context
2. **Relationship Information**: Chunks include relationship triples showing how entities relate
3. **Better Retrieval**: Semantic chunks may have higher retrieval scores for entity-focused queries
4. **Structured Context**: Each chunk is aligned with a semantic unit (a decision, an action item, etc.) rather than arbitrary token boundaries

### Token-Based Chunking Characteristics

1. **Simple Splitting**: Chunks are split at fixed token boundaries
2. **No Entity Metadata**: Chunks don't include structured entity information
3. **Consistent Sizing**: Chunks are more uniform in size
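Fixed-boundary splitting of this kind can be sketched as a sliding window; whitespace tokens stand in for the real tokenizer, and the tiny sizes are only for illustration (the command defaults to `--chunk-size 512` and `--chunk-overlap 50`).

```python
def token_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows of whitespace tokens; each
    window starts (chunk_size - overlap) tokens after the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

text = " ".join(f"w{i}" for i in range(10))
print(token_chunks(text, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9', 'w9']
```

Note the short tail chunk in the output; production splitters often merge or drop it.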
## Index Files

The command creates two index files:

- `{output_dir}/semantic_index.faiss` - Semantic chunking index
- `{output_dir}/token_index.faiss` - Token-based chunking index

These can be reused for further testing or analysis.

## Tips

1. **Start Small**: Use `--meeting-limit` to test with a few meetings first
2. **Custom Queries**: Provide queries relevant to your use case
3. **Compare Scores**: Look at the difference in average scores; a positive difference indicates semantic chunking performed better for that query
4. **Entity Coverage**: Check how many semantic chunks have entities embedded; this shows the effectiveness of entity extraction

## Troubleshooting

- **No chunks created**: Ensure the meetings have content (purpose, action items, decisions, etc.)
- **Empty results**: Check that meetings have been properly ingested and entities extracted
- **Import errors**: Ensure all dependencies are installed (`pip install -r requirements.txt`)
Lines changed: 39 additions & 0 deletions
# Specification Quality Checklist: Refine Entity Extraction

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2025-01-21
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

- All checklist items pass validation
- Specification is ready for `/speckit.clarify` or `/speckit.plan`
- No [NEEDS CLARIFICATION] markers found in the specification
- Success criteria are technology-agnostic and focus on user-facing outcomes (extraction accuracy, relationship capture, normalization quality)
- The Assumptions section appropriately documents domain knowledge (JSON structure, NER availability) without prescribing implementation
