feat: improve FTS5 search foundations #46

Merged
BYK merged 1 commit into main from feat/fts5-foundations on Mar 22, 2026

@BYK BYK commented Mar 22, 2026

Phase 1 of search improvements

Fixes the FTS5 search foundations as the first step toward a comprehensive search overhaul.

Changes

New: src/search.ts — centralized search module

  • ftsQuery() — AND-based FTS5 query builder with stopword + single-char filtering
  • ftsQueryOr() — OR-based variant for fallback when AND returns nothing
  • STOPWORDS — conservative set (only genuinely content-free words, preserves domain terms like handle, state, type)
  • EMPTY_QUERY sentinel for all-stopword queries
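
The query-building rules above can be sketched roughly as follows. This is an illustration only, not the code in src/search.ts: the stopword list, the ASCII-only tokenizer, the quoting of terms, and the empty-string value of EMPTY_QUERY are all assumptions made for the example.

```typescript
// Hypothetical stopword list; the real STOPWORDS set is not shown in this PR text.
const STOPWORDS: ReadonlySet<string> = new Set([
  "a", "an", "and", "are", "is", "it", "of", "or", "the", "to", "what", "with",
]);

// Assumed sentinel value for "nothing searchable remains".
const EMPTY_QUERY = "";

// Lowercase, split on non-alphanumerics (ASCII only, a simplification),
// then drop stopwords and single-character tokens such as the "s" in "it's".
function tokenize(input: string): string[] {
  return input
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 1 && !STOPWORDS.has(t));
}

// Quote each term so FTS5 reads it as a string rather than as query syntax.
function buildQuery(input: string, joiner: " AND " | " OR "): string {
  const terms = tokenize(input);
  if (terms.length === 0) return EMPTY_QUERY;
  return terms.map((t) => `"${t}"`).join(joiner);
}

// AND variant: every remaining term must match.
function ftsQuery(input: string): string {
  return buildQuery(input, " AND ");
}

// OR variant: any remaining term may match.
function ftsQueryOr(input: string): string {
  return buildQuery(input, " OR ");
}
```

With this sketch, a domain term such as "state" or "handle" survives filtering because it is not in the stopword set, while a query made up only of stopwords collapses to the sentinel.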

Fixed: Knowledge search ranking

  • Was: ORDER BY updated_at DESC (most recently edited wins regardless of relevance)
  • Now: ORDER BY bm25(knowledge_fts, 6.0, 2.0, 3.0) (column weights: title 6.0, content 2.0, category 3.0)
  • Uses JOIN pattern instead of subquery for proper rank access
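
For illustration, a JOIN-based ranked query might look like the sketch below. The base table name `knowledge`, the rowid join, and the LIMIT parameter are assumptions; only the bm25() call and its weights come from this PR. FTS5's bm25() assigns numerically lower values to better matches, so a plain ascending ORDER BY puts the most relevant rows first.

```typescript
// Sketch only: `knowledge` and the rowid join are assumed, not taken from the PR.
const KNOWLEDGE_SEARCH_SQL = `
  SELECT k.*, bm25(knowledge_fts, 6.0, 2.0, 3.0) AS rank
  FROM knowledge_fts
  JOIN knowledge AS k ON k.rowid = knowledge_fts.rowid
  WHERE knowledge_fts MATCH ?
  ORDER BY rank
  LIMIT ?
`;
```

Joining against the FTS table directly keeps the bm25() value available as a selectable column, which a `WHERE rowid IN (subquery)` form would not expose to the outer ORDER BY.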

New: distillation_fts table (schema migration v7)

  • FTS5 on observations column with porter unicode61 tokenizer
  • Replaces LIKE-based distillation search with BM25-ranked FTS5 search
  • Backfills existing data, sync triggers for INSERT/UPDATE/DELETE
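
A migration of this shape could be written along the following lines. This sketch assumes an external-content FTS5 table over a base table named `distillations` with an implicit rowid, and the trigger names are hypothetical; the actual v7 migration may differ.

```typescript
// Sketch only: `distillations` and all trigger names are assumptions.
const DISTILLATION_FTS_MIGRATION: string[] = [
  // External-content FTS5 index over the observations column.
  `CREATE VIRTUAL TABLE distillation_fts USING fts5(
     observations,
     content='distillations',
     content_rowid='rowid',
     tokenize='porter unicode61'
   )`,
  // Backfill rows that existed before the migration.
  `INSERT INTO distillation_fts(rowid, observations)
     SELECT rowid, observations FROM distillations`,
  // Keep the index in sync on INSERT.
  `CREATE TRIGGER distillations_fts_ai AFTER INSERT ON distillations BEGIN
     INSERT INTO distillation_fts(rowid, observations)
       VALUES (new.rowid, new.observations);
   END`,
  // On DELETE, external-content tables need the special 'delete' command.
  `CREATE TRIGGER distillations_fts_ad AFTER DELETE ON distillations BEGIN
     INSERT INTO distillation_fts(distillation_fts, rowid, observations)
       VALUES ('delete', old.rowid, old.observations);
   END`,
  // On UPDATE, remove the old entry and then index the new text.
  `CREATE TRIGGER distillations_fts_au AFTER UPDATE ON distillations BEGIN
     INSERT INTO distillation_fts(distillation_fts, rowid, observations)
       VALUES ('delete', old.rowid, old.observations);
     INSERT INTO distillation_fts(rowid, observations)
       VALUES (new.rowid, new.observations);
   END`,
];
```

Each statement would be run in order inside the migration's transaction.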

Improved: AND→OR fallback pattern

  • All search functions try AND first (precision), fall back to OR when nothing matches (recall)
  • Blanket OR was tested empirically and rejected: it adds noise even with stopword filtering
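
The fallback pattern can be sketched as a small generic helper. The names `QueryBuilders` and `searchWithFallback` are hypothetical, and the search function is passed in as a callback so the example stays independent of any database driver.

```typescript
// Hypothetical shape bundling the two query builders and the empty sentinel.
interface QueryBuilders {
  and: (input: string) => string;
  or: (input: string) => string;
  empty: string;
}

// Try the AND query first (precision); run the OR query only when AND finds nothing (recall).
function searchWithFallback<T>(
  input: string,
  builders: QueryBuilders,
  run: (match: string) => T[],
): T[] {
  const andQuery = builders.and(input);
  if (andQuery === builders.empty) return []; // nothing searchable in the input

  const strict = run(andQuery);
  if (strict.length > 0) return strict;

  const orQuery = builders.or(input);
  // With a single term the two queries are identical, so a second run is pointless.
  if (orQuery === andQuery) return strict;
  return run(orQuery);
}
```

Because OR is only consulted after AND comes back empty, a query that already has precise matches never picks up the extra noise of partial matches.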

New: "Too vague" handling

  • When query is all stopwords/single-chars, recall tool returns guidance message instead of empty results
  • Prompts the LLM to reformulate with specific keywords
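
One way to express this guard is shown below. The message wording, the `RecallOutcome` type, and the function name are made up for the example; the PR only states that a guidance message replaces the empty result.

```typescript
// Hypothetical result type distinguishing "too vague" from a real search.
type RecallOutcome =
  | { kind: "too_vague"; message: string }
  | { kind: "results"; items: string[] };

// Assumed wording; the actual guidance text is not shown in the PR.
const TOO_VAGUE_MESSAGE =
  "Query is too vague to search. Retry with specific keywords such as names, identifiers, or error text.";

// Return guidance when the built query equals the empty sentinel; otherwise run the search.
function recallOrGuide(
  matchQuery: string,
  emptySentinel: string,
  search: (match: string) => string[],
): RecallOutcome {
  if (matchQuery === emptySentinel) {
    return { kind: "too_vague", message: TOO_VAGUE_MESSAGE };
  }
  return { kind: "results", items: search(matchQuery) };
}
```

Returning a distinct outcome rather than an empty list lets the caller tell "nothing matched" apart from "nothing was searchable".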

Test coverage

  • 23 new tests in test/search.test.ts (query building, stopwords, edge cases)
  • New BM25 ranking test + AND→OR fallback test in test/ltm.test.ts
  • Schema v7 + distillation_fts verification in test/db.test.ts
  • Updated temporal.test.ts for new import path + behavior

- Create src/search.ts with centralized ftsQuery/ftsQueryOr functions
- Add stopword filtering (conservative list, preserves domain terms)
- Drop single-char tokens (contraction artifacts) but keep 2-char+ terms
- Implement AND-then-OR fallback: AND first for precision, OR when AND returns nothing
- Fix knowledge search to use BM25 rank instead of updated_at DESC
  - Uses bm25() with column weights: title=6.0, content=2.0, category=3.0
  - JOIN pattern instead of subquery for proper rank access
- Add distillation_fts table (schema migration v7)
  - FTS5 on observations column with porter unicode61 tokenizer
  - Backfill existing data, sync triggers for INSERT/UPDATE/DELETE
- Replace LIKE-based distillation search with FTS5 ranked search
- Add 'too vague' handling in recall tool for all-stopword queries
- Remove ftsQuery from temporal.ts (now in search.ts, no re-export)
@BYK BYK enabled auto-merge (squash) March 22, 2026 14:25
@BYK BYK merged commit 60cfc76 into main Mar 22, 2026
1 check passed
@BYK BYK deleted the feat/fts5-foundations branch March 22, 2026 14:25
BYK added a commit that referenced this pull request Mar 22, 2026
## Phase 2 of search improvements (depends on #46)

Adds cross-source score fusion using Reciprocal Rank Fusion and rewrites
the recall tool to produce a single ranked result list.

### Changes

**New in `src/search.ts`**
- `reciprocalRankFusion<T>()` — merges multiple ranked lists using RRF
(k=60, Cormack et al. 2009). Rank-based, not score-based, so magnitude
differences across FTS tables don't matter.
- `normalizeRank()` — min-max normalization of FTS5 BM25 ranks to 0–1
(for display only)
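
As a rough sketch of the two helpers: RRF gives each item a score of 1 / (k + rank) per list it appears in, with rank starting at 1, and sums those scores across lists. The signatures below, the `RankedList` type, and the direction of the normalization (best rank maps to 1) are assumptions, not the code in `src/search.ts`.

```typescript
// Hypothetical input shape: one ranked list (best first) plus a key function for dedup.
interface RankedList<T> {
  items: T[];
  key: (item: T) => string;
}

// Sum 1 / (k + rank) over every list an item appears in, then sort by fused score.
function reciprocalRankFusion<T>(
  lists: RankedList<T>[],
  k = 60,
): Array<{ item: T; score: number }> {
  const fused = new Map<string, { item: T; score: number }>();
  for (const list of lists) {
    list.items.forEach((item, index) => {
      const contribution = 1 / (k + index + 1); // ranks are 1-based
      const existing = fused.get(list.key(item));
      if (existing) existing.score += contribution;
      else fused.set(list.key(item), { item, score: contribution });
    });
  }
  return Array.from(fused.values()).sort((a, b) => b.score - a.score);
}

// Min-max normalize an FTS5 BM25 rank to 0..1 for display.
// FTS5 ranks are lower-is-better, so the minimum rank maps to 1 (assumed direction).
function normalizeRank(rank: number, min: number, max: number): number {
  if (max === min) return 1;
  return (max - rank) / (max - min);
}
```

Because only positions enter the formula, an item ranked second in two lists can outrank an item ranked first in just one, regardless of how the raw BM25 magnitudes compare across tables.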

**New scored search variants**
- `ltm.searchScored()` — returns `KnowledgeEntry & { rank }` with BM25
scores via `bm25(knowledge_fts, 6, 2, 3)`
- `temporal.searchScored()` — returns `TemporalMessage & { rank }`
- `searchDistillationsScored()` — returns `Distillation & { rank }`

All scored variants include AND→OR fallback (same as Phase 1 search
functions).

**Rewritten recall tool**
- Runs all 3 scored searches, tags results with source type
- Fuses via RRF into a single ranked list
- Output format: source-annotated list (`[knowledge/category]`,
`[distilled]`, `[temporal/role]`)
- Most relevant results appear first regardless of which source they
came from

### Test coverage
- 11 new tests for `normalizeRank()` and `reciprocalRankFusion()`
- Tests cover: multi-list merge, dedup, empty lists, single list, custom
k, score correctness