Summary
Index RDF data from SPARQL stores into Typesense for hybrid search (keyword + vector), enabling fuzzy matching, relevance ranking, typo tolerance, and semantic search over Linked Data.
Context
Applications like the NDE Dataset Register browser currently search RDF data via SPARQL CONTAINS() — substring matching with no relevance ranking, typo tolerance, or semantic understanding. A dedicated search engine would significantly improve search quality.
Approach
RDF-to-search-index pipeline
- Accept RDF triples as input (e.g. N3.Store); the caller is responsible for fetching data (e.g. via SPARQL CONSTRUCT)
- Transform using JSON-LD Framing (W3C standard) to reshape the RDF graph into deterministic JSON documents, the same pattern already used in @lde/docgen
- Post-process the framed output: flatten language maps to per-language fields (title: { nl, en } → title_nl, title_en), use @id (the resource's URI) as the Typesense document id for upserts, and flatten nested structures
- Index documents into Typesense
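The post-processing step could be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the type and function names (Json, flattenFramedNode, toSearchDocument) are hypothetical, and it assumes the framed output uses language maps and nested objects as described above.

```typescript
// Hypothetical helper types; none of these names come from an existing package.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isJsonObject(value: Json): value is JsonObject {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Recursively flattens nested objects into underscore-joined keys.
// A language map such as title: { nl, en } is just a nested object, so it
// becomes title_nl and title_en. A leading "@" is dropped, so "@id" becomes "id".
function flattenFramedNode(
  node: JsonObject,
  prefix = '',
  flat: JsonObject = {},
): JsonObject {
  for (const key of Object.keys(node)) {
    if (key === '@context') continue; // not part of the searchable document
    const value = node[key];
    const name = prefix + key.replace(/^@/, '');
    if (isJsonObject(value)) {
      flattenFramedNode(value, name + '_', flat);
    } else {
      flat[name] = value; // scalars and arrays are copied unchanged
    }
  }
  return flat;
}

// Flattens one framed node and checks that it has a usable document id.
function toSearchDocument(node: JsonObject): JsonObject {
  const flat = flattenFramedNode(node);
  if (typeof flat['id'] !== 'string') {
    throw new Error('Framed node has no @id to use as the document id');
  }
  return flat;
}
```

Arrays of nested objects are left untouched in this sketch; how those should be flattened is a design decision the package would still need to make.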
Querying
- Hybrid search combining keyword search with vector similarity (using a multilingual embedding model like paraphrase-multilingual-MiniLM-L12-v2)
- Faceted search with counts (maps directly to Typesense's facet_by)
- Per-language field weighting to boost the user's preferred language
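As an illustration of per-language weighting and faceting, the sketch below builds a plain object with the Typesense search parameters q, query_by, query_by_weights and facet_by. The function name, the option names and the weights (2 for the preferred language, 1 otherwise) are assumptions made for the example. Hybrid search in Typesense is enabled by also listing an embedding field in query_by; that is left out here because how weights interact with such a field should be checked against the Typesense documentation.

```typescript
// Subset of Typesense search parameters used in this sketch.
interface SearchParams {
  q: string;
  query_by: string;
  query_by_weights: string;
  facet_by: string;
}

// Hypothetical options; field names follow the title_nl / title_en convention.
interface SearchOptions {
  query: string;
  textFields: string[]; // base names, e.g. ['title', 'description']
  languages: string[]; // indexed languages, e.g. ['nl', 'en']
  preferredLanguage: string;
  facets: string[];
}

// Expands each text field per language and gives fields in the user's
// preferred language a higher weight (2 versus 1, chosen arbitrarily).
function buildSearchParams(options: SearchOptions): SearchParams {
  const fields: string[] = [];
  const weights: number[] = [];
  for (const field of options.textFields) {
    for (const language of options.languages) {
      fields.push(field + '_' + language);
      weights.push(language === options.preferredLanguage ? 2 : 1);
    }
  }
  return {
    q: options.query,
    query_by: fields.join(','),
    query_by_weights: weights.join(','),
    facet_by: options.facets.join(','),
  };
}
```

The resulting object could be passed to the search call of whichever Typesense client the package ends up using.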
Package structure
A single new package @lde/search-typesense covering:
- Collection schema definition shared between indexing and querying
- Indexer: takes RDF triples (e.g. N3.Store) + a JSON-LD frame + Typesense connection → frames, post-processes, and indexes documents. SPARQL fetching is the caller's responsibility, keeping the package focused and composable.
- Document management with two sync strategies:
  - Full reindex via collection alias swap (recommended to start): create new timestamped collection → index all documents → swap collection alias → drop old collection. Zero-downtime, always a clean slate, no stale documents.
  - Incremental upsert/delete: upsert by URI (@id → Typesense id) when a dataset is updated, delete by URI when removed. Useful if freshness requirements increase later.
- Searcher: typed query interface with facet support
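The full-reindex strategy from the list above can be sketched against a small backend interface. SearchBackend and fullReindex are hypothetical names invented for this example; a real implementation would wrap the Typesense client behind such an interface, and would also need to pass the collection schema when creating the collection.

```typescript
// Hypothetical minimal backend; not the Typesense client API.
interface SearchBackend {
  createCollection(name: string): Promise<void>;
  importDocuments(collection: string, documents: object[]): Promise<void>;
  getAliasTarget(alias: string): Promise<string | undefined>;
  setAlias(alias: string, collection: string): Promise<void>;
  dropCollection(name: string): Promise<void>;
}

// Create a timestamped collection, index everything, repoint the alias,
// then drop the collection the alias pointed to before.
async function fullReindex(
  backend: SearchBackend,
  alias: string,
  documents: object[],
  now: Date = new Date(),
): Promise<string> {
  const collection = alias + '_' + now.getTime();
  const previous = await backend.getAliasTarget(alias);

  await backend.createCollection(collection);
  await backend.importDocuments(collection, documents);

  // Queries keep hitting the old collection until this call succeeds.
  await backend.setAlias(alias, collection);

  if (previous !== undefined && previous !== collection) {
    await backend.dropCollection(previous);
  }
  return collection;
}
```

Because the alias is only repointed after the import has finished, a failed import leaves the old collection serving queries; cleaning up the half-built collection in that case is not handled in this sketch.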
The JSON-LD Framing step itself doesn't need a separate package — it's a thin call to jsonld.frame() (same as in @lde/docgen). Callers provide their own frame definition (project-specific) and the package handles the Typesense-specific parts (field flattening, language map expansion, indexing, querying).
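For the Typesense-specific side, the shared collection schema could be derived from the same field and language lists that drive language map expansion. The sketch below is an assumption about how that might look: buildCollectionSchema and its options are hypothetical, the field types are simplified to strings, and the embedding model identifier is supplied by the caller because the exact name depends on what the Typesense deployment supports. The embed.from and model_config.model_name keys follow Typesense's auto-embedding field format.

```typescript
// Simplified subset of a Typesense collection schema.
interface FieldSchema {
  name: string;
  type: string;
  facet?: boolean;
  optional?: boolean;
  embed?: { from: string[]; model_config: { model_name: string } };
}

interface CollectionSchema {
  name: string;
  fields: FieldSchema[];
}

// Hypothetical options for this sketch.
interface SchemaOptions {
  name: string;
  textFields: string[]; // base names, expanded per language
  facetFields: string[]; // already-flattened field names
  languages: string[];
  embeddingModel: string; // caller-supplied model identifier (assumption)
}

function buildCollectionSchema(options: SchemaOptions): CollectionSchema {
  const fields: FieldSchema[] = [];
  const localized: string[] = [];

  // One optional string field per text field and language, e.g. title_nl.
  for (const field of options.textFields) {
    for (const language of options.languages) {
      const name = field + '_' + language;
      localized.push(name);
      fields.push({ name: name, type: 'string', optional: true });
    }
  }
  for (const name of options.facetFields) {
    fields.push({ name: name, type: 'string', facet: true, optional: true });
  }
  // Vector field that Typesense fills by embedding the localized text fields.
  fields.push({
    name: 'embedding',
    type: 'float[]',
    embed: { from: localized, model_config: { model_name: options.embeddingModel } },
  });
  return { name: options.name, fields: fields };
}
```

Sharing one such function between the indexer and the searcher would keep field names like title_nl consistent on both sides.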
Alternatively: multiple packages
If reuse across search engines becomes a goal, the transformation layer (RDF → JSON-LD Framing → flat documents) could be split into a separate @lde/rdf-to-json package. But this seems premature: start with one package and extract if needed.
Relates to
@lde/docgen — already uses JSON-LD Framing for RDF → JSON transformation