feat(pipeline-shacl-sampler): treat http and https schema.org as equivalent #398

Open: ddeboer wants to merge 4 commits into main from feat/sampler-schema-org-dual-namespace

ddeboer commented May 21, 2026

Summary

Add a namespaceAliases option to shaclSampleStages for vocabularies that publish the same terms under multiple namespaces, most notably schema.org (http://schema.org/ vs https://schema.org/). SHACL shapes declare their sh:targetClass in only one of those namespaces, so without help the sampler skips every resource typed under the other form and the validator reports vacuously conformant results. Observed on the ldmax.nl WO2 collecties dataset: 125k+ schema:CreativeWork instances were typed under HTTP, and the validator emitted zero violations against the canonical-HTTPS SCHEMA-AP-NDE shapes.

For every declared { canonical, alias } pair the sampler:

  • broadens the subject-selection SELECT to ?s a ?type . FILTER(?type IN (<canonical/T>, <alias/T>)) so instances typed under either namespace are picked up;
  • wraps the configured validator so alias-namespace IRIs in the sampled buffer (subject, predicate, object, graph) are rewritten to the canonical form before SHACL evaluates them, allowing canonical-namespace sh:targetClass / sh:path patterns to match. Quads with no alias IRI pass through by reference (no copy).
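The broadened subject selection from the first bullet can be sketched roughly as follows. This is an illustration only, not the code in this PR: `NamespaceAlias`, `expandTargetClassSketch` and `buildSelectSketch` are hypothetical names, and the exact query text the sampler emits is an assumption.

```typescript
// Hypothetical shape of one alias declaration; the field names follow the
// `{ canonical, alias }` wording in the summary above.
interface NamespaceAlias {
  canonical: string; // e.g. 'https://schema.org/'
  alias: string; // e.g. 'http://schema.org/'
}

// Return the target class plus its spelling in every equivalent namespace.
function expandTargetClassSketch(
  targetClass: string,
  aliases: NamespaceAlias[],
): string[] {
  const iris = new Set<string>([targetClass]);
  for (const { canonical, alias } of aliases) {
    if (targetClass.startsWith(canonical)) {
      iris.add(alias + targetClass.slice(canonical.length));
    }
    if (targetClass.startsWith(alias)) {
      iris.add(canonical + targetClass.slice(alias.length));
    }
  }
  return Array.from(iris);
}

// Build a subject-selection query. With no matching alias the query keeps
// a plain type pattern; otherwise it filters on all equivalent class IRIs.
function buildSelectSketch(
  targetClass: string,
  aliases: NamespaceAlias[] = [],
): string {
  const iris = expandTargetClassSketch(targetClass, aliases);
  if (iris.length === 1) {
    return `SELECT DISTINCT ?s WHERE { ?s a <${iris[0]}> }`;
  }
  const list = iris.map((iri) => `<${iri}>`).join(', ');
  return `SELECT DISTINCT ?s WHERE { ?s a ?type . FILTER(?type IN (${list})) }`;
}
```

Calling `buildSelectSketch('https://schema.org/CreativeWork', [{ canonical: 'https://schema.org/', alias: 'http://schema.org/' }])` yields a query whose FILTER lists both the HTTPS and the HTTP class IRI, while an empty alias list leaves the single-class query unchanged.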

Defaults to [], which keeps the sampler vocabulary-neutral. Callers opt in explicitly when needed (the README documents the schema.org example).

Other changes

packages/dataset-registry-client/tsconfig.lib.json and packages/pipeline/tsconfig.lib.json lose a stale reference to local-sparql-endpoint — collateral from running nx sync to typecheck the worktree.

ddeboer added 4 commits May 21, 2026 14:59
…hema.org/ as equivalent

Schema.org publishes the same vocabulary under both `http://schema.org/` and `https://schema.org/`. SHACL shapes can only declare one as the `sh:targetClass` namespace, so the sampler would previously skip every resource typed under the other form and the validator would report vacuously-conformant results — observed in the wild on ldmax.nl WO2 datasets (125k+ schema:CreativeWork instances under HTTP, zero violations reported).

- Add `namespaceAliases` option (default: one HTTPS/HTTP schema.org pair) that broadens the subject-selection SELECT to `?s a ?type . FILTER(?type IN (<canonical>, <alias>))`.

- Wrap the configured validator so alias-namespace IRIs in the sampled buffer are rewritten to the canonical namespace before SHACL evaluates them, allowing canonical-namespace `sh:targetClass` / `sh:path` patterns to match. Quads with no alias IRI pass through by reference.
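A minimal sketch of the rewrite step described in the second bullet, assuming simplified term and quad shapes rather than the RDF/JS types the package actually uses. `Term`, `Quad`, `canonicalizeTerm` and `normalizeQuadSketch` are hypothetical stand-ins; the point is the pass-through-by-reference behaviour for quads that contain no alias IRI.

```typescript
interface NamespaceAlias {
  canonical: string;
  alias: string;
}

// Simplified stand-ins for RDF/JS terms and quads (assumption).
interface Term {
  termType: string;
  value: string;
}
interface Quad {
  subject: Term;
  predicate: Term;
  object: Term;
  graph: Term;
}

// Rewrite a named node from an alias namespace to the canonical one.
// Any other term is returned as the same object.
function canonicalizeTerm(term: Term, aliases: NamespaceAlias[]): Term {
  if (term.termType !== 'NamedNode') return term;
  for (const { canonical, alias } of aliases) {
    if (term.value.startsWith(alias)) {
      return {
        termType: 'NamedNode',
        value: canonical + term.value.slice(alias.length),
      };
    }
  }
  return term;
}

// Normalize all four positions; return the original quad object when
// nothing changed, so unaffected quads are not copied.
function normalizeQuadSketch(quad: Quad, aliases: NamespaceAlias[]): Quad {
  const subject = canonicalizeTerm(quad.subject, aliases);
  const predicate = canonicalizeTerm(quad.predicate, aliases);
  const object = canonicalizeTerm(quad.object, aliases);
  const graph = canonicalizeTerm(quad.graph, aliases);
  const unchanged =
    subject === quad.subject &&
    predicate === quad.predicate &&
    object === quad.object &&
    graph === quad.graph;
  return unchanged ? quad : { subject, predicate, object, graph };
}
```

In this sketch only named nodes are rewritten, so a literal whose text happens to start with `http://schema.org/` is left alone; whether the PR makes the same choice is not stated in the commit message.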

nx sync collateral: stale `local-sparql-endpoint` project references removed from `dataset-registry-client` and `pipeline` tsconfig.lib.json.
…ead of the schema.org pair

Don't ship a built-in schema.org alias — callers opt in explicitly when they need it. Keeps the sampler vocabulary-neutral by default; the schema.org HTTP/HTTPS workaround stays one example in the README rather than a default that surprises callers using a single-namespace dataset.
- Convert buildSubjectSelectorQuery to a single options-object argument; tests no longer pass undefined placeholders to reach later positional params.

- Collapse the two near-identical branches in expandTargetClass into one.

- Drop unnecessary as Quad['…'] casts in normalizeQuad; NamedNode is assignable to all four term positions.
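To illustrate the options-object change in the first bullet of this commit: the sketch below contrasts a positional signature with a single options argument. The real parameters of buildSubjectSelectorQuery are not shown in this PR text, so `targetClass`, `limit` and `offset` are assumed names and both functions are hypothetical.

```typescript
// Hypothetical option names (assumption, not the package's actual options).
interface SelectorOptionsSketch {
  targetClass: string;
  limit?: number;
  offset?: number;
}

// After: one options object; callers name only the fields they set.
function selectorWithOptionsSketch({
  targetClass,
  limit,
  offset,
}: SelectorOptionsSketch): string {
  let query = `SELECT DISTINCT ?s WHERE { ?s a <${targetClass}> }`;
  if (limit !== undefined) query += ` LIMIT ${limit}`;
  if (offset !== undefined) query += ` OFFSET ${offset}`;
  return query;
}

// Before: positional parameters. Reaching `offset` forces the caller to
// pass an `undefined` placeholder for `limit`.
function selectorPositionalSketch(
  targetClass: string,
  limit?: number,
  offset?: number,
): string {
  return selectorWithOptionsSketch({ targetClass, limit, offset });
}

// The same query, requested both ways.
const positionalQuery = selectorPositionalSketch(
  'https://schema.org/CreativeWork',
  undefined,
  10,
);
const namedQuery = selectorWithOptionsSketch({
  targetClass: 'https://schema.org/CreativeWork',
  offset: 10,
});
```

Both calls produce the same query string; the second simply avoids the placeholder argument.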
… in expandTargetClass

The truthiness-chain collapse from the previous cleanup obscured intent and required readers to parse JS falsy semantics. Two early-return if blocks read more directly; the small duplication is not worth the cleverness.