feat(pipeline-shacl-sampler): treat http and https schema.org as equivalent #398
Open
ddeboer wants to merge 4 commits into
…hema.org/ as equivalent

Schema.org publishes the same vocabulary under both `http://schema.org/` and `https://schema.org/`. SHACL shapes can only declare one as the `sh:targetClass` namespace, so the sampler would previously skip every resource typed under the other form and the validator would report vacuously conformant results. This was observed in the wild on ldmax.nl WO2 datasets (125k+ schema:CreativeWork instances under HTTP, zero violations reported).

- Add `namespaceAliases` option (default: one HTTPS/HTTP schema.org pair) that broadens the subject-selection SELECT to `?s a ?type . FILTER(?type IN (<canonical>, <alias>))`.
- Wrap the configured validator so alias-namespace IRIs in the sampled buffer are rewritten to the canonical namespace before SHACL evaluates them, allowing canonical-namespace `sh:targetClass` / `sh:path` patterns to match. Quads with no alias IRI pass through by reference.

nx sync collateral: stale `local-sparql-endpoint` project references removed from `dataset-registry-client` and `pipeline` tsconfig.lib.json.
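The rewrite step described in this commit can be illustrated with a minimal sketch. It is not the code from the PR: the `Term` and `Quad` types are simplified stand-ins for RDF/JS terms, and `rewriteTerm` and `normalizeQuadSketch` are hypothetical names chosen for illustration. The sketch assumes that an alias match is a plain string-prefix match on the IRI.

```typescript
// Hypothetical, simplified stand-ins for RDF/JS terms (assumption, not the PR's types).
interface NamespaceAlias { canonical: string; alias: string }
interface Term {
  termType: 'NamedNode' | 'Literal' | 'BlankNode' | 'DefaultGraph';
  value: string;
}
interface Quad { subject: Term; predicate: Term; object: Term; graph: Term }

// Return a canonical-namespace IRI for an alias-namespace IRI.
// Any other term is returned as the very same object.
function rewriteTerm(term: Term, aliases: NamespaceAlias[]): Term {
  if (term.termType !== 'NamedNode') return term;
  for (const { canonical, alias } of aliases) {
    if (term.value.startsWith(alias)) {
      return { termType: 'NamedNode', value: canonical + term.value.slice(alias.length) };
    }
  }
  return term;
}

// Rewrite all four positions; if none changed, hand back the original quad
// so untouched quads pass through by reference without a copy.
function normalizeQuadSketch(quad: Quad, aliases: NamespaceAlias[]): Quad {
  const subject = rewriteTerm(quad.subject, aliases);
  const predicate = rewriteTerm(quad.predicate, aliases);
  const object = rewriteTerm(quad.object, aliases);
  const graph = rewriteTerm(quad.graph, aliases);
  const unchanged =
    subject === quad.subject &&
    predicate === quad.predicate &&
    object === quad.object &&
    graph === quad.graph;
  return unchanged ? quad : { subject, predicate, object, graph };
}
```

With the pair `{ canonical: 'https://schema.org/', alias: 'http://schema.org/' }`, a quad whose object is `http://schema.org/CreativeWork` comes back with the HTTPS form of that IRI, while literals and quads that contain no alias IRI are returned unchanged.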
…ead of the schema.org pair

Don't ship a built-in schema.org alias: callers opt in explicitly when they need it. Keeps the sampler vocabulary-neutral by default; the schema.org HTTP/HTTPS workaround stays one example in the README rather than a default that surprises callers using a single-namespace dataset.
- Convert `buildSubjectSelectorQuery` to a single options-object argument; tests no longer pass `undefined` placeholders to reach later positional params.
- Collapse the two near-identical branches in `expandTargetClass` into one.
- Drop unnecessary `as Quad['…']` casts in `normalizeQuad`; `NamedNode` is assignable to all four term positions.
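The options-object conversion mentioned above can be sketched as follows. This is an illustration under assumptions, not the PR's `buildSubjectSelectorQuery`: the function name, the `typeIris` and `limit` fields, and the exact query shape are hypothetical, and the sketch takes an already-expanded list of type IRIs as input.

```typescript
// Hypothetical options shape (assumption): the real function's fields may differ.
interface SubjectSelectorOptions {
  typeIris: string[]; // canonical IRI plus any alias forms, already expanded
  limit?: number;     // optional; callers simply omit it
}

// Build a SELECT that matches subjects typed under any of the given IRIs.
function buildSubjectSelectorQuerySketch({ typeIris, limit }: SubjectSelectorOptions): string {
  const iris = typeIris.map((iri) => `<${iri}>`).join(', ');
  const limitClause = limit === undefined ? '' : ` LIMIT ${limit}`;
  return `SELECT DISTINCT ?s WHERE { ?s a ?type . FILTER(?type IN (${iris})) }${limitClause}`;
}
```

A caller that only needs the type filter passes `{ typeIris: [...] }` and leaves `limit` out, which is the situation where a positional signature would force an `undefined` placeholder.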
… in expandTargetClass

The truthiness-chain collapse from the previous cleanup obscured intent and required readers to parse JS falsy semantics. Two early-return `if` blocks read more directly; the small duplication is not worth the cleverness.
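The early-return style this commit argues for might look like the sketch below. The body of the real `expandTargetClass` is not shown in the PR text, so the signature, the return type, and the prefix-matching logic here are all assumptions made for illustration.

```typescript
interface NamespaceAlias { canonical: string; alias: string }

// Hypothetical sketch: return the canonical form first, followed by the alias
// form, for a class IRI that falls under a declared namespace pair.
function expandTargetClassSketch(targetClass: string, aliases: NamespaceAlias[]): string[] {
  for (const { canonical, alias } of aliases) {
    // Shape declares the canonical namespace: add the alias counterpart.
    if (targetClass.startsWith(canonical)) {
      return [targetClass, alias + targetClass.slice(canonical.length)];
    }
    // Shape declares the alias namespace: add the canonical counterpart.
    if (targetClass.startsWith(alias)) {
      return [canonical + targetClass.slice(alias.length), targetClass];
    }
  }
  // No pair applies: the class stands alone.
  return [targetClass];
}
```

The two `if` blocks duplicate a small amount of string handling, but each states its own case directly, which is the trade-off the commit message describes.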
Summary
Add a `namespaceAliases` option to `shaclSampleStages` for vocabularies that publish the same terms under multiple namespaces, most notably schema.org (`http://schema.org/` vs `https://schema.org/`). SHACL shapes can only declare one namespace as the `sh:targetClass`, so without help the sampler skips every resource typed under the other form and the validator reports vacuously conformant results. Observed on the ldmax.nl WO2 collecties dataset: 125k+ `schema:CreativeWork` instances typed under HTTP, validator emitted zero violations against the canonical-HTTPS SCHEMA-AP-NDE shapes.

For every declared `{ canonical, alias }` pair the sampler:

- broadens the subject-selection SELECT to `?s a ?type . FILTER(?type IN (<canonical/T>, <alias/T>))` so instances typed under either namespace are picked up;
- wraps the configured `validator` so alias-namespace IRIs in the sampled buffer (subject, predicate, object, graph) are rewritten to the canonical form before SHACL evaluates them, allowing canonical-namespace `sh:targetClass` / `sh:path` patterns to match. Quads with no alias IRI pass through by reference (no copy).

Defaults to `[]`: the sampler is vocabulary-neutral. Callers opt in explicitly when needed (the README documents the schema.org example).

Other changes

`packages/dataset-registry-client/tsconfig.lib.json` and `packages/pipeline/tsconfig.lib.json` lose a stale reference to `local-sparql-endpoint`, collateral from running `nx sync` to typecheck the worktree.