
[FEATURE](pyspark) PySpark validation generated from the Pydantic schema #518

Draft

Seth Fitzsimmons (sethfitz) wants to merge 7 commits into dev from pyspark-codegen

Conversation

Seth Fitzsimmons (sethfitz, Collaborator) commented May 11, 2026

Closes #517.

Summary

Adds a new runtime package (overture-schema-pyspark) plus a new output target in overture-schema-codegen that emits PySpark validation expressions and conformance tests from the same Pydantic models that define the schema. Ships an overture-validate CLI for running the validation against Parquet on disk or in S3.

PySpark plugs in as a peer of the existing Markdown output target: same FeatureSpec extraction, same four-layer architecture (Discovery -> Extraction -> Output Layout -> Rendering), new pipeline module. See packages/overture-schema-codegen/docs/design.md for the full picture; the "PySpark Pipeline" section there covers the new stages in detail.

What's in the PR

packages/overture-schema-pyspark/ -- runtime. Public API in validate.py (validate_feature, explain_errors), schema comparison in schema_check.py, dataclasses in check.py, the overture-validate CLI in cli.py, and shared expression building blocks in expressions/{constraint_expressions,column_patterns,_schema_structs}.py. The per-feature expression modules under expressions/generated/overture/schema/<theme>/<feature>.py and per-feature conformance tests under tests/generated/overture/schema/<theme>/test_<feature>.py are emitted by codegen and confined to a generated/ boundary that make generate-pyspark wipes and recreates. _registry.py walks that tree at import time and exposes REGISTRY: dict[str, FeatureValidation] keyed by feature type name.
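
For orientation, a minimal usage sketch of the runtime surface against a local Parquet file. The call shapes of validate_feature and explain_errors below are assumptions, not the authoritative API; see validate.py for the actual signatures.

from pyspark.sql import SparkSession
from overture.schema.pyspark.validate import validate_feature, explain_errors

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("samples/segment.parquet")

# Hypothetical call shapes -- see validate.py for the real signatures.
results = validate_feature(df, "segment")   # per-row check results
violations = explain_errors(results)        # one row per violation
violations.show(20, truncate=False)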

packages/overture-schema-codegen/src/overture/schema/codegen/pyspark/ -- new output target. Pipeline stages:

FeatureSpec (from extraction)
    |
constraint_dispatch.py    constraints -> ExpressionDescriptor / ModelConstraintDescriptor
    |
check_builder.py          FieldSpec -> CheckNode IR (resolves array nesting, variant gating)
schema_builder.py         FieldSpec -> SchemaField list (StructType source)
test_data/                FeatureSpec -> BASE_ROW, scaffold, invalid_value
    |
renderer.py               CheckNode IR -> per-feature expression module
test_renderer.py          CheckNode IR -> per-feature conformance test module
    |
pipeline.py               orchestrates, returns GeneratedModule list

make generate-pyspark wipes both generated/ trees and recreates them; make check gates on regeneration being current.
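
To make the generated/ boundary concrete, a generated expression module presumably exports a FeatureValidation built from Check entries (the dataclasses in check.py). The constant name, constructor fields, and the single check below are illustrative guesses, not the actual generated output.

# Illustrative shape of expressions/generated/overture/schema/transportation/segment.py
from pyspark.sql import functions as F
from overture.schema.pyspark.check import Check, FeatureValidation

VALIDATION = FeatureValidation(          # field names are hypothetical
    feature_type="segment",
    checks=[
        Check(
            name="subtype_not_null",
            field="subtype",
            expression=F.col("subtype").isNotNull(),
        ),
    ],
)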

What's covered

Every constraint Pydantic enforces today is dispatched to a PySpark expression:

  • Field constraints: Ge / Gt / Le / Lt / Interval, MinLen / MaxLen (both array and string variants), StrippedConstraint, PatternConstraint, UniqueItemsConstraint, GeometryTypeConstraint, JsonPointerConstraint.
  • NewType overrides: CountryCodeAlpha2, RegionCode, LinearlyReferencedRange (length / bounds / order).
  • Base-type overrides: HttpUrl (format + length), EmailStr, BBox (completeness, lat ordering, lat range).
  • Model constraints: RequireAnyOfConstraint, RadioGroupConstraint, RequireIfConstraint, ForbidIfConstraint, MinFieldsSetConstraint. NoExtraFieldsConstraint is intentionally skipped.

Nested arrays, structs inside arrays, variant-gated fields (discriminated unions), and nested unions (a union field within a union member) all translate into matching array_check / nested_array_check chains with discriminator gating. segment is the canonical hard case -- it produces three test files (test_segment_road.py, test_segment_rail.py, test_segment_water.py), one per arm.
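
To illustrate the semantics rather than the generated code: a variant-gated array check in plain PySpark combines a discriminator predicate with a per-element predicate. The field names follow the segment example, but the specific constraint shown is made up.

from pyspark.sql import functions as F

# Gate the check on the discriminator (subtype == "road"); non-road arms and
# null arrays pass vacuously, mirroring optional-field semantics.
gated_check = F.when(
    F.col("subtype") == "road",
    F.col("speed_limits").isNull()
    | F.forall(F.col("speed_limits"), lambda sl: sl["max_speed"]["value"] > 0),
).otherwise(F.lit(True))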

Known semantic gaps

Two documented divergences from Pydantic, both with xfail'd conformance tests:

  • UniqueItemsConstraint uses Spark's array_distinct, which compares whole elements with structural equality on raw stored values. Pydantic compares normalized Python objects -- e.g., list[HttpUrl] is compared after URL normalization. The PySpark check catches exact duplicates only.
  • require_any_of checks isNotNull as a proxy for Pydantic's model_fields_set. Parquet has no equivalent of "explicitly provided"; isNotNull is stricter (it rejects fields explicitly set to null).
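
Both gaps are easiest to see as plain PySpark expressions (column names illustrative):

from pyspark.sql import functions as F

# Uniqueness via array_distinct: compares raw stored values, so two strings
# that Pydantic would normalize to the same URL still count as distinct here.
unique_items_ok = F.size(F.col("sources")) == F.size(F.array_distinct(F.col("sources")))

# require_any_of via isNotNull: a row whose listed fields are all explicitly
# set to null fails here, even though Pydantic's model_fields_set check passes.
require_any_of_ok = F.col("names").isNotNull() | F.col("categories").isNotNull()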

CLI

overture-validate <feature-type> <parquet-or-directory> [options]

Output is one row per violation: feature ID, theme/type, failing field, check name, message, offending value. Single-pass evaluation -- one DataFrame, one Spark job. Switches: --count-only, --head N, --suppress FIELD[:CHECK], --skip-schema-check, --ignore-columns, --skip-extra-columns, --conf KEY=VALUE.

Testing

make check runs the full suite including generated conformance tests. The conformance tests are the gate: when codegen changes produce different expressions, regenerated tests fail until expectations are also regenerated, so the two surfaces cannot silently drift.

Beyond the unit and conformance tests:

# Local Parquet
overture-validate segment samples/segment.parquet --count-only

# Real release prefix (expect bbox Float-vs-Double mismatch -> use --skip-schema-check)
overture-validate place s3a://overturemaps-us-west-2/release/<release>/ --skip-schema-check --head 50

Notes for review

  • This PR is intentionally large because the generated tree is large. The interesting surface is small: pyspark/{constraint_dispatch,check_builder,schema_builder,renderer,test_renderer,pipeline}.py plus the runtime in overture-schema-pyspark/src/overture/schema/pyspark/. Everything under generated/ is regenerable output -- review the codegen, not the output.
  • A handful of supporting commits (testmon, VehicleSelectorBase extraction, list_anchor_depth on ConstraintSource, Java 17 CI pin) were prerequisites for this work and are included here rather than split out.
  • Marked draft; may force-push as cleanup continues.

pytest-testmon tracks which tests cover which source files and skips
unaffected tests on subsequent runs. Activated via a TESTMON Makefile
variable so the default `make check` uses incremental selection while
`make check TESTMON=` runs the full suite.

Lock the dependency in the dev group, gitignore the local cache file,
and thread $(TESTMON) through the test, test-all, and test-only
targets.

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Pull the shared `dimension` and `comparison` fields of the five vehicle
selector subtypes into a `VehicleSelectorBase` parent, and thread
`discriminator="dimension"` through the `VehicleSelector` annotated
union.

The discriminator turns the union into a Pydantic discriminated union,
so it serializes as JSON Schema's `oneOf` + `discriminator` rather than
`anyOf`. Regenerated segment_baseline_schema.json captures the new
shape.

This is a prerequisite for downstream tooling that walks discriminated
unions structurally (e.g. PySpark codegen for segment's nested vehicle
scoping).
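
A minimal sketch of the resulting shape in Pydantic v2 terms. The subtype names and payload fields are placeholders; each subtype narrows dimension to a Literal so the discriminator can resolve.

    from typing import Annotated, Literal, Union
    from pydantic import BaseModel, Field

    class VehicleSelectorBase(BaseModel):
        dimension: str      # narrowed to a Literal in each subtype
        comparison: str     # placeholder type

    class WeightSelector(VehicleSelectorBase):
        dimension: Literal["weight"]
        value: float        # placeholder payload

    class AxleCountSelector(VehicleSelectorBase):
        dimension: Literal["axle_count"]
        value: int          # placeholder payload

    # discriminator="dimension" makes this a discriminated union, which
    # serializes as JSON Schema oneOf + discriminator rather than anyOf.
    VehicleSelector = Annotated[
        Union[WeightSelector, AxleCountSelector],
        Field(discriminator="dimension"),
    ]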

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

ConstraintSource now carries list_anchor_depth -- the number of
list[...] layers between the field's outermost wrapper and the layer
where the constraint was declared. _UnwrapState.add_constraint
populates it from the unwrapper's current list_depth, so a constraint
attached to the inner layer of list[Annotated[list[T], MinLen(1)]]
is distinguishable from one declared at the outer wrapper instead of
collapsing into an identical descriptor.

Field-level metadata surfaced by Pydantic is anchored at depth 0; a
comment in _merge_field_metadata records this invariant.

The default of 0 keeps existing consumers unaffected. Downstream
codegen can dispatch on the residual depth (ti.list_depth -
cs.list_anchor_depth) to tell stacked list and string constraints apart.
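
A small illustration of the ambiguity this resolves, using annotated_types-style MinLen markers (the class is just a container for the annotations):

    from typing import Annotated
    from annotated_types import MinLen

    class Example:
        # MinLen anchored one list layer in (list_anchor_depth=1): each inner
        # list must be non-empty.
        inner: list[Annotated[list[str], MinLen(1)]]

        # MinLen anchored at the outer wrapper (list_anchor_depth=0): the
        # outer list itself must be non-empty.
        outer: Annotated[list[list[str]], MinLen(1)]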

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Replace the Tonga-based Division/DivisionArea/DivisionBoundary
fixtures with Kauaʻi County samples that exercise admin_level,
capital_division_ids, wikidata, and source license alongside the
existing fields.

Replace the Tonga-based Connector/Segment fixtures with a Vermooten
Street junction in Pretoria that exercises access_restrictions with
when.vehicle, speed_limits with when.heading, routes with ref,
road_surface, and multi-source attribution.

Reformat the TOML with 4-space indents and sorted keys to match
sibling theme packages.

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Introduce overture-schema-pyspark, a runtime PySpark validation
package whose per-feature expression modules and conformance tests
are generated from the same Pydantic models that define the schema,
along with an `overture-validate` CLI.

Runtime (overture-schema-pyspark/src/overture/schema/pyspark/):

- check.py — Check, CheckShape, FeatureValidation dataclasses.
- schema_check.py — write-first comparison of Spark schemas against
  an expected StructType, with structural type matching and
  SchemaMismatch reporting.
- validate.py — public API: validate_feature(), evaluate_checks(),
  explain_errors(). The explain stage UNPIVOTs per-row check results
  into one row per violation (see the sketch after this list),
  preserving all input columns for downstream join-back.
- cli.py — `overture-validate <parquet-or-directory>` runs the
  validation pipeline against a path of GeoParquet files. Output is
  one row per violation: feature ID, theme/type, failing field,
  check name, offending value. Single-pass evaluation keeps memory
  bounded for arbitrarily large inputs.
- expressions/ — shared runtime utilities (constraint_expressions,
  column_patterns, _schema_structs). Per-feature expression modules
  live under expressions/overture/ and are added by the codegen in
  a follow-up commit.
- tests/_support/ — conformance test infrastructure (scenarios,
  harness, helpers, mutations). The harness builds one DataFrame
  per feature, applies all scenarios as deterministic-UUID-tagged
  rows, runs validation once, and indexes violations back to
  scenario IDs — O(checks) rather than O(checks * scenarios).
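
A minimal sketch of that unpivot step, assuming `checked` is the intermediate DataFrame carrying one boolean column per check; the id and check column names are illustrative:

    from pyspark.sql import functions as F

    # One output row per failing check per feature.
    violations = (
        checked.unpivot(
            ids=["id", "theme", "type"],
            values=["subtype_not_null", "names_min_len"],
            variableColumnName="check_name",
            valueColumnName="passed",
        )
        .filter(F.col("passed") == F.lit(False))
    )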

CLI filtering options:

  --theme <theme>           limit to one theme
  --feature <feature>       limit to one feature type
  --skip-schema-check       run only constraint checks (no schema
                            comparison)
  --count-only              print violation counts per check rather
                            than rows
  --suppress <key>          suppress specific (feature, field, check)
                            triples per a YAML config

Codegen pipeline (overture-schema-codegen/src/.../pyspark/):

    FeatureSpec
        |
    constraint_dispatch.py   map constraints to descriptors
        |
    check_builder.py         walk FieldSpec -> CheckNode IR;
                             resolve array nesting, variant gating
        |
    schema_builder.py        FieldSpec -> SchemaField list
                             (StructType source)
        |
    renderer.py              CheckNode -> per-feature expression
                             module
    test_renderer.py         CheckNode -> per-feature conformance
                             test module
    synthetic.py             FeatureSpec -> BASE_ROW + invalid values
        |
    pipeline.py              orchestrate, return GeneratedModule list

The dispatch tables map every supported constraint (Ge/Gt/Le/Lt/Interval,
MinLen/MaxLen, StrippedConstraint, PatternConstraint, UniqueItemsConstraint,
GeometryTypeConstraint, JsonPointerConstraint, RequireAnyOfConstraint,
RadioGroupConstraint, RequireIfConstraint, ForbidIfConstraint,
MinFieldsSetConstraint), NewType (CountryCodeAlpha2,
LinearlyReferencedRange, RegionCode), and base type (HttpUrl, EmailStr)
to constraint_expressions check functions.
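
The dispatch is naturally a mapping from constraint type to an expression builder. The table name and builder shapes below are illustrative, not the actual constraint_dispatch.py contents:

    from typing import Any, Callable
    from annotated_types import Ge, MinLen
    from pyspark.sql import Column
    from pyspark.sql import functions as F

    # Constraint type -> builder turning (column, constraint) into a boolean
    # Column. The array variant of MinLen is shown; nulls pass so that
    # optionality is handled separately.
    FIELD_CONSTRAINT_BUILDERS: dict[type, Callable[[Column, Any], Column]] = {
        Ge: lambda col, c: col.isNull() | (col >= c.ge),
        MinLen: lambda col, c: col.isNull() | (F.size(col) >= c.min_length),
    }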

Discriminated unions (segment is the canonical hard case) split
into per-arm test files. The codegen handles arm splitting via
generate_arm_rows in synthetic.py and _filter_field_nodes_for_arm
in test_renderer.py.

Cross-package touch-ups:

- transportation models: minor tweak.

The Makefile gains a `generate-pyspark` target and gates `check`
on it so a stale generation surfaces immediately. The CLI is exposed
as a `[project.scripts]` entry point so `overture-validate`
becomes available after `pip install` / `uv sync`.

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Generate PySpark expressions (and tests) for models defined in the workspace.

PySpark 3.4 (the declared floor) doesn't run on Java 21, the default
JDK on ubuntu-latest runners -- it hits NoSuchMethodException on
java.nio.DirectByteBuffer.<init>(long, int), removed in JDK 21. Pin
the lowest-direct cell to Java 17 so the resolved pyspark==3.4.0 can
actually start. The default cell (which resolves to a current pyspark
4.x) keeps the runner's default Java 21.

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
@github-actions

🗺️ Schema reference docs preview is live!

🌍 Preview https://staging.overturemaps.org/schema/pr/518/schema/index.html
🕐 Updated May 11, 2026 23:12 UTC
📝 Commit b23c092
🔧 env SCHEMA_PREVIEW true

Note

♻️ This preview updates automatically with each push to this PR.


Development

Successfully merging this pull request may close these issues.

Validate Overture data on Spark against the Pydantic schema
