feat: optional coordinate-based NUTS lookup using local Eurostat polygons #82

@bk86a

Context

Today the service maps country + postal_code → NUTS via the curated TERCET table + estimates fallback. For inputs richer than a bare postal code (a full address, a lat/lon from another system), callers have no path through this service — they'd have to resolve the postal code first, which loses the accuracy already in their input.

Eurostat publishes NUTS-region polygons at GISCO under an open license: ref-nuts-2024-01m.geojson.zip (169 MB, 1:1m scale; smaller scales 03m/10m/20m/60m also available). These cover NUTS 0/1/2/3 for the entire reference set exhaustively (no gaps). Point-in-polygon against ~2k NUTS3 shapes using shapely + STRtree is sub-millisecond after one-time index build.

Proposed extension

Add a new endpoint that accepts caller-supplied coordinates and returns the NUTS hierarchy via local point-in-polygon:

GET /lookup-by-coordinates?lat=50.85&lon=4.35

Response shape mirrors /lookup:

{
  "lat": 50.85, "lon": 4.35,
  "match_type": "polygon",
  "nuts1": "BE1",  "nuts1_name": "Région de Bruxelles-Capitale",
  "nuts2": "BE10", "nuts2_name": "Région de Bruxelles-Capitale",
  "nuts3": "BE100","nuts3_name": "Arr. de Bruxelles-Capitale"
}
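Because NUTS codes are hierarchical (each level's code is a prefix of the level below it), the payload above could be assembled from the matched NUTS3 code alone. A minimal sketch, assuming a hypothetical `build_response` helper and a hypothetical `names` dict mapping NUTS code to region name; neither exists in the service today:

```python
def build_response(lat, lon, nuts3, names):
    """Assemble the /lookup-by-coordinates payload from a matched NUTS3 code.

    `names` is a hypothetical dict of NUTS code -> region name.
    """
    if len(nuts3) != 5:
        raise ValueError(f"expected a 5-character NUTS3 code, got {nuts3!r}")
    response = {"lat": lat, "lon": lon, "match_type": "polygon"}
    # NUTS1 is the first 3 characters, NUTS2 the first 4, NUTS3 all 5.
    for level, length in (("nuts1", 3), ("nuts2", 4), ("nuts3", 5)):
        code = nuts3[:length]
        response[level] = code
        response[f"{level}_name"] = names.get(code)  # None if the name is unknown
    return response
```

Deriving the parent codes by prefix means the in-memory index only needs NUTS3 shapes; whether the names come from the GeoJSON properties or from the existing TERCET data is left open here.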

The service downloads + caches the GeoJSON at startup (analogous to the GISCO TERCET cache at /app/data), builds an STRtree once, and serves PIP queries from memory. Polygon scale (01m vs 03m) is a config knob trading footprint for boundary precision; default 01m.
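The build-once, query-from-memory flow can be sketched without dependencies. This is an illustration only: it swaps shapely + STRtree for a linear bounding-box prefilter followed by a ray-casting test, handles a single exterior ring per region (no holes, no multipolygons), and the names `Region`, `RegionIndex`, and `point_in_ring` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    nuts3: str
    ring: tuple  # exterior ring as ((lon, lat), ...); holes are ignored in this sketch

    @property
    def bbox(self):
        lons = [p[0] for p in self.ring]
        lats = [p[1] for p in self.ring]
        return min(lons), min(lats), max(lons), max(lats)


def point_in_ring(lon, lat, ring):
    """Ray casting: count how many edges a ray heading east from the point crosses."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


class RegionIndex:
    def __init__(self, regions):
        # Built once at startup; bounding boxes are precomputed for the prefilter.
        self._entries = [(r.bbox, r) for r in regions]

    def lookup(self, lat, lon):
        """Return the NUTS3 code containing the point, or None if no region matches."""
        for (min_lon, min_lat, max_lon, max_lat), region in self._entries:
            if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                if point_in_ring(lon, lat, region.ring):
                    return region.nuts3
        return None
```

A real implementation would replace the linear scan with the STRtree query and the ring test with shapely's containment predicates; the two-stage shape (cheap bounding-box filter, then exact test) stays the same.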

What this is NOT

This issue does not propose adding a geocoder. Address → coordinates is a separate concern with its own licensing, latency, and PII trade-offs. Callers that have an address bring their own geocoder (Google, Mapbox, Nominatim, Pelias, or an in-house one) and pass the resulting lat/lon here. This keeps the service in its lane: a data lookup, no third-party API on the hot path, no PII handling.

Relationship to #45

Issue #45 explored a related direction (geocoding postal codes via Nominatim/Zippopotam + GISCO coord2nuts) and was closed 2026-05-01 because postal-code → coordinate coverage was ~6-7% and biased rural — not enough signal to justify the operational cost.

This proposal sidesteps that failure mode in two ways:

  1. The coordinate input is caller-supplied, not derived from the postal code via a low-coverage geocoder.
  2. The polygon lookup is local (downloaded GISCO GeoJSON), avoiding the GISCO coord2nuts web service's latency and rate limits.

So this complements rather than revisits #45.

Open questions

  • Polygon scale: 01m (169 MB, the most detailed scale) vs 03m (~50 MB) vs 10m. The smaller scales lose accuracy near borders, and boundary postal codes are exactly the cases where TERCET and polygon PIP are most likely to disagree. Recommendation: 01m by default, configurable.
  • Memory footprint after the STRtree build: needs to be measured and budgeted against the current ~400 MB RSS.
  • Should /lookup accept optional lat/lon as a fallback when postal_code is missing from TERCET, or keep this strictly on a new endpoint?
  • License attribution: GISCO requires source attribution; add it to the metadata returned by the / root endpoint and to the README.
  • Coverage delta vs. TERCET: are there NUTS regions where TERCET has a postal code but the polygon set has no matching shape (or vice versa)? Validation pass needed before shipping.
  • Should the candidate-set be restricted to NUTS 2024 (matching TERCET) or downloaded per-version to support nuts_version switching?
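For the scale and version questions above, one option is to resolve the bundle name from configuration. A rough sketch, assuming other NUTS versions follow the same naming pattern as the 2024 file named earlier; the environment variable names and both helper functions are hypothetical:

```python
import os

# Scales listed in this issue for the GISCO NUTS GeoJSON bundles.
SUPPORTED_SCALES = ("01m", "03m", "10m", "20m", "60m")


def bundle_filename(nuts_version="2024", scale="01m"):
    """Return a bundle name of the form ref-nuts-<version>-<scale>.geojson.zip."""
    if scale not in SUPPORTED_SCALES:
        raise ValueError(f"unsupported scale {scale!r}; expected one of {SUPPORTED_SCALES}")
    return f"ref-nuts-{nuts_version}-{scale}.geojson.zip"


def bundle_from_env(environ=os.environ):
    # NUTS_POLYGON_VERSION and NUTS_POLYGON_SCALE are hypothetical variable names.
    return bundle_filename(
        environ.get("NUTS_POLYGON_VERSION", "2024"),
        environ.get("NUTS_POLYGON_SCALE", "01m"),
    )
```

With no configuration set, this resolves to ref-nuts-2024-01m.geojson.zip, matching the proposed default.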

Out of scope

  • Address-string input (would require a geocoder and is a separate service). If this gets traction, "address-string in" is a follow-up issue, not part of this one.
  • Replacing TERCET. Postal-code lookup remains primary. This is an alternate input route for callers that already have coordinates.
