Merged
35 changes: 35 additions & 0 deletions README.md
@@ -223,6 +223,11 @@ Here is its schema:
"registry_agreement_types": ["string"], // Array of agreement types: "base" | "brand" | "community" | "sponsored" | "non_sponsored"
"icann_translation_en": "string", // ICANN's raw English Translation of an IDN label, source-faithful [OPTIONAL - IDN gTLDs only]

// IDN language metadata (derived from tld_script via Unicode CLDR likelySubtags,
// with per-(script, region) and per-TLD overrides where the default is wrong)
"language_code": "string", // BCP-47 code (e.g. "ar", "hi", "zh-Hant-TW") [OPTIONAL - IDN only]
"language_name_en": "string", // English name (e.g. "Arabic", "Hindi", "Chinese (Taiwan)") [OPTIONAL - IDN only]

// AS Org infrastructure operators (resolved against organizations.json)
"as_org_aliases": ["string"], // Canonical DNS provider display_names hosting nameservers (e.g. ["Identity Digital", "VeriSign"])
"as_org_slugs": ["string"], // FKs into organizations.json, parallel to as_org_aliases
@@ -246,6 +251,18 @@ Every TLD is identified by its **A-label** — the ASCII form, including `xn--`

The **U-label** — the rendered Unicode form (e.g. `москва`) — is display-only and appears solely in the `tld_unicode` field, alongside the A-label, never as a key or reference. Consumers that render a name resolve the A-label to `tld_unicode`; they never key on it.

## The typed graph

Alongside `tlds.json`, the build ships four derived reverse-index artifacts that model the root zone as a typed graph of four entity types plus one enum:

- **Domains** — the TLDs themselves (`tlds.json`).
- **Organizations** — registries, governance bodies, and infrastructure operators (`organizations.json`).
- **Places** — countries, dependent territories, subdivisions, cities, and supranational regions (`places.json`).
- **Cultures** — ethno-linguistic communities like the Basques or Welsh (`cultures.json`).
- **Agreement types** — the ICANN registry-agreement enum (`agreements.json`).

Each TLD relates to one or more Organizations through *roles* (Sponsor, Administrative Contact, Technical Contact, and — for gTLDs — ICANN Registry Operator), to zero or more Places (most ccTLDs map to one country; geographic gTLDs map to a city, subdivision, country, or supranational region), to an optional Culture, and to its agreement types. Each derived artifact is a deterministic reverse index of `tlds.json`: delete it and `make build` rebuilds it. Every cross-file relationship is enforced by referential-integrity tests, so a foreign key can never dangle and no record is ever orphaned.
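
The referential-integrity guarantee can be pictured with a minimal sketch. It assumes each TLD record exposes an `as_org_slugs` list and each organization a `slug`, as described in this README; the `tld` key, the function name `find_dangling_org_refs`, and the sample records are hypothetical, and the project's real tests may be organized differently.

```python
def find_dangling_org_refs(tlds, organizations):
    """Return (tld, slug) pairs whose slug has no matching organization."""
    known = {org["slug"] for org in organizations}
    return [
        (record["tld"], slug)  # "tld" is an assumed key name for the A-label
        for record in tlds
        for slug in record.get("as_org_slugs", [])
        if slug not in known
    ]


# Illustrative records only, not real data.
organizations = [{"slug": "identity-digital"}, {"slug": "verisign"}]
tlds = [
    {"tld": "example", "as_org_slugs": ["verisign"]},
    {"tld": "xn--example", "as_org_slugs": ["verisign", "missing-org"]},
]

dangling = find_dangling_org_refs(tlds, organizations)
```

An empty result means every foreign key resolves; any pair in the list names the TLD and the slug that points nowhere.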

## `organizations.json`

The `data/generated/organizations.json` file is the canonical record of the organizations that play roles for TLDs, with a reverse-index of those roles. It is built from a hand-curated identity seed (`data/manual/organizations.json`) joined against `tlds.json`, and replaces the old per-role alias files.
@@ -254,6 +271,24 @@ Each org carries an editorial `display_name` and a stable kebab-case `slug` (the

> **Consolidated subset:** this currently covers the curated multi-source organizations only. The single-source long tail (orgs that appear under one exact name in one source) is not yet included, so the absence of a TLD's operator here does not mean it has none.

## `places.json`

The `data/generated/places.json` file is the canonical record of the places associated with TLDs, with a reverse-index of their TLDs. Countries are derived mechanically from ccTLDs (ISO 3166-1 via `pycountry`); subdivisions, cities, and supranational regions come from a hand-curated seed (`data/manual/places.json`).

Each place carries a stable `slug` (ISO 3166-1 alpha-2 for countries, e.g. `gb`; a recognizable short name for subdivisions, e.g. `basque-country`; the TLD for cities, e.g. `amsterdam`), an English `name_en`, a `subtype` (`country` / `subdivision` / `city` / `supranational`), the `iso_code` where one exists, a `parent` slug for hierarchy (subdivision/city → country; dependent territory → sovereign), an optional `info_link`, and the `tlds` reverse index. A sparse `iso_designation` field carries ISO 3166-1 status for the special cases: `dependent_territory` (e.g. `bm` → `gb`), `exceptionally_reserved` (`ac`), `transitionally_reserved` (`su`), and `special_area` (`aq`). `places[]` is sorted by `slug`.

The United Kingdom is one place slugged `gb` (its ISO alpha-2), carrying both `.gb` and `.uk`; IDN ccTLDs fold into their country (e.g. `xn--p1ai` joins `ru`). Slugs and `tlds` are A-labels/ASCII; Unicode rendering is left to consumers.
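
As a rough illustration of the `parent` hierarchy, the sketch below walks upward from a place to its ancestors. The `slug`, `parent`, `iso_designation`, and `tlds` fields come from the description above; the helper name `ancestors`, the in-memory list, and the assumption that a top-level place has a missing or `null` parent are hypothetical.

```python
def ancestors(slug, places_by_slug):
    """Follow `parent` links upward and return the chain of ancestor slugs."""
    chain = []
    seen = {slug}  # guards against an accidental cycle in the seed data
    current = places_by_slug[slug].get("parent")
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = places_by_slug[current].get("parent")
    return chain


# Illustrative subset: Bermuda is a dependent territory of the United Kingdom.
places = [
    {"slug": "bm", "parent": "gb", "iso_designation": "dependent_territory", "tlds": ["bm"]},
    {"slug": "gb", "parent": None, "tlds": ["gb", "uk"]},
]
by_slug = {place["slug"]: place for place in places}

bermuda_chain = ancestors("bm", by_slug)
uk_chain = ancestors("gb", by_slug)
```

Indexing places by `slug` first keeps each lookup constant-time, which is convenient because `places[]` is already unique on that key.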

## `cultures.json`

The `data/generated/cultures.json` file records the ethno-linguistic communities that at least one TLD claims affiliation with, with a reverse-index of their TLDs. It is built from a hand-curated seed (`data/manual/cultures.json`) joined against each TLD's `cultural_affiliation` annotation.

Each culture carries a stable `slug` (the foreign key `cultural_affiliation` points at), an English `name_en`, an `info_link` to Wikipedia, an optional BCP-47 `language_code` (`null` for multilingual cultures like `swiss` / `desi` / `kiwi` / `scottish`), and the `tlds` reverse index. `cultures[]` is sorted by `slug`. The schema is intentionally minimal: descriptions and cross-artifact links belong on the canonical source (Wikipedia via `info_link`), not duplicated here.
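
A deterministic reverse index of this kind might be built roughly as follows. This is a sketch, not the project's build code: it assumes `cultural_affiliation` holds a single slug (or is absent), that the A-label lives under a hypothetical `tld` key, and the TLD labels in the sample are placeholders.

```python
from collections import defaultdict


def build_culture_index(tlds):
    """Map each culture slug to the sorted A-labels that claim it."""
    index = defaultdict(list)
    for record in tlds:
        slug = record.get("cultural_affiliation")
        if slug is not None:
            index[slug].append(record["tld"])  # "tld" is an assumed key name
    # Sort both keys and values so repeated builds produce identical output.
    return {slug: sorted(labels) for slug, labels in sorted(index.items())}


# Placeholder records, not real data.
tlds = [
    {"tld": "gamma", "cultural_affiliation": "swiss"},
    {"tld": "beta", "cultural_affiliation": "scottish"},
    {"tld": "alpha", "cultural_affiliation": "swiss"},
    {"tld": "delta"},  # no cultural affiliation
]

culture_index = build_culture_index(tlds)
```

Sorting at the end, rather than relying on input order, is what makes the artifact reproducible regardless of how `tlds.json` happens to be ordered.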

## `agreements.json`

The `data/generated/agreements.json` file is the ICANN registry-agreement-type enum with a reverse-index of the gTLDs under each. Each record carries a canonical `slug` (`base` / `non_sponsored` / `brand` / `community` / `sponsored`), a friendly `display_name`, the verbatim ICANN string under `source_names.icann`, and the `tlds` reverse index. `agreements[]` is sorted by `slug`.

## Local usage

- `make deps` - Install the project dependencies
5 changes: 4 additions & 1 deletion bin/lint
@@ -7,10 +7,13 @@ if [ ${#paths[@]} -eq 0 ]; then
paths=(src/ tests/)
fi

# Run all three linters even if an earlier one fails, so the developer
# Run all linters even if an earlier one fails, so the developer
# sees the full set of findings in one pass instead of round-tripping.
exit_code=0
uv run ruff check "${paths[@]}" || exit_code=$?
uv run ruff format --check "${paths[@]}" || exit_code=$?
uv run pyright "${paths[@]}" || exit_code=$?
# JSON parse check runs over the whole repo (independent of the path args) so a
# stray syntax error or committed merge-conflict marker fails the lint pass.
python3 bin/lint-json.py || exit_code=$?
exit $exit_code
47 changes: 47 additions & 0 deletions bin/lint-json.py
@@ -0,0 +1,47 @@
#!/usr/bin/env python3
"""Validate that every JSON file in the repo parses cleanly."""

import json
import sys
from pathlib import Path

EXCLUDED_DIRS = {".git", ".venv", "node_modules", "__pycache__"}

# Test fixtures that are intentionally invalid JSON.
EXCLUDED_FILES = {
    Path("tests/fixtures/metadata/corrupted-metadata.json"),
}


def find_json_files(root: Path):
    for path in root.rglob("*.json"):
        if any(part in EXCLUDED_DIRS for part in path.parts):
            continue
        if path.relative_to(root) in EXCLUDED_FILES:
            continue
        yield path


def main() -> int:
    root = Path.cwd()
    bad: list[tuple[Path, str]] = []
    count = 0
    for path in find_json_files(root):
        count += 1
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError) as e:
            bad.append((path.relative_to(root), str(e)))

    if bad:
        for rel, err in bad:
            print(f"{rel}: {err}", file=sys.stderr)
        print(f"\n{len(bad)} of {count} JSON file(s) failed to parse.", file=sys.stderr)
        return 1

    print(f"{count} JSON file(s) parse cleanly.")
    return 0


if __name__ == "__main__":
    sys.exit(main())