PolicyEngine · MaxGhenis · Apr 26, 2026 · Apr 27, 2026 · Apr 30, 2026 · Apr 30, 2026
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -30,7 +30,8 @@ This repo builds the `.h5` files that feed `policyengine-uk`:
 
 The enhanced FRS dataset is licensed under strict UK Data Service terms. Violating them risks losing access, which would end PolicyEngine UK.
 
-- **Never upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- **Never upload FRS-derived or UKDS-licensed data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- The public transfer artifacts documented in `docs/public_transfer_dataset.md` are the narrow exception. Upload them only through `make upload-public-transfer`, which targets the public repo intentionally.
 - **Never modify `upload_completed_datasets.py` or `utils/data_upload.py`** to change upload destinations without explicit confirmation from the data controller (currently Nikhil Woodruff).
 - **Never print, log, or output individual-level records.** Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not.
 - **If you see a private/public repo split, assume it is intentional** — ask why before changing it.

diff --git a/.github/workflows/upload-public-transfer-dataset.yaml b/.github/workflows/upload-public-transfer-dataset.yaml
@@ -0,0 +1,26 @@
+name: Upload public transfer dataset
+
+on:
+  workflow_dispatch:
+
+jobs:
+  upload-public-transfer:
+    runs-on: ubuntu-latest
+    env:
+      HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+      - name: Install package
+        run: uv pip install -e ".[dev]" --system
+      - name: Verify public transfer artifact contract
+        run: |
+          pytest policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py
+          pytest policyengine_uk_data/tests/test_policybench_transfer.py
+      - name: Upload public transfer artifacts
+        run: make upload-public-transfer
diff --git a/Makefile b/Makefile
@@ -15,6 +15,12 @@ download:
 upload:
 	python policyengine_uk_data/storage/upload_completed_datasets.py
 
+enhanced-cps-manifest:
+	python policyengine_uk_data/storage/write_enhanced_cps_manifest.py
+
+upload-public-transfer:
+	python policyengine_uk_data/storage/upload_public_transfer_dataset.py
+
 documentation:
 	pip install --pre "jupyter-book>=2"
 	jb clean docs && jb build docs

diff --git a/README.md b/README.md
@@ -22,6 +22,7 @@ This repo now also includes a public calibrated microdata file:
 
 - `policyengine_uk_data/storage/enhanced_cps_2025.h5`
 - source manifest: `policyengine_uk_data/storage/enhanced_cps_source_2025.csv`
+- artifact manifest: `policyengine_uk_data/storage/enhanced_cps_manifest_2025.json`
 
 The public UK calibrated transfer dataset starts from a public export of eligible households from
 PolicyEngine-US Enhanced CPS. In the current build that source manifest contains
@@ -47,11 +48,20 @@ This is a public calibrated dataset, not a replacement for the FRS or enhanced
 FRS. It is intended as the first step in a broader cross-country public-microdata
 strategy.
 
+The legacy `policybench_transfer_2025.h5` and
+`policybench_transfer_source_2025.csv` files remain the original
+1,000-household proof-of-method artifacts. The Python
+`create_policybench_transfer` and `save_policybench_transfer` entry points are
+backward-compatible aliases for the current `enhanced_cps` builder; calling
+them does not regenerate the legacy 1,000-household files.
+
 Programmatic entrypoints:
 
 - `policyengine_uk_data.datasets.create_enhanced_cps`
 - `policyengine_uk_data.datasets.export_enhanced_cps_source`
 - `policyengine_uk_data.datasets.save_enhanced_cps`
+- `make enhanced-cps-manifest`
+- `make upload-public-transfer`
 
 Backward-compatible aliases remain available:
 

diff --git a/docs/myst.yml b/docs/myst.yml
@@ -9,6 +9,7 @@ project:
   github: policyengine/policyengine-uk-data
   toc:
     - file: intro.md
+    - file: public_transfer_dataset.md
     - file: methodology.ipynb
     - file: imputations.md
     - file: validation/index.md

diff --git a/docs/public_transfer_dataset.md b/docs/public_transfer_dataset.md
@@ -0,0 +1,48 @@
+# Public UK transfer dataset
+
+The public UK transfer dataset is an openly distributable benchmark artifact.
+It is not native UK survey microdata.
+
+The 2025 artifact starts from a public export of benchmark-compatible
+PolicyEngine US Enhanced CPS households. The builder maps those households into
+UK-facing PolicyEngine inputs, assigns synthetic UK geography, populates input
+leaves such as council tax bands, vehicle ownership, pensions, disability/PIP,
+consumption, and capital gains, and recalibrates household weights to selected
+UK national, regional, and country targets.
+
+The current public artifact consists of three files:
+
+- `policyengine_uk_data/storage/enhanced_cps_2025.h5`
+- `policyengine_uk_data/storage/enhanced_cps_source_2025.csv`
+- `policyengine_uk_data/storage/enhanced_cps_manifest_2025.json`
+
+The artifact manifest is the source of record for row counts, checksums, build
+assumptions, weight diagnostics, and loss diagnostics. The checked-in 2025
+manifest reports 28,532 households in both the source CSV and H5 file, 58,848
+people in the H5 file, an effective sample size of about 11,197 households, and
+a weight share of about 0.52% for the ten highest-weighted households.
+
+## Intended use
+
+Use this dataset for public demos, reproducible examples, and public benchmark
+analysis where restricted UK microdata cannot be redistributed.
+
+Do not use this dataset as a substitute for FRS or enhanced FRS, as evidence of
+the UK joint household distribution, or as administrative ground truth. Aggregate
+calibration can improve target fit without recovering the native UK joint
+distribution.
+
+## Versioning
+
+The public artifact should be cited by path, manifest, package version, and
+checksum. The 2025 artifact uses a pinned USD-to-GBP exchange rate of 0.759 from
+the IRS 2025 yearly average exchange-rate table. The builder deliberately does
+not call a live foreign-exchange API.
+
+## Legacy files
+
+The `policybench_transfer_2025.h5` and `policybench_transfer_source_2025.csv`
+files are retained as the original 1,000-household proof-of-method artifacts.
+Current Python entry points named `create_policybench_transfer` and
+`save_policybench_transfer` are aliases for the current 28,532-household
+`enhanced_cps` builder.
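
The weight diagnostics quoted in `docs/public_transfer_dataset.md` are not defined in this diff. As a minimal sketch, assuming the effective sample size is Kish's formula, (sum of weights)^2 / (sum of squared weights), and the top-10 share is the ten largest household weights divided by total weight, they could be computed as below. The function names are hypothetical and are not taken from the builder.

```python
def kish_effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    sum_of_squares = sum(w * w for w in weights)
    return total * total / sum_of_squares if sum_of_squares else 0.0


def top_k_weight_share(weights, k=10):
    """Share of total weight held by the k largest weights."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(sorted(weights, reverse=True)[:k]) / total


# Toy weights: 90 households at weight 1 and 10 households at weight 11.
toy_weights = [1.0] * 90 + [11.0] * 10
print(kish_effective_sample_size(toy_weights))
print(top_k_weight_share(toy_weights, k=10))
```

Under this definition, equal weights give an effective sample size equal to the household count, and concentrating weight in a few households pulls it down.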
diff --git a/policyengine_uk_data/datasets/__init__.py b/policyengine_uk_data/datasets/__init__.py
@@ -1,5 +1,6 @@
 from .enhanced_cps import (
     ENHANCED_CPS_FILE,
+    ENHANCED_CPS_MANIFEST_FILE,
     ENHANCED_CPS_SOURCE_FILE,
     create_enhanced_cps,
     export_enhanced_cps_source,
@@ -15,6 +16,7 @@
 
 __all__ = [
     "ENHANCED_CPS_FILE",
+    "ENHANCED_CPS_MANIFEST_FILE",
     "ENHANCED_CPS_SOURCE_FILE",
     "create_enhanced_cps",
     "export_enhanced_cps_source",
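
The docs added in this PR describe the manifest as the source of record for checksums, but the manifest schema is not shown in this diff. The sketch below therefore covers only the hashing side: it computes a file digest in chunks and compares it with an expected value. SHA-256 is an assumption, and both helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a throwaway file rather than the real H5 artifact.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    print(sha256_of_file(sample))
```

Reading in chunks keeps memory use flat for large H5 files. The expected digest would come from whichever manifest field records it.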

diff --git a/policyengine_uk_data/datasets/enhanced_cps.py b/policyengine_uk_data/datasets/enhanced_cps.py
@@ -23,6 +23,7 @@
 
 ENHANCED_CPS_SOURCE_FILE = STORAGE_FOLDER / "enhanced_cps_source_2025.csv"
 ENHANCED_CPS_FILE = STORAGE_FOLDER / "enhanced_cps_2025.h5"
+ENHANCED_CPS_MANIFEST_FILE = STORAGE_FOLDER / "enhanced_cps_manifest_2025.json"
 COUNCIL_TAX_BANDS_FILE = STORAGE_FOLDER / "council_tax_bands_2024.csv"
 
 # Build assumptions are pinned so the checked-in H5 is reproducible. Do not
@@ -35,26 +36,13 @@
     "yearly-average-currency-exchange-rates"
 )
 
-# 2025/26 reported-benefit mapping assumptions used only to populate UK input
-# leaves from U.S. source records. PolicyEngine UK applies its own parameters
-# when calculating derived tax and benefit outputs.
+# 2025/26 transfer assumptions used only to populate UK input leaves from U.S.
+# source records. PolicyEngine UK applies its own parameters when calculating
+# derived tax and benefit outputs.
 NEW_STATE_PENSION_2025 = 224.96 * 52
 DIVIDEND_YIELD_FOR_WEALTH_IMPUTATION = 0.03
 RENTAL_YIELD_FOR_WEALTH_IMPUTATION = 0.04
 
-PIP_2025_WEEKLY_RATES = {
-    "daily_living": {
-        "NONE": 0.0,
-        "STANDARD": 73.89,
-        "ENHANCED": 110.40,
-    },
-    "mobility": {
-        "NONE": 0.0,
-        "STANDARD": 29.19,
-        "ENHANCED": 77.04,
-    },
-}
-
 REGION_SHARES = (
     ("NORTH_EAST", 0.04),
     ("NORTH_WEST", 0.11),
@@ -248,11 +236,6 @@ def _pip_category(person: dict) -> str:
     return "ENHANCED" if severe_signal or low_earnings else "STANDARD"
 
 
-def _pip_reported_amount(category: str, component: str) -> float:
-    weekly = PIP_2025_WEEKLY_RATES[component][category]
-    return round(weekly * 52, 2)
-
-
 def _household_cash_income(people: list[dict], exchange_rate: float) -> float:
     total = 0.0
     for person in people:
@@ -688,14 +671,8 @@ def _build_base_dataset(
                     if bool(inputs.get("is_blind", False))
                     else 0.0,
                     "is_disabled_for_benefits": bool(inputs.get("is_disabled", False)),
-                    "pip_dl_reported": _pip_reported_amount(
-                        pip_category,
-                        "daily_living",
-                    ),
-                    "pip_m_reported": _pip_reported_amount(
-                        pip_category,
-                        "mobility",
-                    ),
+                    "pip_dl_category": pip_category,
+                    "pip_m_category": pip_category,
                     "hours_worked": float(
                         inputs.get(
                             "weekly_hours_worked",

diff --git a/policyengine_uk_data/datasets/frs.py b/policyengine_uk_data/datasets/frs.py
@@ -56,6 +56,26 @@
     "disabled_students_allowance_course_eligible",
     "disabled_students_allowance_has_qualifying_condition",
 )
+PIP_CATEGORY_SAFETY_MARGIN = 0.1
+
+
+def _category_from_reported(
+    reported,
+    thresholds: tuple[tuple[str, float], ...],
+) -> np.ndarray:
+    """Map annual reported amounts to PE-UK categories (thresholds high to low)."""
+
+    reported_weekly = pd.Series(reported).fillna(0).astype(float) / WEEKS_IN_YEAR
+    return np.select(
+        [
+            reported_weekly >= rate * (1 - PIP_CATEGORY_SAFETY_MARGIN)
+            for _, rate in thresholds
+        ],
+        [category for category, _ in thresholds],
+        default="NONE",
+    )
+
+
 BENEFITS_IN_OWN_RIGHT_REPORTED_COLUMNS = (
     "universal_credit_reported",
     "jsa_contrib_reported",
@@ -1221,6 +1241,76 @@ def determine_education_level(fted_val, typeed2_val, age_val):
         household.index,
     )
 
+    benefit = CountryTaxBenefitSystem().parameters(year).gov.dwp
+
+    pe_person["aa_category"] = _category_from_reported(
+        pe_person["attendance_allowance_reported"],
+        (
+            ("HIGHER", benefit.attendance_allowance.higher),
+            ("LOWER", benefit.attendance_allowance.lower),
+        ),
+    )
+    pe_person["dla_sc_category"] = _category_from_reported(
+        pe_person["dla_sc_reported"],
+        (
+            ("HIGHER", benefit.dla.self_care.higher),
+            ("MIDDLE", benefit.dla.self_care.middle),
+            ("LOWER", benefit.dla.self_care.lower),
+        ),
+    )
+    pe_person["dla_m_category"] = _category_from_reported(
+        pe_person["dla_m_reported"],
+        (
+            ("HIGHER", benefit.dla.mobility.higher),
+            ("LOWER", benefit.dla.mobility.lower),
+        ),
+    )
+    pe_person["pip_dl_category"] = _category_from_reported(
+        pe_person["pip_dl_reported"],
+        (
+            ("ENHANCED", benefit.pip.daily_living.enhanced),
+            ("STANDARD", benefit.pip.daily_living.standard),
+        ),
+    )
+    pe_person["pip_m_category"] = _category_from_reported(
+        pe_person["pip_m_reported"],
+        (
+            ("ENHANCED", benefit.pip.mobility.enhanced),
+            ("STANDARD", benefit.pip.mobility.standard),
+        ),
+    )
+
+    has_pip = (pe_person["pip_dl_category"] != "NONE") | (
+        pe_person["pip_m_category"] != "NONE"
+    )
+    has_dla = (pe_person["dla_sc_category"] != "NONE") | (
+        pe_person["dla_m_category"] != "NONE"
+    )
+    pe_person["is_disabled_for_benefits"] = has_dla | has_pip
+
+    pe_person["is_enhanced_disabled_for_benefits"] = (
+        pe_person["dla_sc_category"] == "HIGHER"
+    )
+
+    # Child Tax Credit Regulations 2002 reg. 8
+    paragraph_3 = pe_person["dla_sc_category"] == "HIGHER"
+    paragraph_4 = pe_person["pip_dl_category"] == "ENHANCED"
+    paragraph_5 = pe_person.afcs_reported > 0
+    pe_person["is_severely_disabled_for_benefits"] = (
+        paragraph_3 | paragraph_4 | paragraph_5
+    )
+
+    pe_person = pe_person.drop(
+        columns=[
+            "attendance_allowance_reported",
+            "dla_sc_reported",
+            "dla_m_reported",
+            "pip_dl_reported",
+            "pip_m_reported",
+        ],
+        errors="ignore",
+    )
+
     dataset = UKSingleYearDataset(
         person=pe_person,
         benunit=pe_benunit,
@@ -1266,37 +1356,6 @@ def determine_education_level(fted_val, typeed2_val, age_val):
 
     pe_household["brma"] = brmas
 
-    parameters = sim.tax_benefit_system.parameters
-    benefit = parameters(year).gov.dwp
-
-    pe_person["is_disabled_for_benefits"] = (
-        pe_person.dla_sc_reported
-        + pe_person.dla_m_reported
-        + pe_person.pip_m_reported
-        + pe_person.pip_dl_reported
-    ) > 0
-
-    THRESHOLD_SAFETY_GAP = 1 * WEEKS_IN_YEAR
-
-    pe_person["is_enhanced_disabled_for_benefits"] = (
-        pe_person.dla_sc_reported
-        > benefit.dla.self_care.higher * WEEKS_IN_YEAR - THRESHOLD_SAFETY_GAP
-    )
-
-    # Child Tax Credit Regulations 2002 s. 8
-    paragraph_3 = (
-        pe_person.dla_sc_reported
-        >= benefit.dla.self_care.higher * WEEKS_IN_YEAR - THRESHOLD_SAFETY_GAP
-    )
-    paragraph_4 = (
-        pe_person.pip_dl_reported
-        >= benefit.pip.daily_living.enhanced * WEEKS_IN_YEAR - THRESHOLD_SAFETY_GAP
-    )
-    paragraph_5 = pe_person.afcs_reported > 0
-    pe_person["is_severely_disabled_for_benefits"] = (
-        paragraph_3 | paragraph_4 | paragraph_5
-    )
-
     # Dataset-side claimant-state approximations for future legacy ESA/JSA
     # modelling. These are explicit proxies based on observed survey
     # conditions, not legislative determinations.
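
The `_category_from_reported` helper added to `frs.py` picks the first threshold whose weekly rate, reduced by the 10% safety margin, is met by the reported weekly amount. The sketch below is a dependency-free re-implementation for a single value, not the PR's vectorised code. It uses the PIP daily-living weekly rates from the `PIP_2025_WEEKLY_RATES` table removed from `enhanced_cps.py`, and it assumes `WEEKS_IN_YEAR` is 52.

```python
WEEKS_IN_YEAR = 52  # assumption: the value used in frs.py is not shown in the diff
SAFETY_MARGIN = 0.1  # mirrors PIP_CATEGORY_SAFETY_MARGIN


def category_from_reported(annual_amount, thresholds, margin=SAFETY_MARGIN):
    """Return the first category whose margin-adjusted weekly rate is reached."""
    weekly = (annual_amount or 0.0) / WEEKS_IN_YEAR  # treat None as zero
    for category, weekly_rate in thresholds:
        if weekly >= weekly_rate * (1 - margin):
            return category
    return "NONE"


# Highest rate first, as in the frs.py call sites.
PIP_DAILY_LIVING = (("ENHANCED", 110.40), ("STANDARD", 73.89))

for annual in (110.40 * 52, 100.0 * 52, 73.89 * 52, 60.0 * 52, 0.0):
    print(round(annual, 2), category_from_reported(annual, PIP_DAILY_LIVING))
```

Threshold order matters: the first match wins, so listing `STANDARD` before `ENHANCED` would classify every enhanced-rate recipient as `STANDARD`. The margin lets an amount slightly under the statutory rate, here 100.00 a week against 110.40, still map to `ENHANCED`.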