3 changes: 2 additions & 1 deletion .github/CONTRIBUTING.md
@@ -30,7 +30,8 @@ This repo builds the `.h5` files that feed `policyengine-uk`:

The enhanced FRS dataset is licensed under strict UK Data Service terms. Violating them risks losing access, which would end PolicyEngine UK.

-- **Never upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- **Never upload FRS-derived or UKDS-licensed data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- The public transfer artifacts documented in `docs/public_transfer_dataset.md` are the narrow exception. Upload them only through `make upload-public-transfer`, which targets the public repo intentionally.
- **Never modify `upload_completed_datasets.py` or `utils/data_upload.py`** to change upload destinations without explicit confirmation from the data controller (currently Nikhil Woodruff).
- **Never print, log, or output individual-level records.** Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not.
- **If you see a private/public repo split, assume it is intentional** — ask why before changing it.
26 changes: 26 additions & 0 deletions .github/workflows/upload-public-transfer-dataset.yaml
@@ -0,0 +1,26 @@
name: Upload public transfer dataset

on:
  workflow_dispatch:

jobs:
  upload-public-transfer:
    runs-on: ubuntu-latest
    env:
      HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.13
      - name: Install package
        run: uv pip install -e ".[dev]" --system
      - name: Verify public transfer artifact contract
        run: |
          pytest policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py
          pytest policyengine_uk_data/tests/test_policybench_transfer.py
      - name: Upload public transfer artifacts
        run: make upload-public-transfer
6 changes: 6 additions & 0 deletions Makefile
@@ -15,6 +15,12 @@ download:
upload:
	python policyengine_uk_data/storage/upload_completed_datasets.py

enhanced-cps-manifest:
	python policyengine_uk_data/storage/write_enhanced_cps_manifest.py

upload-public-transfer:
	python policyengine_uk_data/storage/upload_public_transfer_dataset.py

documentation:
	pip install --pre "jupyter-book>=2"
	jb clean docs && jb build docs
10 changes: 10 additions & 0 deletions README.md
@@ -22,6 +22,7 @@ This repo now also includes a public calibrated microdata file:

- `policyengine_uk_data/storage/enhanced_cps_2025.h5`
- source manifest: `policyengine_uk_data/storage/enhanced_cps_source_2025.csv`
- artifact manifest: `policyengine_uk_data/storage/enhanced_cps_manifest_2025.json`

The public UK calibrated transfer dataset starts from a public export of eligible households from
PolicyEngine-US Enhanced CPS. In the current build that source manifest contains
@@ -47,11 +48,20 @@ This is a public calibrated dataset, not a replacement for the FRS or enhanced
FRS. It is intended as the first step in a broader cross-country public-microdata
strategy.

The legacy `policybench_transfer_2025.h5` and
`policybench_transfer_source_2025.csv` files remain the original
1,000-household proof-of-method artifacts. The Python
`create_policybench_transfer` and `save_policybench_transfer` entry points are
backward-compatible aliases for the current `enhanced_cps` builder, not a
request to regenerate the legacy 1,000-household files.

Programmatic entrypoints:

- `policyengine_uk_data.datasets.create_enhanced_cps`
- `policyengine_uk_data.datasets.export_enhanced_cps_source`
- `policyengine_uk_data.datasets.save_enhanced_cps`
- `make enhanced-cps-manifest`
- `make upload-public-transfer`

Backward-compatible aliases remain available:

1 change: 1 addition & 0 deletions docs/myst.yml
@@ -9,6 +9,7 @@ project:
  github: policyengine/policyengine-uk-data
  toc:
    - file: intro.md
    - file: public_transfer_dataset.md
    - file: methodology.ipynb
    - file: imputations.md
    - file: validation/index.md
48 changes: 48 additions & 0 deletions docs/public_transfer_dataset.md
@@ -0,0 +1,48 @@
# Public UK transfer dataset

The public UK transfer dataset is an openly distributable benchmark artifact.
It is not native UK survey microdata.

The 2025 artifact starts from a public export of benchmark-compatible
PolicyEngine US Enhanced CPS households. The builder maps those households into
UK-facing PolicyEngine inputs, assigns synthetic UK geography, populates input
leaves such as council tax bands, vehicle ownership, pensions, disability/PIP,
consumption, and capital gains, and recalibrates household weights to selected
UK national, regional, and country targets.

The current public artifact is:

- `policyengine_uk_data/storage/enhanced_cps_2025.h5`
- `policyengine_uk_data/storage/enhanced_cps_source_2025.csv`
- `policyengine_uk_data/storage/enhanced_cps_manifest_2025.json`

The artifact manifest is the source of record for row counts, checksums, build
assumptions, weight diagnostics, and loss diagnostics. The checked-in 2025
manifest reports 28,532 households in both the source CSV and H5 file, 58,848
people in the H5 file, an effective sample size of about 11,197 households, and
a top-10 household-weight share of about 0.52%.
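
The effective sample size and top-10 weight share quoted above are weight diagnostics. The sketch below shows one common way to compute such figures. It assumes the Kish definition of effective sample size and assumes that "top-10 share" means the ten largest household weights divided by the total weight; the source does not state which definitions the manifest uses, and the function names here are illustrative rather than part of the package.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / sum of w^2 (assumed definition)."""
    total = sum(weights)
    sum_of_squares = sum(w * w for w in weights)
    if sum_of_squares == 0:
        return 0.0
    return total * total / sum_of_squares


def top_k_weight_share(weights, k=10):
    """Share of total weight held by the k largest weights (assumed definition)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(sorted(weights, reverse=True)[:k]) / total


# Toy weights for illustration only; these are not taken from the artifact.
equal_ess = effective_sample_size([1.0, 1.0, 1.0, 1.0])  # equal weights: ESS equals the row count
skewed_ess = effective_sample_size([3.0, 1.0])           # 4^2 / (9 + 1) = 1.6
skewed_share = top_k_weight_share([3.0, 1.0], k=1)       # 3 / 4 = 0.75
```

Under these definitions, an effective sample size well below the row count indicates that calibration has concentrated weight on a subset of households, while a small top-10 share indicates that no handful of households dominates the totals.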

## Intended use

Use this dataset for public demos, reproducible examples, and public benchmark
analysis where restricted UK microdata cannot be redistributed.

Do not use this dataset as a substitute for FRS or enhanced FRS, as evidence of
the UK joint household distribution, or as administrative ground truth. Aggregate
calibration can improve target fit without recovering the native UK joint
distribution.

## Versioning

The public artifact should be cited by path, manifest, package version, and
checksum. The 2025 artifact uses a pinned USD-to-GBP exchange rate of 0.759 from
the IRS 2025 yearly average exchange-rate table. The builder deliberately does
not call a live foreign-exchange API.
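
Citing by checksum implies that a reader can recompute the checksum locally. The following is a minimal sketch of that check, assuming SHA-256 as the hash; the source does not say which algorithm or manifest field names are used, so the expected digest is passed in by the caller and both helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A caller would read the expected digest for `enhanced_cps_2025.h5` from the manifest by whatever key it actually uses and pass it to `matches_expected` before relying on the file.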

## Legacy files

The `policybench_transfer_2025.h5` and `policybench_transfer_source_2025.csv`
files are retained as the original 1,000-household proof-of-method artifacts.
Current Python entry points named `create_policybench_transfer` and
`save_policybench_transfer` are aliases for the current 28,532-household
`enhanced_cps` builder.
2 changes: 2 additions & 0 deletions policyengine_uk_data/datasets/__init__.py
@@ -1,5 +1,6 @@
from .enhanced_cps import (
    ENHANCED_CPS_FILE,
    ENHANCED_CPS_MANIFEST_FILE,
    ENHANCED_CPS_SOURCE_FILE,
    create_enhanced_cps,
    export_enhanced_cps_source,
@@ -15,6 +16,7 @@

__all__ = [
    "ENHANCED_CPS_FILE",
    "ENHANCED_CPS_MANIFEST_FILE",
    "ENHANCED_CPS_SOURCE_FILE",
    "create_enhanced_cps",
    "export_enhanced_cps_source",
35 changes: 6 additions & 29 deletions policyengine_uk_data/datasets/enhanced_cps.py
@@ -23,6 +23,7 @@

ENHANCED_CPS_SOURCE_FILE = STORAGE_FOLDER / "enhanced_cps_source_2025.csv"
ENHANCED_CPS_FILE = STORAGE_FOLDER / "enhanced_cps_2025.h5"
ENHANCED_CPS_MANIFEST_FILE = STORAGE_FOLDER / "enhanced_cps_manifest_2025.json"
COUNCIL_TAX_BANDS_FILE = STORAGE_FOLDER / "council_tax_bands_2024.csv"

# Build assumptions are pinned so the checked-in H5 is reproducible. Do not
@@ -35,26 +36,13 @@
"yearly-average-currency-exchange-rates"
)

-# 2025/26 reported-benefit mapping assumptions used only to populate UK input
-# leaves from U.S. source records. PolicyEngine UK applies its own parameters
-# when calculating derived tax and benefit outputs.
+# 2025/26 transfer assumptions used only to populate UK input leaves from U.S.
+# source records. PolicyEngine UK applies its own parameters when calculating
+# derived tax and benefit outputs.
NEW_STATE_PENSION_2025 = 224.96 * 52
DIVIDEND_YIELD_FOR_WEALTH_IMPUTATION = 0.03
RENTAL_YIELD_FOR_WEALTH_IMPUTATION = 0.04

-PIP_2025_WEEKLY_RATES = {
-    "daily_living": {
-        "NONE": 0.0,
-        "STANDARD": 73.89,
-        "ENHANCED": 110.40,
-    },
-    "mobility": {
-        "NONE": 0.0,
-        "STANDARD": 29.19,
-        "ENHANCED": 77.04,
-    },
-}
-
REGION_SHARES = (
    ("NORTH_EAST", 0.04),
    ("NORTH_WEST", 0.11),
@@ -248,11 +236,6 @@ def _pip_category(person: dict) -> str:
    return "ENHANCED" if severe_signal or low_earnings else "STANDARD"


-def _pip_reported_amount(category: str, component: str) -> float:
-    weekly = PIP_2025_WEEKLY_RATES[component][category]
-    return round(weekly * 52, 2)
-
-
def _household_cash_income(people: list[dict], exchange_rate: float) -> float:
    total = 0.0
    for person in people:
@@ -688,14 +671,8 @@ def _build_base_dataset(
if bool(inputs.get("is_blind", False))
else 0.0,
"is_disabled_for_benefits": bool(inputs.get("is_disabled", False)),
-"pip_dl_reported": _pip_reported_amount(
-    pip_category,
-    "daily_living",
-),
-"pip_m_reported": _pip_reported_amount(
-    pip_category,
-    "mobility",
-),
+"pip_dl_category": pip_category,
+"pip_m_category": pip_category,
"hours_worked": float(
    inputs.get(
        "weekly_hours_worked",
121 changes: 90 additions & 31 deletions policyengine_uk_data/datasets/frs.py
@@ -56,6 +56,26 @@
    "disabled_students_allowance_course_eligible",
    "disabled_students_allowance_has_qualifying_condition",
)
PIP_CATEGORY_SAFETY_MARGIN = 0.1


def _category_from_reported(
    reported,
    thresholds: tuple[tuple[str, float], ...],
) -> np.ndarray:
    """Convert annual reported amounts to PE-UK category inputs."""

    reported_weekly = pd.Series(reported).fillna(0).astype(float) / WEEKS_IN_YEAR
    return np.select(
        [
            reported_weekly >= rate * (1 - PIP_CATEGORY_SAFETY_MARGIN)
            for _, rate in thresholds
        ],
        [category for category, _ in thresholds],
        default="NONE",
    )


BENEFITS_IN_OWN_RIGHT_REPORTED_COLUMNS = (
    "universal_credit_reported",
    "jsa_contrib_reported",
@@ -1221,6 +1241,76 @@ def determine_education_level(fted_val, typeed2_val, age_val):
household.index,
)

    benefit = CountryTaxBenefitSystem().parameters(year).gov.dwp

    pe_person["aa_category"] = _category_from_reported(
        pe_person["attendance_allowance_reported"],
        (
            ("HIGHER", benefit.attendance_allowance.higher),
            ("LOWER", benefit.attendance_allowance.lower),
        ),
    )
    pe_person["dla_sc_category"] = _category_from_reported(
        pe_person["dla_sc_reported"],
        (
            ("HIGHER", benefit.dla.self_care.higher),
            ("MIDDLE", benefit.dla.self_care.middle),
            ("LOWER", benefit.dla.self_care.lower),
        ),
    )
    pe_person["dla_m_category"] = _category_from_reported(
        pe_person["dla_m_reported"],
        (
            ("HIGHER", benefit.dla.mobility.higher),
            ("LOWER", benefit.dla.mobility.lower),
        ),
    )
    pe_person["pip_dl_category"] = _category_from_reported(
        pe_person["pip_dl_reported"],
        (
            ("ENHANCED", benefit.pip.daily_living.enhanced),
            ("STANDARD", benefit.pip.daily_living.standard),
        ),
    )
    pe_person["pip_m_category"] = _category_from_reported(
        pe_person["pip_m_reported"],
        (
            ("ENHANCED", benefit.pip.mobility.enhanced),
            ("STANDARD", benefit.pip.mobility.standard),
        ),
    )

    has_pip = (pe_person["pip_dl_category"] != "NONE") | (
        pe_person["pip_m_category"] != "NONE"
    )
    has_dla = (pe_person["dla_sc_category"] != "NONE") | (
        pe_person["dla_m_category"] != "NONE"
    )
    pe_person["is_disabled_for_benefits"] = has_dla | has_pip

    pe_person["is_enhanced_disabled_for_benefits"] = (
        pe_person["dla_sc_category"] == "HIGHER"
    )

    # Child Tax Credit Regulations 2002 s. 8
    paragraph_3 = pe_person["dla_sc_category"] == "HIGHER"
    paragraph_4 = pe_person["pip_dl_category"] == "ENHANCED"
    paragraph_5 = pe_person.afcs_reported > 0
    pe_person["is_severely_disabled_for_benefits"] = (
        paragraph_3 | paragraph_4 | paragraph_5
    )

    pe_person = pe_person.drop(
        columns=[
            "attendance_allowance_reported",
            "dla_sc_reported",
            "dla_m_reported",
            "pip_dl_reported",
            "pip_m_reported",
        ],
        errors="ignore",
    )

    dataset = UKSingleYearDataset(
        person=pe_person,
        benunit=pe_benunit,
@@ -1266,37 +1356,6 @@ def determine_education_level(fted_val, typeed2_val, age_val):

pe_household["brma"] = brmas

-    parameters = sim.tax_benefit_system.parameters
-    benefit = parameters(year).gov.dwp
-
-    pe_person["is_disabled_for_benefits"] = (
-        pe_person.dla_sc_reported
-        + pe_person.dla_m_reported
-        + pe_person.pip_m_reported
-        + pe_person.pip_dl_reported
-    ) > 0
-
-    THRESHOLD_SAFETY_GAP = 1 * WEEKS_IN_YEAR
-
-    pe_person["is_enhanced_disabled_for_benefits"] = (
-        pe_person.dla_sc_reported
-        > benefit.dla.self_care.higher * WEEKS_IN_YEAR - THRESHOLD_SAFETY_GAP
-    )
-
-    # Child Tax Credit Regulations 2002 s. 8
-    paragraph_3 = (
-        pe_person.dla_sc_reported
-        >= benefit.dla.self_care.higher * WEEKS_IN_YEAR - THRESHOLD_SAFETY_GAP
-    )
-    paragraph_4 = (
-        pe_person.pip_dl_reported
-        >= benefit.pip.daily_living.enhanced * WEEKS_IN_YEAR - THRESHOLD_SAFETY_GAP
-    )
-    paragraph_5 = pe_person.afcs_reported > 0
-    pe_person["is_severely_disabled_for_benefits"] = (
-        paragraph_3 | paragraph_4 | paragraph_5
-    )
-
# Dataset-side claimant-state approximations for future legacy ESA/JSA
# modelling. These are explicit proxies based on observed survey
# conditions, not legislative determinations.
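
The thresholding rule in `_category_from_reported` can be traced on single values. The sketch below is a dependency-free scalar restatement of that rule, not the vectorised helper itself: the function name is illustrative, `WEEKS_IN_YEAR = 52` is an assumption (the diff imports that constant from elsewhere), and the weekly rates are the PIP daily-living figures from the `PIP_2025_WEEKLY_RATES` table that this diff removes from `enhanced_cps.py`, not live PolicyEngine UK parameters.

```python
SAFETY_MARGIN = 0.1  # mirrors PIP_CATEGORY_SAFETY_MARGIN in the diff
WEEKS_IN_YEAR = 52   # assumed value for illustration


def category_from_reported(reported_annual, thresholds):
    """Map one annual reported amount to a category; the first matching threshold wins."""
    # Treat missing values (None or NaN) as zero, as fillna(0) does in the diff.
    if reported_annual is None or reported_annual != reported_annual:
        reported_annual = 0.0
    weekly = float(reported_annual) / WEEKS_IN_YEAR
    for category, rate in thresholds:
        if weekly >= rate * (1 - SAFETY_MARGIN):
            return category
    return "NONE"


# Highest rate first: np.select, like this loop, returns the first condition that holds.
pip_daily_living = (("ENHANCED", 110.40), ("STANDARD", 73.89))

exact_enhanced = category_from_reported(110.40 * 52, pip_daily_living)
under_reported = category_from_reported(100.00 * 52, pip_daily_living)
exact_standard = category_from_reported(73.89 * 52, pip_daily_living)
below_standard = category_from_reported(60.00 * 52, pip_daily_living)
missing_value = category_from_reported(None, pip_daily_living)
```

With a 10% margin, the enhanced cut-off is 99.36 per week and the standard cut-off is about 66.50, so a report of 100 per week still maps to `ENHANCED` while 60 per week maps to `NONE`. Listing thresholds in ascending order would instead assign every qualifying amount to the lowest category, which is why the calls in the diff list the highest rate first.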