Feat/cmip7 awiesm3 veg hr#266
Open
JanStreffing wants to merge 307 commits into
1a42875 to
5617a18
Compare
The single failing test ( Something to worry about? @pgierz
@pgierz, will you review this or should I go for it?
I will look
CMIP7 salinity units are "1E-03" (dimensionless with scaling factor). pint-xarray cannot convert to such targets, raising "Unit expression cannot have a scaling factor". When both source (e.g. psu) and target are dimensionless, relabel without numeric conversion since the values are already in the correct range.
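The relabel rule described above can be sketched in a few lines. This is a minimal pure-Python illustration, not pycmor's code: `DIMENSIONLESS_UNITS`, `convert_or_relabel` and `failing_convert` are hypothetical names, and the real step operates on pint-xarray objects rather than lists.

```python
# Hypothetical sketch; the names below are illustrative, not pycmor's API.
# Unit labels treated as dimensionless for the purpose of this example.
DIMENSIONLESS_UNITS = {"1", "psu", "1e-3", "1E-03", "0.001"}


def convert_or_relabel(values, source_units, target_units, convert):
    """Return (values, units), relabelling when both units are dimensionless.

    `convert` is any callable doing a real numeric conversion; it is only
    invoked when at least one side carries physical dimensions.
    """
    if source_units in DIMENSIONLESS_UNITS and target_units in DIMENSIONLESS_UNITS:
        # Values are already in the expected range: change the label only.
        return list(values), target_units
    return convert(values, source_units, target_units), target_units


def failing_convert(values, source_units, target_units):
    # Stand-in for a converter that rejects scaling-factor targets.
    raise ValueError("Unit expression cannot have a scaling factor")


salinity, units = convert_or_relabel([34.7, 35.1], "psu", "1E-03", failing_convert)
```

Because the dimensionless check runs first, the failing converter is never reached for the psu case, which is the behaviour the commit message describes.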
All ~33 custom step functions hardcoded the CMOR variable name (e.g. result.name = "zostoga") instead of using rule.model_variable. This caused KeyError in set_variable when model_variable differs from the CMOR name (e.g. model_variable="temp" but name was "zostoga"). Also fix compute_mass_transport to handle W-level data (nz=48 interfaces) by averaging to cell centers (nz=47) before multiplying by layer thickness.
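As a rough illustration of the W-level handling mentioned above, the sketch below averages adjacent interface values to cell centres and then weights by layer thickness. The function names and the toy column are assumptions made for this example; the actual `compute_mass_transport` works on xarray data and is not reproduced here.

```python
def interfaces_to_centers(w_values):
    """Average N+1 interface values to N cell-centre values."""
    return [0.5 * (upper + lower) for upper, lower in zip(w_values[:-1], w_values[1:])]


def weight_by_thickness(center_values, layer_thickness):
    """Multiply each cell-centre value by its layer thickness."""
    if len(center_values) != len(layer_thickness):
        raise ValueError("value and thickness columns must have equal length")
    return [value * dz for value, dz in zip(center_values, layer_thickness)]


# Toy column: 4 interfaces -> 3 layers (the case in the PR is 48 -> 47).
w_on_interfaces = [0.0, 2.0, 4.0, 6.0]
dz = [10.0, 20.0, 30.0]
centers = interfaces_to_centers(w_on_interfaces)
weighted = weight_by_thickness(centers, dz)
```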
Prefect's embedded SQLite server cannot handle 27+ concurrent rule submissions, causing cascading 503 errors. Set parallel: False to process rules serially until a proper Prefect server is available. Applied to all 17 awi-esm3-veg-hr configs and core2 test config.
Runs pycmor on a compute node with local Dask cluster, Prefect on local /tmp to avoid NFS SQLite locking. All 27 core_ocean rules complete successfully with this setup.
- field_def: add u10m/v10m aliases for 10u/10v (YACC parser cannot handle field names starting with digits in expressions)
- file_def: use u10m/v10m in sfcWindmax expression, move evspsblpot and sbl from separate monthly file to existing daily land file, rename legacy 6h file suffix to avoid conflicts
New env knob ``SHARD_DRS`` (default ``off``) toggles pycmor's
``enable_output_subdirs`` config option. When on:
* The injected per-shard yaml sets
``pycmor.enable_output_subdirs = True``, so files.py's filename
builder appends the DRS sub-tree returned by GlobalAttributes
.subdir_path():
<mip_era>/<activity_id>/<institution_id>/<source_id>/
<experiment_id>/<member_id>/<table_id>/<variable_id>/
<grid_label>/v<YYYYMMDD>/
e.g. ``CMIP7/CMIP/AWI/AWI-ESM-3/picontrol/r1i1p1f1/Amon/tas/gn/
v20260515/tas_<...>.nc``
* The submitter collapses the per-tier ``<tier>/cmorized/`` OUTSUB
to ``.`` so all 17 tiers land in one shared DRS root. The non-DRS
nesting was a pycmor-specific organizational layer that downstream
consumers don't expect at the start of the path.
Off by default — keeps the legacy per-tier flat layout for any current
consumers. Turn on via ``SHARD_DRS=on bash submit_hr_year_shards.sh ...``
for production / publication-ready output.
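The DRS layout listed above can be illustrated with a small path builder. This is a sketch under stated assumptions: `drs_subdir` and `output_dir` are hypothetical helpers written for this example and do not reproduce `GlobalAttributes.subdir_path()` or the filename builder in files.py.

```python
import posixpath
from datetime import date

# Order of the DRS components as listed in the commit message.
DRS_KEYS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label",
)


def drs_subdir(attrs, version_date):
    """Join the DRS components and a v<YYYYMMDD> version directory."""
    parts = [attrs[key] for key in DRS_KEYS]
    parts.append("v" + version_date.strftime("%Y%m%d"))
    return posixpath.join(*parts)


def output_dir(root, attrs, version_date, shard_drs):
    """Return the flat root when the toggle is off, the DRS tree when on."""
    if not shard_drs:
        return root
    return posixpath.join(root, drs_subdir(attrs, version_date))


example_attrs = {
    "mip_era": "CMIP7", "activity_id": "CMIP", "institution_id": "AWI",
    "source_id": "AWI-ESM-3", "experiment_id": "picontrol",
    "member_id": "r1i1p1f1", "table_id": "Amon", "variable_id": "tas",
    "grid_label": "gn",
}
nested = output_dir("out", example_attrs, date(2026, 5, 15), shard_drs=True)
flat = output_dir("out", example_attrs, date(2026, 5, 15), shard_drs=False)
```

With the toggle on, `nested` reproduces the example directory from the commit message below the `out` root; with it off, the root is returned unchanged.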
…zed layout

build_html_report: render_file_card was showing variable-group bounds (whichever cadence happened to merge first) for every per-file card. 21 variables with multiple cadence rows displayed wrong "Expected" columns and range SVG. Prefer rec.expected_* over ent.expected_*.

sanity_check: walker glob() missed files under <root>/<tier>/cmorized/ (produced when the runner emits a DRS-shaped subtree). Switch to rglob() so the walker handles both flat and nested layouts.

run_walker_compute.sh: new sbatch wrapper to walk on compute nodes; login-node OpenBLAS pthread limit + Lustre contention crashes the worker pool on the big 8-19 GB 1hr/3D-model-level files.

run_perfile_maps_compute.sh: paths bumped to cli37 run.
build_maps: collapse all non-spatial dims (time + level + tile) by the
matching reducer per panel (min/min, mean/mean, max/max) so 3D fields
get full spatial coverage instead of mostly-NaN deep-level slices.
Other fixes: rglob() for nested cmorized/ layout, sibling-file lat/lon
recovery for hxy-si nod2 files that strip coords, dask-chunked open so
84-GB cl_day/pfull_day render without OOM, 2-98 percentile colorbar
clipping, prefer lat/lon over latitude/longitude for spatial-dim
discovery on 3D atm files.
build_html_report: per-file card now uses rec-level expected bounds
(was showing variable-group bounds, wrong for 21 cadence-split vars
incl. hfls, hfss, pr, prsn, rsds, rsus, uas, vas). NH/SH suffix on
hemispheric scalar files (siarea, siextent, sisnmass, sivol) with
North->South word substitution. FAIL diagnoses now append a
"mean-in-bounds, outlier-extremes" hint when the field mean sits
inside the expected window (e.g. hfls_mon FAIL is single-cell, not
field-wide).
sanity_check_ranges: recalibrate bounds with reviewer feedback —
phcint allow negatives (T<0degC at high lat), masscello to AWI-ESM
vertical discretization (5125..358750 kg/m^2), prra and siconc to
ice-zone/global walker semantics, cl to per-layer (~5%) not
column-integrated (~30%), umo/vmo/sfx/sfy reflect compute fix below.
custom_steps.compute_mass_transport, compute_salt_transport: integrate
across the FESOM cell edge by multiplying by sqrt(cell_area). Was
emitting kg/(s*m) but the CMIP7 spec for Omon.{umo,vmo,sfx,sfy}
requires kg/s. New _fesom_edge_width helper resolves the horizontal
dim against (nod2, ncells, ncol) and falls back to a clear error if
the mesh lacks cell_area. Order-of-magnitude check: typical FESOM HR
mid-lat cell now ~4.5e7 kg/s, strong-current cell ~1.8e9 kg/s.
cli37 walker output: 743 files, PASS 394 / WARN 251 / FAIL 96 /
ERROR 2.
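The unit fix for umo/vmo/sfx/sfy amounts to multiplying a per-unit-width flux by an edge width taken as sqrt(cell_area). The sketch below shows only that arithmetic; `edge_integrated_transport` is a hypothetical name, and the input numbers (a 10 km x 10 km cell, 4.5e3 kg/(s*m)) are assumptions chosen to land on the mid-latitude order of magnitude quoted above.

```python
import math


def edge_integrated_transport(flux_per_unit_width, cell_area):
    """Convert kg/(s*m) to kg/s using sqrt(cell_area) as the edge width."""
    if cell_area <= 0.0:
        raise ValueError("cell_area must be positive")
    return flux_per_unit_width * math.sqrt(cell_area)


# Assumed example values, not taken from model output.
transport = edge_integrated_transport(4.5e3, 1.0e8)
```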
09fe51e to
900c575
Compare
… lat/lon

Implements the cmor/CMIP7 corrections from the cli37 sea-ice review.

hxy-si rules now apply mask_where_no_seaice so the output matches CMIP7 cell_methods "area: time: mean where sea_ice (mask=siconc)". Previously prra/prsn (and all other rules using scale_pipeline / snm_pipeline / sispeed_pipeline / sistressave_pipeline / sistressmax_pipeline / siflcondtop_pipeline / sisnhc_pipeline / sitempbot_pipeline / sifb_pipeline / simpeffconc_pipeline / constant_field_pipeline / sisnhc_from_msnow_pipeline / fraction_to_percent_pipeline / sfdsi_from_fw_ice_pipeline) wrote the field over the full FESOM domain including the tropics. New scale_mask_pipeline / fraction_to_percent_mask_pipeline / sfdsi_from_fw_ice_mask_pipeline cover the cases where the original pipeline is shared with a non-hxy-si caller; the rest got mask_where_no_seaice added in-place since every consumer is hxy-si. The 17 direct-passthrough hxy-si rules (no pipelines: block) remain unmasked; the audit directive was scoped to "every hxy-si rule pipeline".

sisaltmass: scale_factor 0.004 -> 3.64. FESOM's m_ice is effective ice height per cell area (units 'm', long_name "ice height per unit area" in io_meandata.F90), not kg/m² as the prior rule comment assumed. The correct salt-mass conversion is (sice/1000) * m_ice * rho_ice = 0.004 * m_ice * 910 = 3.64 * m_ice. sice=4.0 verified from Final_CMIP7_IO_Test_06/config/fesom/namelist.ice; rho_ice=910 (AOMIP) from fesom-2.7 MOD_ICE.F90:61.

sidmasstran[xy]: compute_ice_mass_transport now does uice * m_ice * rho_ice * sqrt(cell_area) and emits kg/s. Reuses the existing _fesom_edge_width helper (added by 900c575 for umo/vmo/sfx/sfy). Previously the formula was uice * m_ice mislabelled "kg/s" (actually m²/s, ~5 orders of magnitude too low).

regrid_oifs_to_fesom: assign_coords lat/lon on nod2 after the isel. Source-grid lat/lon were attached to the now-removed source_dim; without re-attaching the FESOM lat/lon, the output had only nod2 as a dim and external tools (ushow, Panoply, ncview) couldn't render the field. The sanity-check map renderer already worked around this with _find_sibling_latlon; that workaround stays as a defence-in-depth fallback.

sisnthick: not added, because CMIP7 v1.2.2.2 has no sisnthick out_name. The CMIP7 equivalent in seaIce is snd (standard_name surface_snow_thickness, units m), already produced by core_seaice/snd and cap7_seaice/snd_day.
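The sisaltmass scale factor quoted above follows directly from the two constants in the commit message. The helper name below is hypothetical; the values sice=4.0 and rho_ice=910 are the ones cited from the FESOM namelist and source, and the 2 m thickness is an assumed example.

```python
def salt_mass_scale_factor(sice_permille, rho_ice):
    """Factor converting effective ice thickness (m) to salt mass (kg/m^2)."""
    return (sice_permille / 1000.0) * rho_ice


factor = salt_mass_scale_factor(sice_permille=4.0, rho_ice=910.0)

# Assumed example: 2 m effective ice thickness in one cell.
salt_mass = factor * 2.0
```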
…ealm

The walker recorded ``realm`` from the bounds-table lookup, which is keyed by out_name only. For out_names that exist in multiple CMIP realms (rlds in atmos+seaIce, prra, prsn, evspsbl, rsds, rsus, rlus, ts, ...) every branded variant got stamped with whichever realm the bounds table happened to store, typically the atm realm. The ``-hxy-si`` files then landed on ``atm.html`` even though their own ``:realm`` global attribute correctly said ``seaIce``.

worker_main now reads ``nc.realm`` and prefers it over the table value for the emitted JSONL record. The file's own CMIP global attrs are authoritative; the bounds-table realm is the fallback for ERROR or NOBOUNDS records that never got a realm written.

collapse() in build_html_report.py now keys VarEntry by (var, realm) instead of var alone, so a variable that appears under two realms splits into two cards routed to the right HTML page. Previously ``ent.realm`` was "first record wins", bundling all branded variants under whichever record happened to come first in the JSONL.

Reported by Nadine on the cli37 report (rlds-si, prsn-si, ts-si etc. showing up in atm.html with global data).
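The two changes described here, preferring the file's own realm and keying by (var, realm), can be sketched as below. The record layout, the file names and the function bodies are assumptions for illustration only; they are not the code of worker_main or collapse() in build_html_report.py.

```python
from collections import defaultdict


def pick_realm(file_realm, table_realm):
    """Prefer the file's own realm attribute; fall back to the table value."""
    return file_realm if file_realm else table_realm


def collapse(records):
    """Group records by (variable, realm) instead of by variable alone."""
    groups = defaultdict(list)
    for rec in records:
        realm = pick_realm(rec.get("file_realm"), rec.get("table_realm"))
        groups[(rec["var"], realm)].append(rec["path"])
    return dict(groups)


# Hypothetical JSONL records; the third mimics an ERROR record with no realm.
records = [
    {"var": "rlds", "file_realm": "atmos", "table_realm": "atmos", "path": "rlds_mon.nc"},
    {"var": "rlds", "file_realm": "seaIce", "table_realm": "atmos", "path": "rlds-si_mon.nc"},
    {"var": "rlds", "file_realm": None, "table_realm": "atmos", "path": "rlds_broken.nc"},
]
cards = collapse(records)
```

Keying on the pair means the sea-ice variant gets its own entry instead of being absorbed by whichever record arrives first.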
Each disabled rule emitted the same compound_name as an existing core or cap7_atm rule with divergent content. Same metadata in both files (long_name, standard_name, cell_methods, realm, units) so downstream tools couldn't tell the LPJ-GUESS value from the OIFS one. CMIP7 ships the OIFS coupled value as the canonical evspsbl/mrro/mrros/mrso/mrsol/snc/snd/snw. Re-enable under distinct CMIP7 names (e.g. evspsblveg for canopy, tran for transpiration) if LPJ-specific land-hydrology values are needed for vegetation analysis later.

core_land/areacella: kept enabled (duplicates core_atm but content is byte-identical and each tier needs its own copy for cell_measures resolution in stand-alone tier delivery; added comment noting this is intentional).

Audit:
    grep -h '^\s*compound_name:' awi-esm3-veg-hr-variables/*/cmip7_*.yaml | grep -v '^\s*#' | sort | uniq -c | awk '$1>1'
    -> 1 line (atmos.areacella, intentional)
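The duplicate audit can also be expressed in Python for use inside a test suite. This is a sketch, not part of the PR: the function name, the inline YAML snippets and the shortened compound names are made up for the example, and lines are scanned as text rather than parsed as YAML, mirroring the grep approach.

```python
import re
from collections import Counter

# Same anchoring as the grep: optional indentation, then the key itself.
COMPOUND_RE = re.compile(r"^\s*compound_name:\s*(\S+)")


def duplicated_compound_names(yaml_texts):
    """Return compound_name values that occur more than once."""
    counts = Counter()
    for text in yaml_texts:
        for line in text.splitlines():
            match = COMPOUND_RE.match(line)  # commented-out keys do not match
            if match:
                counts[match.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}


core_atm = """rules:
  - name: areacella
    compound_name: atmos.areacella
  - name: tas
    compound_name: atmos.tas
"""
core_land = """rules:
  - name: areacella
    compound_name: atmos.areacella
  - name: mrso
    # compound_name: land.mrso
"""
duplicates = duplicated_compound_names([core_atm, core_land])
```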
…sin/sltbasin lat attrs

Christian's cli37 review flagged four ocean-side recipe issues that fall on the pycmor side. All addressed here; the remaining items (wfo / obvfsq / wo / volcello / opottempdiff bounds) are either model-team questions or sanity-bound tuning, not recipe bugs.

hfds: FESOM `fh` is positive=UP (ice_oce_coupling.F90:418 writes ``heat_flux_in = heat_flux = -net_heat_flux``; density-flux equation at line 654 confirms positive `heat_flux_in` corresponds to ocean cooling/densification). CMIP7 `hfds` is positive=DOWN, so the rule now applies `scale_factor: -1.0` via scale_pipeline. Previously hfds was a bare passthrough, giving cli37 reviewers a North Atlantic that appears to GAIN heat in January.

msftbarot: f_min default 1e-5 → 2.5e-5 in compute_msftbarot. The geostrophic SSH approximation `psi = rho_0 * g * H / f * eta` breaks down within ~10° of the equator, and at f=1e-5 only |lat| < ~4° gets masked, leaving a band ±4-10° where f is small but finite and producing the "weird artifact near the equator" Christian flagged. f_min=2.5e-5 extends the NaN mask to ±~10°.

compute_mass_transport: vertical component (transport_component='z', used by wmo) now multiplies by `cell_area` instead of `dz * sqrt(cell_area)`. The horizontal-edge formula (dz * sqrt(cell_area), correct for umo/vmo) under-counted vertical flux by a factor of ~sqrt(cell_area)/dz ≈ 1e4/50 ≈ 200× at FESOM HR resolution.

hfbasin / sltbasin: `lat` coordinate variable was attached as a bare numpy array with no attributes; the written file then had `lat` without `standard_name`/`units`/`axis`. Added CF attrs on both compute_hfbasin_tripyview and compute_sltbasin_tripyview output paths.
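The latitude cut-offs quoted for f_min follow from the Coriolis parameter f = 2 * Omega * sin(lat). The short check below is a worked calculation only; `mask_latitude_deg` is a hypothetical helper and is not part of compute_msftbarot.

```python
import math

OMEGA = 7.2921e-5  # Earth's rotation rate in rad/s


def mask_latitude_deg(f_min):
    """Latitude (degrees) below which |f| = 2*OMEGA*sin(lat) is under f_min."""
    return math.degrees(math.asin(f_min / (2.0 * OMEGA)))


old_cutoff = mask_latitude_deg(1.0e-5)
new_cutoff = mask_latitude_deg(2.5e-5)
```

The two results sit close to 4° and 10°, consistent with the band described in the msftbarot paragraph.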
…iew)

difmxylo: HR-FESOM (DARS ~10km) Laplacian eddy mixing values are much lower than the coarse-resolution defaults the original bounds assumed (Christian: "for higher resolution values should be lower as far as I read up on this"). cli37 cmor observed mean ~0.01 m²/s, max ~1.6 m²/s; bound expected_mean was ~1000, max ~1e4 (~5 orders too high). New bounds 0 / ~0.01 / ~5 m²/s give PASS on cli37 stats and stay narrow enough that a 100× regression would still WARN.

sfdsi: Christian: "pattern looks good - not sure why the bounds fail here". cli37 cmor observed min -1.6e-4, max 6e-5 kg m-2 s-1; previous bound min/max ±1e-5 was too narrow. Arctic freezing bursts in fully-resolved sea-ice can hit ~1e-4 kg m-2 s-1 even though the global mean stays near zero. Widened to ±5e-4 (~5× the observed peak).
Recipe commit e7306e2 added scale_factor: -1.0 to hfds (FESOM `fh` is positive=up; CMIP positive=down). The bounds table still encoded the pre-flip expectation:
- expected_mean=~2 (typical positive piControl drift)
- bounds ±300 W/m²

Post-flip, observed mean becomes negative (cli37 was +14.2 → post-fix -14.2). The classify() sign-mismatch check at sanity_check.py:337-338 auto-FAILs when expected_mean > 0 and observed < 0, so the bounds need updating regardless of magnitude.

Also widening to ±500: cli37 max=1910 W/m² (post-flip min=-1910) exceeds ±300, and CMIP6 literature notes per-cell monthly peaks reach ±500 W/m² in deep-convection / strong air-sea contrast regions. Beyond ±500 stays a FAIL, as a sentinel for outlier cells that warrant investigation (1910 W/m² is real but extreme).

expected_mean ~2 → ~0: piControl steady state has global mean near 0; small drift in either direction is fine. The em==0.0 branch of classify() (line 328) tolerates |mean|/envelope up to 10% as PASS.
Christian's cli37 review flagged hfbasin as FAIL on max=7.6 PW vs the ±2 PW bound. cli37 inspection shows the recipe is correct: annual climatology matches Trenberth almost exactly (global peak +1.96 PW @ 16°N; Atlantic +1.08 PW @ 16°N vs Trenberth ~1.3 PW). The 7.6 PW is a monthly Indo-Pacific extreme, i.e. the basin-sum peak across (lat, time).

Literature for monthly hfbasin extremes:
- Trenberth & Caron 2001: global annual peak ~2 PW @ 35°N
- RAPID 26.5°N Atlantic monthly range 0.2-2.5 PW (Johns 2011)
- Pacific tropical cell NHT 1.75±0.30, SHT -1.69±0.55 PW
- North Pacific 24°N seasonal cycle 0 → 1.1 PW (Bryden et al.)
- HR mesoscale eddies add 30-50% over coarse models

Combined: global = basin-sum at peak latitude reaches 5-8 PW monthly extreme in HR models. cli37's 7.6 PW max sits within this range. Widen bounds to ±10 PW so monthly extremes PASS while a 2× regression still WARNs and 5× regression FAILs.
- nep: -5e-6 / 2e-7 (was -5e-8 / 5e-8). Central America wet-tropics drainage spikes are real model output (cli37 min -1.93e-6, max 1.01e-7)
- mrrob: max 5e-3 (was 1e-4). Singular grid-cell spikes are real, global field fine (cli37 max 2.34e-3)
- fHarvestToAtmos: ~1e-10 mean / ~1e-8 max (was ~0/~0). 1850 LUH3 state has cropland+pasture that keeps being harvested every year
- tsl: floor 150 K (was 220 K). LPJ-GUESS shallow-layer tracks OpenIFS forcing when uninsulated; 1.5% of values <220 K in NH winter
- vegHeight: rationale text fixed. LPJ-GUESS doesn't emit grass-only height, cmor falls back to tree-dominated field

Reclassify on test06 JSONL: 5 FAILs cleared (2 nep, 2 fHarvestToAtmos, 1 mrrob FAIL->WARN).

Three "should be zero" vars (fAnthDisturb, fNAnthDisturb, fNfert) NOT relaxed despite cli37 showing 1e-10..1e-8 values. Laszlo says these should be truly ~0 under frozen 1850 LU; spawned as D6 investigation.
…source
Three new pipeline steps in examples/custom_steps.py:
clip_small_negatives — sets [-thr, +thr] to 0 (default thr=1e-10).
Clears tiny float-noise that propagates from
raw .out files. Reviewer (Laszlo) confirmed
these are not a bug to chase upstream.
clip_floor_zero — floors negatives at 0. For variables whose
physical floor is 0 but where the pycmor
pipeline introduces negatives not present in
the raw model output (mrsol, rhSoil per
Laszlo).
broadcast_yearly_to_monthly
— repeats each yearly value across 12 mid-month
time stamps. Used where the native monthly
file is LAI/phenology-weighted and the yearly
file is the authoritative stand-area source.
Pipeline wiring:
Phase B (clip_small_negatives inserted after loader in 4 pipelines):
cap7_land: lpjg_monthly_pipeline
veg_land: lpjg_monthly_pipeline, lpjg_monthly_lut_pipeline,
mrtws_pipeline
Phase C (new dedicated pipelines, 2 rules rewired):
cap7_land: lpjg_monthly_clip0_pipeline <- rhSoil_mon
lpjg_monthly_depth_clip0_pipeline <- mrsol_mon
Both add clip_floor_zero on top of clip_small_negatives.
D4a (new pipeline, treeFrac total source switch):
cap7_land: lpjg_yearly_to_monthly_pipeline <- treeFrac_mon
treeFrac_mon now sources treeFrac_yearly.out (was
treeFrac_monthly.out). The native monthly file is
LAI-weighted; the yearly file is correct stand area.
variable_attributes.comment added on treeFrac_mon, rhSoil_mon, mrsol_mon
documenting the derivation / clipping for downstream traceability.
The per-PFT tree fractions (treeFracBdlDcd, treeFracBdlEvg,
treeFracNdlDcd, treeFracNdlEvg) remain on the broken monthly source
pending D4c — Laszlo signoff on the synthesis math AND/OR a possible
LPJ-GUESS upstream fix he pushed today emitting yearly per-PFT files.
See HANDOFF_d4_treeFrac_per_pft.md.
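The three pipeline steps described at the top of this commit can be sketched on plain Python lists. These are illustrative stand-ins that reuse the step names for readability; they are not the implementations in examples/custom_steps.py, which operate on xarray data, and the sketch labels months by number instead of writing mid-month time stamps.

```python
def clip_small_negatives(values, thr=1e-10):
    """Set every value inside [-thr, +thr] to 0; leave the rest untouched."""
    return [0.0 if -thr <= v <= thr else v for v in values]


def clip_floor_zero(values):
    """Floor negative values at 0."""
    return [max(v, 0.0) for v in values]


def broadcast_yearly_to_monthly(yearly):
    """Repeat each (year, value) pair as 12 (year, month, value) rows."""
    return [(year, month, value) for year, value in yearly for month in range(1, 13)]


# Assumed toy data: two float-noise values, one real negative, one positive.
noisy = [-3e-11, 5e-11, -0.2, 0.7]
denoised = clip_small_negatives(noisy)
floored = clip_floor_zero(denoised)
monthly = broadcast_yearly_to_monthly([(1850, 0.42), (1851, 0.40)])
```

Note that the first step leaves the genuine negative (-0.2) alone; only the second step, used for variables with a physical floor of 0, removes it.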
PLAN_veg_land_sanity_fixes.md — triages Laszlo + Christian feedback
into 5 phases (A bounds, B clip-band, C sign-fix, D investigations,
E verify) with explicit execution order. Round-1 review folded in.
HANDOFF_d4_treeFrac_per_pft.md — investigation of treeFracNdlDcd
annual-cycle bug. Documents:
- root cause: LPJ-GUESS monthly per-PFT output is LAI/phenology
weighted, not stand area
- file inventory: treeFrac_yearly.out exists (total), per-PFT
yearly files do NOT
- four-option matrix (A skip / B synthesize / C max-only /
D wait-for-upstream)
- Option B math + per-cell normalization + divide-by-zero handling
- D4b ratio sanity check on cli37: median 1.010, 88% within ±20%
— math premise sound
- escalation channels + comment-attribute draft
REVIEW_veg_land_sanity_fixes_round1.md — external review of the plan;
folded into PLAN.
REVIEW_d4_treeFrac_per_pft.md — external review of the handoff;
folded into HANDOFF.
Three "should be zero" variables spawned as D6 from Phase A data
anchor: fAnthDisturb, fNAnthDisturb, fNfert have 1e-10..1e-8 values
in cli37 despite frozen 1850 LU — pending Laszlo round-3 investigation.
cli37 wo file inspection: olevel = [0, 5, 10, 20, 30, ..., 6250] —
i.e. mesh.depth_bnds values (interface depths), not mesh.depth
(midpoints), even though olevel:name = "nz1". FESOM 2.7's
io_meandata.F90:1520 defines the w stream on the top N layer interfaces
(N=57 for DARS), with the surface at z=0 and the seabed BC implicit.
The reviewer's "uppermost layer not too bad, those below are noisy"
is the surface BC literally preserved (w=0 at interface 0) while every
deeper interface carries diagnostic-w noise from integrating
horizontal divergence down from the surface. The recipe was a bare
passthrough that emitted these interfaces verbatim with a misleading
midpoint-style coord name.
New compute step ``average_w_interfaces_to_midpoints``:
midpoint[i] = 0.5 * (w[i] + w[i+1]) for i = 0 … N-2
midpoint[N-1] = 0.5 * w[N-1] (bottom BC w_seabed=0)
The vertical coord is replaced with mesh.depth (CMIP midpoint axis)
and the dim renamed to nz1. ``compute_mass_transport`` already does
the same averaging internally for ``wmo``; this factors it out for
use in the new wo_pipeline.
Effects:
- Surface BC folded into first midpoint → no jump between layer 0
and 1; the "clean top" outlier disappears.
- Output on 57 cell-centre midpoints, matching CMIP wo convention.
- Vertical coord values switch from interface depths to midpoint
depths.
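The midpoint formula given above translates directly into code. The sketch below works on one vertical column held in a Python list and is only an illustration of the formula; the real ``average_w_interfaces_to_midpoints`` step also swaps the vertical coordinate and renames the dimension, which is not shown.

```python
def average_w_interfaces_to_midpoints(w):
    """Average N interface values to N midpoints, assuming w = 0 at the seabed."""
    if not w:
        raise ValueError("need at least one interface value")
    n = len(w)
    # midpoint[i] = 0.5 * (w[i] + w[i+1]) for i = 0 .. N-2
    midpoints = [0.5 * (w[i] + w[i + 1]) for i in range(n - 1)]
    # midpoint[N-1] = 0.5 * w[N-1], using the bottom boundary condition w = 0
    midpoints.append(0.5 * w[n - 1])
    return midpoints


# Toy column: surface interface carries w = 0, deeper interfaces do not.
column = average_w_interfaces_to_midpoints([0.0, 2.0, 4.0])
```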
…fNAnthDisturb, fNfert

LUH3 1850 wood-harvest transitions (primf/secmf/secnf_harv) and a small implicit LPJ-GUESS management N baseline make these fluxes non-zero in piControl.
The OIFS branch feature/cmip7-rh-online (merged into local_combined_fixes 2026-04-17) computes CMIP/CF-conformant relative humidity online every timestep using Alduchov-Eskridge Magnus with the hard water/ice switch at 273.15 K, and sends `hur_cmip7` (model levels) and `hurs_cmip7` (2m) to XIOS. Until now those sends were dropped because no XIOS field declared the IDs.

field_def_cmip7.xml.j2:
- declare raw `hurs_cmip7` (fraction) in 2D_physical
- declare raw `hur_cmip7` (fraction) in 3D_ml
- add percent alias `near_surface_relative_humidity_pct__hurs` (name="hurs")
- repoint `relative_humidity_ml__hur` from `r` (FOEEWM mixed-phase QSAT, NOT CMIP-conformant) to `hur_cmip7`

file_def_oifs_cmip7_spinup.xml.j2:
- write `hurs` into atmos_mon (monthly surface)
- write `hurs` into atmos_day (daily surface)
- atmos_mon_ml already references relative_humidity_ml__hur, so it picks up CMIP7-conformant hur automatically

Addresses Felix Pithan's 2026-04-15 feedback: IFS-native r/r_pl uses mixed-phase QSAT interpolation between RTICE and RTWAT (not CMIP-CF), and post-hoc RH from monthly-mean ta/hus is biased due to nonlinear e_sat(T). Computing online and averaging downstream fixes both.

Pycmor-side: hurs_pipeline (post-hoc Magnus from 2t+2d) is left in place for now; can be retired once new XIOS hurs output is validated.
…ve ice pipelines

Commit 5667ab2 added mask_where_no_seaice broadly to every hxy-si rule. That was the right fix for prra (Christian's explicit complaint: "rule cell_methods: where sea_ice is not applied - one sees a global field instead of a polar field") and for the ~9 atm-regridded rules (rlds/rlus/rsds/rsus_seaice, sbl_seaice, sifllattop, siflsenstop) that carry atmospheric fluxes over the entire surface and need masking to the ice domain for CMIP cell_methods compliance.

For the remaining 13 FESOM-native ice pipelines, the underlying FESOM field is intrinsically zero (or NaN) outside the ice zone; the model only updates these where ice exists. Masking those is at best cosmetic (0 -> NaN) and at worst causes downstream dask graph stalls: in cli38 the lrcs_seaice tier OOM'd at MaxRSS 525 GiB on a 512 GiB cgroup; halving N_WORKERS produced an identical 525 GiB peak (memory is per-rule, not per-worker), and halving SHARD_SIZE to spread the load eliminated the OOMs but exposed a save_dataset wedge in sivol/sihc/snm/siflfwbot/etc. with no I/O progress for 45+ minutes. The dask graph for `data.where(mask)` was not being released before save_dataset, deadlocking the worker pool.

Removed mask step from these pipelines:
    scale_mask_pipeline, sfdsi_from_fw_ice_mask_pipeline,
    fraction_to_percent_mask_pipeline, snm_pipeline, sisnhc_pipeline,
    sispeed_pipeline, sistressave_pipeline, sistressmax_pipeline,
    siflcondtop_pipeline, sitempbot_pipeline, sifb_pipeline,
    simpeffconc_pipeline, constant_field_pipeline.

Kept mask in these pipelines (genuinely needed for atm-regridded fields):
    regrid_atm_to_fesom_seaice_mask_pipeline
    regrid_atm_to_fesom_seaice_mask_negate_pipeline

Rules affected by the removal still inherit their previous correctness fixes from 5667ab2 (scale_factor on sisaltmass, cell-width factor on sidmasstran[xy], lat/lon attachment); only the unnecessary mask step is gone.
Upstream release Dec 19 2025 (v1.2.2.2 was Sep 30 2025). Same 1974
variables in both, but the compound_name schema changed in three ways:
- region suffixes lowercased: GLB/NH/SH -> glb/nh/sh
(552 .GLB + 8 .NH + 8 .SH renames across 17 rule yamls)
- ATA/GRL polar codes lowercased: ata/grl (no project rules affected)
- 13 real compound_name schema renames touching this project:
- atmosChem.{cfc11,cfc12,ch4,n2o}.tavg-u-hm-u -> .tavg-u-hm-air
- land.tas.tavg-u-hxy-u.1hr -> .tavg-h2m-hxy-u.1hr (glb + 30S-90S)
- land.tslsi.tavg-u-hxy-u.day -> .tavg-u-hxy-lsi.day
- seaIce.sisnmass.tavg-u-hm-u -> .tavg-u-hm-si (day/mon x nh/sh)
- atmos.ts.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day
- land.esn.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day
Source default in std_lib/global_attributes.py:136 updated from "GLB"
to "glb" to match the new lowercase convention when get_region() falls
back to the default (rule paths already supply region explicitly via
compound_name).
Validation: all 17 rule yamls construct CMORizer cleanly against the
new metadata.json; 548/552 compound_names resolve in v1.2.2.3 (the 4
unmatched are Laszlo's per-PFT yearly diagnostic rules, extra-CMIP).
Metadata cache populated at:
~/.cache/pycmor/cmip7_metadata/v1.2.2.3/metadata.json (kept v1.2.2.2)
Substantive content edits in v1.2.2.3 (for variables we use):
- atmos.{hurs,rsds,sfcWind}.tavg-...-1hr.30S-90S: cleaner CF-standard
long_name values (long_name was internal text in v1.2.2.2).
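The region-suffix lowercasing is a mechanical rename and can be sketched as a small string helper. `lowercase_region` is a hypothetical name for this example, the set of codes is taken from the list above, and the second sample name is constructed from the rename table rather than copied from a rule file.

```python
# Region codes named in the commit message as being lowercased.
REGION_CODES = {"GLB", "NH", "SH", "ATA", "GRL"}


def lowercase_region(compound_name):
    """Lowercase the trailing region code of a dotted compound_name."""
    head, sep, region = compound_name.rpartition(".")
    if sep and region in REGION_CODES:
        return head + sep + region.lower()
    return compound_name


renamed = lowercase_region("land.treeFracBdlEvg.tavg-u-hxy-u.yr.GLB")
untouched = lowercase_region("land.tas.tavg-h2m-hxy-u.1hr.30S-90S")
```

Only the last dotted component is touched, so branding strings and latitude-band regions such as 30S-90S pass through unchanged.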
cli39 veg_land shards 2/3/4 failed with
ValueError: Rule 'treeFracBdlEvg_yr' with
compound_name='land.treeFracBdlEvg.tavg-u-hxy-u.yr.GLB' did not match any
variables in the CMIP7 data request
CMIP7 only has the per-PFT treeFrac variants at MONTHLY cadence:
Emon.treeFracBdlDcd / Emon.treeFracBdlEvg / Emon.treeFracNdlDcd / Emon.treeFracNdlEvg
No yearly Eyr.treeFrac{BdlDcd,BdlEvg,NdlDcd,NdlEvg} entries exist. The monthly
counterparts are already wired in cap7_land/cmip7_awiesm3-veg-hr_cap7_land.yaml
and cover the CMIP7 requirement.
Removes:
- treeFracBdlDcd_yr
- treeFracBdlEvg_yr
- treeFracNdlDcd_yr
- treeFracNdlEvg_yr
Keeps the aggregate treeFrac_yr (maps to valid Eyr.treeFrac).
cli39 extra_land (job 25003750_1) hit 3h walltime with 4 save_dataset ops stuck at heartbeat #16:
    ⟳ save_dataset[tas] still running (t=960s, #16)
    ⟳ save_dataset[mrsow] still running (t=960s, #16)
    ⟳ save_dataset[dslw] still running (t=960s, #16)
    ⟳ save_dataset[orog] still running (t=960s, #16)

All four are long-running surface saves (1-hourly tas SH-subset, daily mrsow/dslw via swvl1..swvl4 + temporal_diff, fx orog SH-subset). With extra_land's 18 rules all in a single shard at SHARD_SIZE=20, these four kicked off parallel saves that contended on the synchronous netcdf write scheduler. 9 saves completed cleanly, then the 4 wedged.

Lower SHARD_SIZE specifically to 5 for extra_land. This splits 18 rules into ~4 shards, distributing the contended saves across them. Each shard's individual save peak stays small enough to finish inside 3h. Other tiers keep the global SHARD_SIZE=20.
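The shard arithmetic can be checked with a short sketch. The helper names, the override table and the placeholder rule names are assumptions for this example; the actual sharding lives in the submit script and is not reproduced here.

```python
GLOBAL_SHARD_SIZE = 20
# Hypothetical per-tier override table mirroring the change described above.
SHARD_SIZE_OVERRIDES = {"extra_land": 5}


def shard_size_for(tier):
    """Return the tier-specific shard size, else the global default."""
    return SHARD_SIZE_OVERRIDES.get(tier, GLOBAL_SHARD_SIZE)


def split_into_shards(rules, shard_size):
    """Cut the rule list into consecutive chunks of at most shard_size."""
    return [rules[i:i + shard_size] for i in range(0, len(rules), shard_size)]


rules = [f"rule_{i:02d}" for i in range(18)]  # placeholder rule names
before = split_into_shards(rules, GLOBAL_SHARD_SIZE)
after = split_into_shards(rules, shard_size_for("extra_land"))
```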
Both tiers hit TIMEOUT in cli40 due to multiple save_dataset operations contending on the global HDF5 write lock with no progress at heartbeat #87+ over 3h.

Sequence in lrcs_seaice_3:
    siarea 176×, siextent 89×, sisnmass 87×, sfdsi 87×, sihc 86×, sidmasstrany 86×
veg_land_1 wedged on vegHeight/dgw/tslsi/esn at heartbeat #21.

Wall-vs-Σsave analysis (cli40 successful shards) shows these two tiers already run nearly serially under lock contention:
    lrcs_seaice_4: ratio 1.18× (12% wall-time penalty if forced serial)
    lrcs_seaice_2: 1.8×
Forcing strict serial eliminates the wedge with negligible cost. 3D-heavy tiers (cap7_atm 4.1×, core_ocean 3.7×, extra_atm 1.8×) genuinely need parallel saves to stay inside 3h walltime, so they are NOT modified.

Mechanism: every pipeline in the two yamls now declares
    throttle_group: <tier>_serial
and the pycmor: section adds
    throttle_caps:
      <tier>_serial: 1
which caps in-flight rules of that group at 1 per batch. For lrcs_seaice this also subsumes the previous oifs_regrid cap=2 (which was set for driver-RSS, not the lock); cap=1 trivially satisfies it.
cli41 logs showed "Throttle caps (per-group rule submission limit): none" in both lrcs_seaice and veg_land. The yaml-level `throttle_caps` block I added in 5a61ad3 got silently dropped by the Everett-based PycmorConfigManager (only Options declared in PycmorConfig are preserved through schema normalisation; unknown keys vanish). Result: default per-group cap of 2 applied, lrcs_seaice still wedged on 9+ parallel save_dataset operations:
    cli41 _1: TIMEOUT (heartbeat #16, 9 saves stuck)
    cli41 _3: TIMEOUT (same pattern)

The cmorizer `_resolve_throttle_caps()` already reads PYCMOR_THROTTLE_CAPS from the environment as a higher-priority source than the yaml. Move the per-tier cap declaration to the launcher.

submit_hr_year_shards.sh:
- New per-tier `tier_throttle_caps` case:
    lrcs_seaice → "lrcs_seaice_serial:1"
    veg_land → "veg_land_serial:1"
    (others) → ""
- Threaded into `--export=...,PYCMOR_THROTTLE_CAPS=$tier_throttle_caps`

Yaml cleanup:
- Removed the dead `throttle_caps:` block from both yamls
- Kept the explanatory comment so the next reader knows where the cap actually lives (the launcher)

Pipeline-level `throttle_group:` annotations from 5a61ad3 are unchanged and are still read by Everett (declared on the Pipeline schema).
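The launcher-to-cmorizer hand-off can be pictured as a string that one side builds and the other parses. The sketch below is hypothetical on both ends: it is not `_resolve_throttle_caps()`, and the comma separator for several groups is an assumption, since the commit message only shows single `group:cap` values.

```python
def tier_throttle_caps(short_tier):
    """Launcher side: mirror the per-tier case statement described above."""
    serial_tiers = {"lrcs_seaice", "veg_land"}
    return f"{short_tier}_serial:1" if short_tier in serial_tiers else ""


def parse_throttle_caps(raw):
    """Consumer side: parse 'group:cap[,group:cap...]' into a dict.

    The comma-separated form is an assumption made for this sketch.
    """
    caps = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # empty string means "no caps for this tier"
        group, _, cap = item.partition(":")
        caps[group.strip()] = int(cap)
    return caps


seaice_caps = parse_throttle_caps(tier_throttle_caps("lrcs_seaice"))
ocean_caps = parse_throttle_caps(tier_throttle_caps("core_ocean"))
```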
cli43 lrcs_seaice_3 TIMEOUT despite cap=1 being correctly applied:
Throttle caps: {'lrcs_seaice_serial': 1}
... but 5 saves concurrent and stuck at heartbeat #101+
Root cause: `_rule_throttle_group()` only walked `rule.pipelines[*]`.
20 of 64 lrcs_seaice rules (siarea_north, siextent_north_day,
siflcondbot, sidconcth, sitimefrac, ...) have no `pipelines:` key —
they use pycmor's default pipeline, which carries no throttle_group,
so they landed in `_unthrottled` and ran in parallel through the
HDF5 global write lock.
Two-part fix:
1. cmorizer.py: read rule.throttle_group first, fall back to pipeline
annotation. Rule-level wins (per-rule override), preserves existing
pipeline-level behaviour.
2. lrcs_seaice + veg_land yamls: add `throttle_group: <tier>_serial`
to the `inherit:` block. With validate.py:299 already accepting
`throttle_group` on the rule schema, inherit propagates it to every
rule — pipelined or not.
Effect on cli44+: all 64 lrcs_seaice rules (and all veg_land rules)
join their tier's serial group; PYCMOR_THROTTLE_CAPS env var caps it
at 1; the multi-rule HDF5-lock wedge can no longer form.
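The lookup order of the two-part fix (rule-level first, then pipeline annotation, then `_unthrottled`) can be sketched with two small dataclasses. The `Rule` and `Pipeline` classes here are stand-ins written for this example and do not reflect pycmor's real classes or the exact body of `_rule_throttle_group()`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pipeline:
    name: str
    throttle_group: Optional[str] = None


@dataclass
class Rule:
    name: str
    throttle_group: Optional[str] = None
    pipelines: List[Pipeline] = field(default_factory=list)


def rule_throttle_group(rule):
    """Rule-level group wins; otherwise use the first pipeline annotation."""
    if rule.throttle_group:
        return rule.throttle_group
    for pipeline in rule.pipelines:
        if pipeline.throttle_group:
            return pipeline.throttle_group
    return "_unthrottled"


# A rule with no pipelines: key, before and after the inherit: change.
before_fix = Rule(name="siarea_north")
after_fix = Rule(name="siarea_north", throttle_group="lrcs_seaice_serial")
pipelined = Rule(name="sivol", pipelines=[Pipeline("scale_pipeline", "lrcs_seaice_serial")])
```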
esm_tools commit f3c2a3f9 added a regridded-output rename script that
produces "<var>.fesom.gr.<year>.nc" alongside the legacy native
"<var>.fesom.<year>.nc" in outdata/fesom/ (e.g. Test_v342_1y_01).
The existing pycmor patterns
pattern: <var>\.fesom\..*\.nc
would now silently match BOTH native and gr files. load_mfdataset
would then try to concat a 3.15M-nod2 unstructured dataset with a
360x720 regular lat/lon dataset and either error or, worse, produce
garbage.
Tighten the pattern to require a 4-digit year right after `.fesom.`:
pattern: <var>\.fesom\.\d{4}\.nc
This matches the native filenames only (e.g. `sst.fesom.1586.nc`).
The regridded filenames (`sst.fesom.gr.1586.nc`) have an extra `.gr.`
segment between `fesom` and the year, so they are now unambiguously
excluded from these `gn` rules.
Mechanical change: replaced 221 occurrences of `\.fesom\..*\.nc` across
the 7 FESOM-ingesting tier yamls:
lrcs_seaice 97
lrcs_ocean 64
core_ocean 23
cap7_seaice 13
cap7_ocean 13
core_seaice 10
veg_seaice 1
Behaviour for runs that do not have regridded output (pre-Test_21)
is unchanged: only native files exist, the regex still matches them.
The future `_gr` mirror tiers will use the parallel pattern
`\.fesom\.gr\.\d{4}\.nc`, which is disjoint from this one.
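The effect of the tightened pattern can be checked with Python's `re` module. The sketch assumes whole-filename matching via `re.fullmatch`; how pycmor applies the pattern internally is not stated in the commit message, so treat that as an assumption.

```python
import re

OLD_PATTERN = r"sst\.fesom\..*\.nc"
NEW_PATTERN = r"sst\.fesom\.\d{4}\.nc"
GR_PATTERN = r"sst\.fesom\.gr\.\d{4}\.nc"


def matches(pattern, filename):
    """True when the whole filename matches the pattern."""
    return re.fullmatch(pattern, filename) is not None


native = "sst.fesom.1586.nc"
regridded = "sst.fesom.gr.1586.nc"

old_hits = [name for name in (native, regridded) if matches(OLD_PATTERN, name)]
new_hits = [name for name in (native, regridded) if matches(NEW_PATTERN, name)]
gr_hits = [name for name in (native, regridded) if matches(GR_PATTERN, name)]
```

The old pattern picks up both files, while the new gn and gr patterns each select exactly one, which is the disjointness the commit relies on.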
Add a launcher-side preprocessor that derives a `_gr` companion of every
FESOM-ingesting tier yaml at submit time. No parallel `_gr/` source tree
in the repo — the gn yaml stays the single source of truth and the gr
yaml is regenerated each run.
generate_gr_yaml.py (new):
Filter: keeps rules whose primary input pattern references `fesom`;
drops mesh-derived fx rules and OIFS-source rules
(no gr equivalent under this scheme).
Rewrite: replaces `\.fesom\.\d{4}\.nc` → `\.fesom\.gr\.\d{4}\.nc`
recursively across every string value in the yaml tree
(catches primary `pattern:` and secondary attrs like
`aice_pattern:` uniformly).
Inherit: grid_label → gr
grid → "regular 0.5° lat/lon ..."
nominal_resolution → "50km"
(CMIP7 CV-bin for the 0.5° area-weighted √mean cell area
of ~44 km. CV values are discrete: see
CMIP7-CVs/project/native-nominal-resolution.json)
Name: general.name gets " (gr)" suffix for log clarity.
submit_hr_year_shards.sh: WITH_GR={yes,y,true,1,on} env var.
When set, after `repoint_hr_year.py` populates $YAMLS_DIR, scan for
yamls that reference `.fesom.` files and emit a sibling `<tier>_gr.yaml`
into the same dir. The existing per-yaml submit loop then picks them
up automatically as separate tiers (jobname suffix `_gr`), sharded and
sbatched identically to gn.
Smoke-tested across all 7 FESOM-ingesting tiers in the repo:
  core_ocean:  22 / 28 rules kept
  cap7_ocean:   7 / 7
  lrcs_ocean:  54 / 55
  core_seaice:  9 / 9
  cap7_seaice:  9 / 9
  lrcs_seaice: 53 / 64 (drops are OIFS-regrid rules with no gr source)
  veg_seaice:   1 / 1
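The Filter / Rewrite / Inherit / Name steps described for generate_gr_yaml.py could look roughly like the sketch below. This is not the script itself: the function names, the `rules` / `inputs` / `pattern` layout of the config dict, and the sample rules are assumptions for illustration. The fx-rule filter is not modelled, and the `grid` description is left out because its full text is not given here.

```python
GN_TOKEN = r"\.fesom\.\d{4}\.nc"
GR_TOKEN = r"\.fesom\.gr\.\d{4}\.nc"


def rewrite_strings(node):
    """Return a copy of node with GN_TOKEN replaced in every string value."""
    if isinstance(node, str):
        return node.replace(GN_TOKEN, GR_TOKEN)  # literal replace, not a regex
    if isinstance(node, dict):
        return {key: rewrite_strings(value) for key, value in node.items()}
    if isinstance(node, list):
        return [rewrite_strings(item) for item in node]
    return node


def primary_pattern_references_fesom(rule):
    """Assumed layout: the first entry of rule['inputs'] holds the primary pattern."""
    inputs = rule.get("inputs") or [{}]
    return "fesom" in inputs[0].get("pattern", "")


def derive_gr(config):
    """Build the gr companion of a gn tier config without mutating the input."""
    kept = [r for r in config.get("rules", []) if primary_pattern_references_fesom(r)]
    gr = rewrite_strings({**config, "rules": kept})
    # Other inherit keys (e.g. throttle_group) are carried over unchanged.
    gr["inherit"] = {**gr.get("inherit", {}), "grid_label": "gr", "nominal_resolution": "50km"}
    general = dict(gr.get("general", {}))
    general["name"] = general.get("name", "") + " (gr)"
    gr["general"] = general
    return gr


# Hypothetical two-rule tier: one FESOM rule, one non-FESOM rule.
gn_config = {
    "general": {"name": "core_seaice"},
    "inherit": {"grid_label": "gn", "throttle_group": "core_seaice_serial"},
    "rules": [
        {
            "name": "siconc",
            "inputs": [{"path": "/data", "pattern": r"a_ice\.fesom\.\d{4}\.nc"}],
            "aice_pattern": r"a_ice\.fesom\.\d{4}\.nc",
        },
        {"name": "tas", "inputs": [{"path": "/data", "pattern": r"atmos_.*\.nc"}]},
    ],
}
gr_config = derive_gr(gn_config)
```

Because the rewrite walks every string value, secondary attributes such as `aice_pattern` are updated by the same pass as the primary `pattern`, and the gn config stays untouched as the single source of truth.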
cli45 core_seaice_1 TIMEOUTed on Test_v342_1y_01 with the now-familiar
HDF5-write-lock wedge pattern: 1 save completed (simass 2s), then 7 in
parallel all stuck @ heartbeat #16:
  siconc, ts, snd, siv, siu, sitimefrac, sithick

The three remaining FESOM-ingesting seaice tiers (core_seaice,
cap7_seaice, veg_seaice) never got the inherit-throttle treatment we
applied to lrcs_seaice / veg_land. cli42/44 happened to pass them
because the cluster was uncontended on those days; cli45 hit a busier
node and the latent vulnerability surfaced.

Mirror the lrcs_seaice/veg_land fix:
  yaml side: `inherit: throttle_group: <tier>_serial`. Rule-level
             fallback in `_rule_throttle_group()` (commit 9d4579c)
             propagates the group to every rule, pipelined or not.
  launcher:  `PYCMOR_THROTTLE_CAPS=<tier>_serial:1` per tier, joining
             the existing lrcs_seaice/veg_land case in
             submit_hr_year_shards.sh.

veg_seaice cli45 actually died from a separate Prefect sqlite race
(`OperationalError: database is locked` in send_telemetry_heartbeat),
not the lock wedge, but the tier shares the same architecture, so
adding the throttle now is preventative.

Wall-time cost is small: empirical data from cli44 puts lrcs_seaice's
serial penalty at 12% for shard_4 and 0% for shard_3 (which had been
TIMEOUT before). Same shape expected here.
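To illustrate why a cap of 1 on a throttle group serialises the saves, here is a minimal sketch using one semaphore per group. It is not pycmor's implementation: the `Throttle` class, `parse_caps`, and the comma-separated `group:cap` spec format are assumptions, and `time.sleep` stands in for the netCDF write. The fallback cap of 2 follows the hardcoded default described for the launcher.

```python
import threading
import time

DEFAULT_CAP = 2  # fallback when a group has no explicit cap


def parse_caps(spec):
    """Parse 'group:cap[,group:cap...]' (assumed format) into a dict."""
    caps = {}
    for item in filter(None, (part.strip() for part in spec.split(","))):
        group, _, cap = item.rpartition(":")
        caps[group] = int(cap)
    return caps


class Throttle:
    """Hands out one bounded semaphore per throttle group."""

    def __init__(self, caps, default_cap=DEFAULT_CAP):
        self._caps = caps
        self._default = default_cap
        self._semaphores = {}
        self._lock = threading.Lock()

    def slot(self, group):
        with self._lock:  # create each group's semaphore exactly once
            if group not in self._semaphores:
                cap = self._caps.get(group, self._default)
                self._semaphores[group] = threading.BoundedSemaphore(cap)
            return self._semaphores[group]


def peak_concurrency(throttle, group, n_tasks):
    """Run n_tasks dummy saves and report the highest number running at once."""
    active = 0
    peak = 0
    counter_lock = threading.Lock()

    def fake_save():
        nonlocal active, peak
        with throttle.slot(group):
            with counter_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for the file write
            with counter_lock:
                active -= 1

    threads = [threading.Thread(target=fake_save) for _ in range(n_tasks)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak


caps = parse_caps("core_seaice_serial:1")
serial_peak = peak_concurrency(Throttle(caps), "core_seaice_serial", 7)
```

With the cap set to 1, the seven saves that wedged in parallel would instead run one after another; a group without an entry falls back to the default cap.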
Two bugs in the WITH_GR=yes path discovered on cli46 first attempt:

1. The grep that decides which yamls get a gr derivative used
   `grep '\.fesom\.'` without -F. Regex semantics interpret `\.` as
   "literal dot", so the test matched files containing the substring
   `.fesom.`, which only happens in free-text comments like
   "# Input: daily SST from FESOM (sst.fesom.*.nc ...)". The actual
   regex patterns in `pattern:` lines are double-escaped (`\.fesom\.`)
   and have NO unescaped `.fesom.` anywhere.
   Result: only 3 of 7 expected gr tiers were generated
   (cap7_ocean_gr, lrcs_ocean_gr, veg_seaice_gr: the three with chatty
   comments). core_ocean_gr, core_seaice_gr, cap7_seaice_gr,
   lrcs_seaice_gr were silently skipped.
   Fix: add -F to grep so the literal `\.fesom\.` is matched.

2. tier_throttle_caps case statement matched on `$short_tier` (e.g.
   "lrcs_seaice") only. gr variants have `_gr` appended (e.g.
   "lrcs_seaice_gr"), so they fell through to the default (no env cap),
   reverting to the hardcoded default cap of 2. That would undo the
   serial protection for the gr-side runs.
   Fix: strip trailing _gr into short_tier_base before the case. The
   throttle_group name (e.g. "lrcs_seaice_serial") is preserved by
   generate_gr_yaml.py via the inherit: copy, so the same cap key
   applies to both gn and gr variants of a tier.
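Both fixes can be reproduced in a few lines. The sketch below uses Python stand-ins for the shell logic: `grep_regex` plays the role of plain `grep`, `grep_fixed` the role of `grep -F`, and `throttle_base` the suffix strip. All three function names and the sample pattern line are hypothetical; only the comment line is quoted from the description above.

```python
import re

NEEDLE = r"\.fesom\."

# A pattern line as it would be written in a tier yaml (hypothetical example).
PATTERN_LINE = r"pattern: sst\.fesom\.\d{4}\.nc"
# A free-text comment line of the kind that triggered the accidental matches.
COMMENT_LINE = "# Input: daily SST from FESOM (sst.fesom.*.nc ...)"


def grep_regex(needle, line):
    """Like `grep NEEDLE`: the needle is interpreted as a regular expression."""
    return re.search(needle, line) is not None


def grep_fixed(needle, line):
    """Like `grep -F NEEDLE`: the needle is matched as a literal string."""
    return needle in line


def throttle_base(short_tier):
    """Strip a trailing _gr so gn and gr tiers share one throttle-cap entry."""
    suffix = "_gr"
    return short_tier[: -len(suffix)] if short_tier.endswith(suffix) else short_tier


# Bug 1: the regex form sees the comment but misses the escaped pattern line.
regex_hits = (grep_regex(NEEDLE, PATTERN_LINE), grep_regex(NEEDLE, COMMENT_LINE))
# Fix 1: the fixed-string form finds the escaped pattern line itself.
fixed_hits = (grep_fixed(NEEDLE, PATTERN_LINE), grep_fixed(NEEDLE, COMMENT_LINE))

# Fix 2: both variants of a tier resolve to the same case label.
base_names = (throttle_base("lrcs_seaice_gr"), throttle_base("lrcs_seaice"))
```

The regex form depends on whether a yaml happens to carry a descriptive comment, while the fixed-string form keys on the pattern lines that every FESOM-ingesting tier contains.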
CMIP7 cmorization for AWI-ESM3-VEG-HR
Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.
Key changes

CMIP7 data request
- `DataRequest` from `CMIP7_DReq_metadata` JSON instead of cmip6 tables
- ... (`ocean.tos.tavg-u-hxy-sea.mon.GLB`)
- `cmip6_table` → `cmip6_cmor_table` in vendored metadata
- `compound_name` matching against `cmip6_compound_name` and `cmip7_compound_name` attributes
- `table_id` from compound name when not set explicitly
- `ValueError` on zero DRV matches (instead of silent skip)

Pipeline
- `vertical_integrate` custom pipeline step
- `convert()` step from `DefaultPipeline`
- `State` objects not being unwrapped to actual results in parallel runs

Standard library
- ... (`src/pycmor/std_lib/time_bounds.py`)
- `getattr` + `_pycmor_cfg` fallback
- `global_attributes` to derive `table_id` from CMIP6/CMIP7 compound names

Xarray accessor API
- `StdLibAccessor` with `.process()`

Test infrastructure
- ... (`pycmor.fixtures.model_runs`)
- `pycmor.tutorial` dataset system (`xarray.tutorial`-style API)

Misc fixes
- `entry_points()` compatibility
- `pyfesom2` imports for environments without it

Test plan
- `pytest tests/unit/`
- `pycmor process examples/awiesm3-cmip7-minimal.yaml` runs successfully on Levante