Skip to content

Feat/cmip7 awiesm3 veg hr#266

Open
JanStreffing wants to merge 307 commits into
prep-releasefrom
feat/cmip7-awiesm3-veg-hr
Open

Feat/cmip7 awiesm3 veg hr#266
JanStreffing wants to merge 307 commits into
prep-releasefrom
feat/cmip7-awiesm3-veg-hr

Conversation

@JanStreffing
Copy link
Copy Markdown
Contributor

@JanStreffing JanStreffing commented Apr 2, 2026

CMIP7 cmorization for AWI-ESM3-VEG-HR

Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.

Key changes

CMIP7 data request

  • Load DataRequest from CMIP7_DReq_metadata JSON instead of cmip6 tables
  • Refactor to native compound-name architecture (ocean.tos.tavg-u-hxy-sea.mon.GLB)
  • Fix JSON key mismatch: cmip6_tablecmip6_cmor_table in vendored metadata
  • Improve compound_name matching against cmip6_compound_name and cmip7_compound_name attributes
  • Derive table_id from compound name when not set explicitly
  • Strict ValueError on zero DRV matches (instead of silent skip)

Pipeline

  • Add generic vertical_integrate custom pipeline step
  • Remove duplicate convert() step from DefaultPipeline
  • Fix Prefect State objects not being unwrapped to actual results in parallel runs
  • Propagate pipeline/flow errors instead of silently logging them

Standard library

  • Add time bounds support (src/pycmor/std_lib/time_bounds.py)
  • Fix dimension mapping to use getattr + _pycmor_cfg fallback
  • Fix global_attributes to derive table_id from CMIP6/CMIP7 compound names

Xarray accessor API

  • Lazy accessor registration and StdLibAccessor with .process()

Test infrastructure

  • Modernize with entry-point model discovery (pycmor.fixtures.model_runs)
  • Add pycmor.tutorial dataset system (xarray.tutorial-style API)
  • Fix stub generator to use monotonic coordinate values for multi-file datasets

Misc fixes

  • Python 3.9 entry_points() compatibility
  • Guard pyfesom2 imports for environments without it
  • Fix tarball double-nesting extraction on Python 3.12+
  • Rename non-standard time dimension on load (OpenIFS support)

Test plan

  • Unit tests pass: pytest tests/unit/
  • pycmor process examples/awiesm3-cmip7-minimal.yaml runs successfully on Levante
  • core_atm runs with tco95_core2 on Levante
  • core_land runs with tco95_core2 on Levante
  • core_ocean runs with tco95_core2 on Levante
  • core_seaice runs with tco95_core2 on Levante
  • lrcs_land runs with tco95_core2 on Levante
  • lrcs_oce runs with tco95_core2 on Levante
  • lrcs_seaice runs with tco95_core2 on Levante
  • cap7_atm runs with tco95_core2 on Levante
  • cap7_ocean runs with tco95_core2 on Levante
  • cap7_seaice runs with tco95_core2 on Levante
  • cap7_land runs with tco95_core2 on Levante
  • cap7_aerosol runs with tco95_core2 on Levante
  • veg_atm runs with tco95_core2 on Levante
  • veg_land runs with tco95_core2 on Levante
  • veg_seaice runs with tco95_core2 on Levante
  • extra_atm runs with tco95_core2 on Levante
  • extra_land runs with tco95_core2 on Levante
  • final check of all rules.

@JanStreffing JanStreffing changed the base branch from main to prep-release April 2, 2026 12:55
@JanStreffing JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 1a42875 to 5617a18 Compare April 2, 2026 13:18
@JanStreffing
Copy link
Copy Markdown
Contributor Author

I was able to run both tos and the more complex absscint with the lastest commit on this branch. We may want to work on #267, and certainly need to work on #265. But neither should block us from starting to build up more rules for the picontrol variables.

@JanStreffing
Copy link
Copy Markdown
Contributor Author

JanStreffing commented Apr 5, 2026

The single failing test (test_library_process[fesom_dev-native-dask-cmip6]) is unrelated to this PR. It failed due to a network issue — the test tried to download pi_uxarray.tar from nextcloud.awi.de but got ConnectionError: [Errno 111] Connection refused. The AWI Nextcloud server was unreachable from the GitHub Actions runner.

Something to worry about? @pgierz

@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@mandresm
Copy link
Copy Markdown
Contributor

mandresm commented Apr 7, 2026

@pgierz, will you review this or should I go for it?

@pgierz
Copy link
Copy Markdown
Member

pgierz commented Apr 7, 2026

I will look

@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
JanStreffing and others added 6 commits April 9, 2026 15:36
CMIP7 salinity units are "1E-03" (dimensionless with scaling factor).
pint-xarray cannot convert to such targets, raising "Unit expression
cannot have a scaling factor". When both source (e.g. psu) and target
are dimensionless, relabel without numeric conversion since the values
are already in the correct range.
All ~33 custom step functions hardcoded the CMOR variable name
(e.g. result.name = "zostoga") instead of using rule.model_variable.
This caused KeyError in set_variable when model_variable differs from
the CMOR name (e.g. model_variable="temp" but name was "zostoga").

Also fix compute_mass_transport to handle W-level data (nz=48
interfaces) by averaging to cell centers (nz=47) before multiplying
by layer thickness.
Prefect's embedded SQLite server cannot handle 27+ concurrent rule
submissions, causing cascading 503 errors. Set parallel: False to
process rules serially until a proper Prefect server is available.
Applied to all 17 awi-esm3-veg-hr configs and core2 test config.
Runs pycmor on a compute node with local Dask cluster, Prefect on
local /tmp to avoid NFS SQLite locking. All 27 core_ocean rules
complete successfully with this setup.
- field_def: add u10m/v10m aliases for 10u/10v (YACC parser cannot
  handle field names starting with digits in expressions)
- file_def: use u10m/v10m in sfcWindmax expression, move evspsblpot
  and sbl from separate monthly file to existing daily land file,
  rename legacy 6h file suffix to avoid conflicts
New env knob ``SHARD_DRS`` (default ``off``) toggles pycmor's
``enable_output_subdirs`` config option. When on:

  * The injected per-shard yaml sets
    ``pycmor.enable_output_subdirs = True``, so files.py's filename
    builder appends the DRS sub-tree returned by GlobalAttributes
    .subdir_path():

        <mip_era>/<activity_id>/<institution_id>/<source_id>/
            <experiment_id>/<member_id>/<table_id>/<variable_id>/
            <grid_label>/v<YYYYMMDD>/

    e.g. ``CMIP7/CMIP/AWI/AWI-ESM-3/picontrol/r1i1p1f1/Amon/tas/gn/
    v20260515/tas_<...>.nc``

  * The submitter collapses the per-tier ``<tier>/cmorized/`` OUTSUB
    to ``.`` so all 17 tiers land in one shared DRS root. The non-DRS
    nesting was a pycmor-specific organizational layer that downstream
    consumers don't expect at the start of the path.

Off by default — keeps the legacy per-tier flat layout for any current
consumers. Turn on via ``SHARD_DRS=on bash submit_hr_year_shards.sh ...``
for production / publication-ready output.
…zed layout

build_html_report: render_file_card was showing variable-group bounds
(whichever cadence happened to merge first) for every per-file card.
21 variables with multiple cadence rows displayed wrong "Expected"
columns and range SVG. Prefer rec.expected_* over ent.expected_*.

sanity_check: walker glob() missed files under <root>/<tier>/cmorized/
(produced when the runner emits a DRS-shaped subtree). Switch to
rglob() so the walker handles both flat and nested layouts.

run_walker_compute.sh: new sbatch wrapper to walk on compute nodes;
login-node OpenBLAS pthread limit + Lustre contention crashes the
worker pool on the big 8-19 GB 1hr/3D-model-level files.

run_perfile_maps_compute.sh: paths bumped to cli37 run.
build_maps: collapse all non-spatial dims (time + level + tile) by the
matching reducer per panel (min/min, mean/mean, max/max) so 3D fields
get full spatial coverage instead of mostly-NaN deep-level slices.
Other fixes: rglob() for nested cmorized/ layout, sibling-file lat/lon
recovery for hxy-si nod2 files that strip coords, dask-chunked open so
84-GB cl_day/pfull_day render without OOM, 2-98 percentile colorbar
clipping, prefer lat/lon over latitude/longitude for spatial-dim
discovery on 3D atm files.

build_html_report: per-file card now uses rec-level expected bounds
(was showing variable-group bounds, wrong for 21 cadence-split vars
incl. hfls, hfss, pr, prsn, rsds, rsus, uas, vas). NH/SH suffix on
hemispheric scalar files (siarea, siextent, sisnmass, sivol) with
North->South word substitution. FAIL diagnoses now append a
"mean-in-bounds, outlier-extremes" hint when the field mean sits
inside the expected window (e.g. hfls_mon FAIL is single-cell, not
field-wide).

sanity_check_ranges: recalibrate bounds with reviewer feedback —
phcint allow negatives (T<0degC at high lat), masscello to AWI-ESM
vertical discretization (5125..358750 kg/m^2), prra and siconc to
ice-zone/global walker semantics, cl to per-layer (~5%) not
column-integrated (~30%), umo/vmo/sfx/sfy reflect compute fix below.

custom_steps.compute_mass_transport, compute_salt_transport: integrate
across the FESOM cell edge by multiplying by sqrt(cell_area). Was
emitting kg/(s*m) but the CMIP7 spec for Omon.{umo,vmo,sfx,sfy}
requires kg/s. New _fesom_edge_width helper resolves the horizontal
dim against (nod2, ncells, ncol) and falls back to a clear error if
the mesh lacks cell_area. Order-of-magnitude check: typical FESOM HR
mid-lat cell now ~4.5e7 kg/s, strong-current cell ~1.8e9 kg/s.

cli37 walker output: 743 files, PASS 394 / WARN 251 / FAIL 96 /
ERROR 2.
@JanStreffing JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 09fe51e to 900c575 Compare May 18, 2026 14:44
JanStreffing and others added 26 commits May 18, 2026 16:52
… lat/lon

Implements the cmor/CMIP7 corrections from the cli37 sea-ice review.

hxy-si rules now apply mask_where_no_seaice so the output matches CMIP7
cell_methods "area: time: mean where sea_ice (mask=siconc)". Previously
prra/prsn (and all other rules using scale_pipeline / snm_pipeline /
sispeed_pipeline / sistressave_pipeline / sistressmax_pipeline /
siflcondtop_pipeline / sisnhc_pipeline / sitempbot_pipeline / sifb_pipeline /
simpeffconc_pipeline / constant_field_pipeline / sisnhc_from_msnow_pipeline /
fraction_to_percent_pipeline / sfdsi_from_fw_ice_pipeline) wrote the
field over the full FESOM domain including the tropics. New
scale_mask_pipeline / fraction_to_percent_mask_pipeline /
sfdsi_from_fw_ice_mask_pipeline cover the cases where the original
pipeline is shared with a non-hxy-si caller; the rest got
mask_where_no_seaice added in-place since every consumer is hxy-si.
The 17 direct-passthrough hxy-si rules (no pipelines: block) remain
unmasked — the audit directive was scoped to "every hxy-si rule
pipeline".

sisaltmass: scale_factor 0.004 -> 3.64. FESOM's m_ice is effective ice
height per cell area (units 'm', long_name "ice height per unit area"
in io_meandata.F90), not kg/m² as the prior rule comment assumed. The
correct salt-mass conversion is (sice/1000) * m_ice * rho_ice =
0.004 * m_ice * 910 = 3.64 * m_ice. sice=4.0 verified from
Final_CMIP7_IO_Test_06/config/fesom/namelist.ice; rho_ice=910 (AOMIP)
from fesom-2.7 MOD_ICE.F90:61.

sidmasstran[xy]: compute_ice_mass_transport now does
uice * m_ice * rho_ice * sqrt(cell_area) and emits kg/s. Reuses the
existing _fesom_edge_width helper (added by 900c575 for umo/vmo/sfx/sfy).
Previously the formula was uice * m_ice mislabelled "kg/s" — actually
m²/s, ~5 orders of magnitude too low.

regrid_oifs_to_fesom: assign_coords lat/lon on nod2 after the isel.
Source-grid lat/lon were attached to the now-removed source_dim; without
re-attaching the FESOM lat/lon, the output had only nod2 as a dim and
external tools (ushow, Panoply, ncview) couldn't render the field. The
sanity-check map renderer already worked around this with
_find_sibling_latlon; that workaround stays as a defence-in-depth fallback.

sisnthick: not added — CMIP7 v1.2.2.2 has no sisnthick out_name. The
CMIP7 equivalent in seaIce is snd (standard_name surface_snow_thickness,
units m), already produced by core_seaice/snd and cap7_seaice/snd_day.
…ealm

The walker recorded ``realm`` from the bounds-table lookup, which is
keyed by out_name only. For out_names that exist in multiple CMIP
realms (rlds in atmos+seaIce, prra, prsn, evspsbl, rsds, rsus, rlus,
ts, ...) every branded variant got stamped with whichever realm the
bounds table happened to store — typically the atm realm. The
``-hxy-si`` files then landed on ``atm.html`` even though their own
``:realm`` global attribute correctly said ``seaIce``.

worker_main now reads ``nc.realm`` and prefers it over the table value
for the emitted JSONL record. The file's own CMIP global attrs are
authoritative; the bounds-table realm is the fallback for ERROR or
NOBOUNDS records that never got a realm written.

collapse() in build_html_report.py now keys VarEntry by (var, realm)
instead of var alone, so a variable that appears under two realms
splits into two cards routed to the right HTML page. Previously
``ent.realm`` was "first record wins" — bundling all branded variants
under whichever record happened to come first in the JSONL.

Reported by Nadine on the cli37 report (rlds-si, prsn-si, ts-si etc.
showing up in atm.html with global data).
Each disabled rule emitted the same compound_name as an existing core
or cap7_atm rule with divergent content. Same metadata in both files
(long_name, standard_name, cell_methods, realm, units) so downstream
tools couldn't tell the LPJ-GUESS value from the OIFS one. CMIP7 ships
the OIFS coupled value as the canonical evspsbl/mrro/mrros/mrso/
mrsol/snc/snd/snw.

Re-enable under distinct CMIP7 names (e.g. evspsblveg for canopy,
tran for transpiration) if LPJ-specific land-hydrology values are
needed for vegetation analysis later.

core_land/areacella: kept enabled (duplicates core_atm but content is
byte-identical and each tier needs its own copy for cell_measures
resolution in stand-alone tier delivery — added comment noting this
is intentional).

Audit:
  grep -h '^\s*compound_name:' awi-esm3-veg-hr-variables/*/cmip7_*.yaml     | grep -v '^\s*#' | sort | uniq -c | awk '$1>1'
  -> 1 line (atmos.areacella, intentional)
…sin/sltbasin lat attrs

Christian's cli37 review flagged four ocean-side recipe issues that
fall on the pycmor side. All addressed here; the remaining items
(wfo / obvfsq / wo / volcello / opottempdiff bounds) are either
model-team questions or sanity-bound tuning, not recipe bugs.

hfds: FESOM `fh` is positive=UP (ice_oce_coupling.F90:418 writes
``heat_flux_in = heat_flux = -net_heat_flux``; density-flux equation
at line 654 confirms positive `heat_flux_in` corresponds to ocean
cooling/densification). CMIP7 `hfds` is positive=DOWN, so the rule
now applies `scale_factor: -1.0` via scale_pipeline. Previously hfds
was a bare passthrough, giving cli37 reviewers a North Atlantic that
appears to GAIN heat in January.

msftbarot: f_min default 1e-5 → 2.5e-5 in compute_msftbarot. The
geostrophic SSH approximation `psi = rho_0 * g * H / f * eta` breaks
down within ~10° of the equator, and at f=1e-5 only |lat| < ~4° gets
masked — leaving a band ±4-10° where f is small but finite, producing
the "weird artifact near the equator" Christian flagged. f_min=2.5e-5
extends the NaN mask to ±~10°.

compute_mass_transport: vertical component (transport_component='z',
used by wmo) now multiplies by `cell_area` instead of
`dz * sqrt(cell_area)`. The horizontal-edge formula
(dz * sqrt(cell_area), correct for umo/vmo) under-counted vertical
flux by ~dz/sqrt(cell_area) ≈ 50/1e4 ≈ 200× at FESOM HR resolution.

hfbasin / sltbasin: `lat` coordinate variable was attached as a bare
numpy array with no attributes; the written file then had `lat`
without `standard_name`/`units`/`axis`. Added CF attrs on both
compute_hfbasin_tripyview and compute_sltbasin_tripyview output paths.
…iew)

difmxylo: HR-FESOM (DARS ~10km) Laplacian eddy mixing values are
much lower than the coarse-resolution defaults the original bounds
assumed (Christian: "for higher resolution values should be lower as
far as I read up on this"). cli37 cmor observed mean ~0.01 m²/s,
max ~1.6 m²/s — bound expected_mean was ~1000, max ~1e4 (~5 orders
too high). New bounds 0 / ~0.01 / ~5 m²/s give PASS on cli37 stats
and stay narrow enough that a 100× regression would still WARN.

sfdsi: Christian: "pattern looks good - not sure why the bounds fail
here". cli37 cmor observed min -1.6e-4, max 6e-5 kg m-2 s-1; previous
bound min/max ±1e-5 was too narrow — Arctic freezing bursts in
fully-resolved sea-ice can hit ~1e-4 kg m-2 s-1 even though the
global mean stays near zero. Widened to ±5e-4 (~5× the observed peak).
Recipe commit e7306e2 added scale_factor: -1.0 to hfds (FESOM `fh` is
positive=up; CMIP positive=down). The bounds table still encoded the
pre-flip expectation:
  - expected_mean=~2 (typical positive piControl drift)
  - bounds ±300 W/m²

Post-flip, observed mean becomes negative (cli37 was +14.2 → post-fix
-14.2). The classify() sign-mismatch check at sanity_check.py:337-338
auto-FAILs when expected_mean > 0 and observed < 0, so the bounds
need updating regardless of magnitude.

Also widening to ±500: cli37 max=1910 W/m² (post-flip min=-1910)
exceeds ±300, and CMIP6 literature notes per-cell monthly peaks
reach ±500 W/m² in deep-convection / strong air-sea contrast regions.
Beyond ±500 stays a FAIL — sentinel for outlier cells that warrant
investigation (1910 W/m² is real but extreme).

expected_mean ~2 → ~0: piControl steady state has global mean
near 0; small drift in either direction is fine. The em==0.0 branch
of classify() (line 328) tolerates |mean|/envelope up to 10% as PASS.
Christian's cli37 review flagged hfbasin as FAIL on max=7.6 PW vs the
±2 PW bound. cli37 inspection shows the recipe is correct: annual
climatology matches Trenberth almost exactly (global peak +1.96 PW
@ 16°N; Atlantic +1.08 PW @ 16°N vs Trenberth ~1.3 PW). The 7.6 PW
is a monthly Indo-Pacific extreme — basin-sum peak across (lat, time).

Literature for monthly hfbasin extremes:
  - Trenberth & Caron 2001: global annual peak ~2 PW @ 35°N
  - RAPID 26.5°N Atlantic monthly range 0.2-2.5 PW (Johns 2011)
  - Pacific tropical cell NHT 1.75±0.30, SHT -1.69±0.55 PW
  - North Pacific 24°N seasonal cycle 0 → 1.1 PW (Bryden et al.)
  - HR mesoscale eddies add 30-50% over coarse models

Combined: global = basin-sum at peak latitude reaches 5-8 PW
monthly extreme in HR models. cli37's 7.6 PW max sits within this
range. Widen bounds to ±10 PW so monthly extremes PASS while a
2× regression still WARNs and 5× regression FAILs.
- nep: -5e-6 / 2e-7 (was -5e-8 / 5e-8) — Central America wet-tropics
  drainage spikes are real model output (cli37 min -1.93e-6, max 1.01e-7)
- mrrob: max 5e-3 (was 1e-4) — singular grid-cell spikes are real,
  global field fine (cli37 max 2.34e-3)
- fHarvestToAtmos: ~1e-10 mean / ~1e-8 max (was ~0/~0) — 1850 LUH3
  state has cropland+pasture that keeps being harvested every year
- tsl: floor 150 K (was 220 K) — LPJ-GUESS shallow-layer tracks
  OpenIFS forcing when uninsulated; 1.5% of values <220 K in NH winter
- vegHeight: rationale text fixed — LPJ-GUESS doesn't emit grass-only
  height, cmor falls back to tree-dominated field

Reclassify on test06 JSONL: 5 FAILs cleared (2 nep, 2 fHarvestToAtmos,
1 mrrob FAIL->WARN).

Three "should be zero" vars (fAnthDisturb, fNAnthDisturb, fNfert) NOT
relaxed despite cli37 showing 1e-10..1e-8 values — Laszlo says these
should be truly ~0 under frozen 1850 LU; spawned as D6 investigation.
…source

Three new pipeline steps in examples/custom_steps.py:

  clip_small_negatives  — sets [-thr, +thr] to 0 (default thr=1e-10).
                          Clears tiny float-noise that propagates from
                          raw .out files. Reviewer (Laszlo) confirmed
                          these are not a bug to chase upstream.

  clip_floor_zero       — floors negatives at 0. For variables whose
                          physical floor is 0 but where the pycmor
                          pipeline introduces negatives not present in
                          the raw model output (mrsol, rhSoil per
                          Laszlo).

  broadcast_yearly_to_monthly
                        — repeats each yearly value across 12 mid-month
                          time stamps. Used where the native monthly
                          file is LAI/phenology-weighted and the yearly
                          file is the authoritative stand-area source.

Pipeline wiring:

  Phase B (clip_small_negatives inserted after loader in 4 pipelines):
    cap7_land: lpjg_monthly_pipeline
    veg_land:  lpjg_monthly_pipeline, lpjg_monthly_lut_pipeline,
               mrtws_pipeline

  Phase C (new dedicated pipelines, 2 rules rewired):
    cap7_land: lpjg_monthly_clip0_pipeline       <- rhSoil_mon
               lpjg_monthly_depth_clip0_pipeline <- mrsol_mon
    Both add clip_floor_zero on top of clip_small_negatives.

  D4a (new pipeline, treeFrac total source switch):
    cap7_land: lpjg_yearly_to_monthly_pipeline   <- treeFrac_mon
               treeFrac_mon now sources treeFrac_yearly.out (was
               treeFrac_monthly.out). The native monthly file is
               LAI-weighted; the yearly file is correct stand area.

variable_attributes.comment added on treeFrac_mon, rhSoil_mon, mrsol_mon
documenting the derivation / clipping for downstream traceability.

The per-PFT tree fractions (treeFracBdlDcd, treeFracBdlEvg,
treeFracNdlDcd, treeFracNdlEvg) remain on the broken monthly source
pending D4c — Laszlo signoff on the synthesis math AND/OR a possible
LPJ-GUESS upstream fix he pushed today emitting yearly per-PFT files.
See HANDOFF_d4_treeFrac_per_pft.md.
PLAN_veg_land_sanity_fixes.md — triages Laszlo + Christian feedback
into 5 phases (A bounds, B clip-band, C sign-fix, D investigations,
E verify) with explicit execution order. Round-1 review folded in.

HANDOFF_d4_treeFrac_per_pft.md — investigation of treeFracNdlDcd
annual-cycle bug. Documents:
  - root cause: LPJ-GUESS monthly per-PFT output is LAI/phenology
    weighted, not stand area
  - file inventory: treeFrac_yearly.out exists (total), per-PFT
    yearly files do NOT
  - four-option matrix (A skip / B synthesize / C max-only /
    D wait-for-upstream)
  - Option B math + per-cell normalization + divide-by-zero handling
  - D4b ratio sanity check on cli37: median 1.010, 88% within ±20%
    — math premise sound
  - escalation channels + comment-attribute draft

REVIEW_veg_land_sanity_fixes_round1.md — external review of the plan;
folded into PLAN.

REVIEW_d4_treeFrac_per_pft.md — external review of the handoff;
folded into HANDOFF.

Three "should be zero" variables spawned as D6 from Phase A data
anchor: fAnthDisturb, fNAnthDisturb, fNfert have 1e-10..1e-8 values
in cli37 despite frozen 1850 LU — pending Laszlo round-3 investigation.
cli37 wo file inspection: olevel = [0, 5, 10, 20, 30, ..., 6250] —
i.e. mesh.depth_bnds values (interface depths), not mesh.depth
(midpoints), even though olevel:name = "nz1". FESOM 2.7's
io_meandata.F90:1520 def_streams w on the top N layer interfaces
(N=57 for DARS), with the surface at z=0 and the seabed BC implicit.

The reviewer's "uppermost layer not too bad, those below are noisy"
is the surface BC literally preserved (w=0 at interface 0) while every
deeper interface carries diagnostic-w noise from integrating
horizontal divergence down from the surface. The recipe was a bare
passthrough that emitted these interfaces verbatim with a misleading
midpoint-style coord name.

New compute step ``average_w_interfaces_to_midpoints``:
  midpoint[i]   = 0.5 * (w[i] + w[i+1])    for i = 0 … N-2
  midpoint[N-1] = 0.5 * w[N-1]              (bottom BC w_seabed=0)
The vertical coord is replaced with mesh.depth (CMIP midpoint axis)
and the dim renamed to nz1. ``compute_mass_transport`` already does
the same averaging internally for ``wmo``; this factors it out for
use in the new wo_pipeline.

Effects:
  - Surface BC folded into first midpoint → no jump between layer 0
    and 1; the "clean top" outlier disappears.
  - Output on 57 cell-centre midpoints, matching CMIP wo convention.
  - Vertical coord values switch from interface depths to midpoint
    depths.
…fNAnthDisturb, fNfert

LUH3 1850 wood-harvest transitions (primf/secmf/secnf_harv) and a small
implicit LPJ-GUESS management N baseline make these fluxes non-zero in
piControl
The OIFS branch feature/cmip7-rh-online (merged into local_combined_fixes
2026-04-17) computes CMIP/CF-conformant relative humidity online every
timestep using Alduchov-Eskridge Magnus with the hard water/ice switch
at 273.15 K, and sends `hur_cmip7` (model levels) and `hurs_cmip7` (2m)
to XIOS. Until now those sends were dropped because no XIOS field
declared the IDs.

field_def_cmip7.xml.j2:
- declare raw `hurs_cmip7` (fraction) in 2D_physical
- declare raw `hur_cmip7` (fraction) in 3D_ml
- add percent alias `near_surface_relative_humidity_pct__hurs` (name="hurs")
- repoint `relative_humidity_ml__hur` from `r` (FOEEWM mixed-phase QSAT,
  NOT CMIP-conformant) to `hur_cmip7`

file_def_oifs_cmip7_spinup.xml.j2:
- write `hurs` into atmos_mon (monthly surface)
- write `hurs` into atmos_day (daily surface)
- atmos_mon_ml already references relative_humidity_ml__hur, so it
  picks up CMIP7-conformant hur automatically

Addresses Felix Pithan's 2026-04-15 feedback: IFS-native r/r_pl uses
mixed-phase QSAT interpolation between RTICE and RTWAT (not CMIP-CF),
and post-hoc RH from monthly-mean ta/hus is biased due to nonlinear
e_sat(T). Computing online and averaging downstream fixes both.

Pycmor-side: hurs_pipeline (post-hoc Magnus from 2t+2d) is left in
place for now; can be retired once new XIOS hurs output is validated.
…ve ice pipelines

Commit 5667ab2 added mask_where_no_seaice broadly to every hxy-si rule.
That was the right fix for prra (Christian's explicit complaint: "rule
cell_methods: where sea_ice is not applied - one sees a global field
instead of a polar field") and for the ~9 atm-regridded rules
(rlds/rlus/rsds/rsus_seaice, sbl_seaice, sifllattop, siflsenstop) that
carry atmospheric fluxes over the entire surface and need masking to
the ice domain for CMIP cell_methods compliance.

For the remaining 13 FESOM-native ice pipelines, the underlying FESOM
field is intrinsically zero (or NaN) outside the ice zone — the model
only updates these where ice exists. Masking those is at best
cosmetic (0 -> NaN) and at worst causes downstream dask graph stalls:
in cli38 the lrcs_seaice tier OOM'd at MaxRSS 525 GiB on a 512 GiB
cgroup; halving N_WORKERS produced an identical 525 GiB peak (memory
is per-rule, not per-worker), and halving SHARD_SIZE to spread the
load eliminated the OOMs but exposed a save_dataset wedge in
sivol/sihc/snm/siflfwbot/etc. with no I/O progress for 45+ minutes —
the dask graph for `data.where(mask)` was not being released before
save_dataset, deadlocking the worker pool.

Removed mask step from these pipelines:
  scale_mask_pipeline, sfdsi_from_fw_ice_mask_pipeline,
  fraction_to_percent_mask_pipeline, snm_pipeline, sisnhc_pipeline,
  sispeed_pipeline, sistressave_pipeline, sistressmax_pipeline,
  siflcondtop_pipeline, sitempbot_pipeline, sifb_pipeline,
  simpeffconc_pipeline, constant_field_pipeline.

Kept mask in these pipelines (genuinely needed for atm-regridded fields):
  regrid_atm_to_fesom_seaice_mask_pipeline
  regrid_atm_to_fesom_seaice_mask_negate_pipeline

Rules affected by the removal still inherit their previous correctness
fixes from 5667ab2 (scale_factor on sisaltmass, cell-width factor on
sidmasstran[xy], lat/lon attachment) — only the unnecessary mask step
is gone.
Upstream release Dec 19 2025 (v1.2.2.2 was Sep 30 2025). Same 1974
variables in both, but the compound_name schema changed in three ways:

- region suffixes lowercased: GLB/NH/SH -> glb/nh/sh
  (552 .GLB + 8 .NH + 8 .SH renames across 17 rule yamls)
- ATA/GRL polar codes lowercased: ata/grl (no project rules affected)
- 13 real compound_name schema renames touching this project:
  - atmosChem.{cfc11,cfc12,ch4,n2o}.tavg-u-hm-u -> .tavg-u-hm-air
  - land.tas.tavg-u-hxy-u.1hr -> .tavg-h2m-hxy-u.1hr (glb + 30S-90S)
  - land.tslsi.tavg-u-hxy-u.day -> .tavg-u-hxy-lsi.day
  - seaIce.sisnmass.tavg-u-hm-u -> .tavg-u-hm-si (day/mon x nh/sh)
  - atmos.ts.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day
  - land.esn.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day

Source default in std_lib/global_attributes.py:136 updated from "GLB"
to "glb" to match the new lowercase convention when get_region() falls
back to the default (rule paths already supply region explicitly via
compound_name).

Validation: all 17 rule yamls construct CMORizer cleanly against the
new metadata.json; 548/552 compound_names resolve in v1.2.2.3 (the 4
unmatched are Laszlo's per-PFT yearly diagnostic rules, extra-CMIP).

Metadata cache populated at:
  ~/.cache/pycmor/cmip7_metadata/v1.2.2.3/metadata.json (kept v1.2.2.2)

Substantive content edits in v1.2.2.3 (for variables we use):
- atmos.{hurs,rsds,sfcWind}.tavg-...-1hr.30S-90S: cleaner CF-standard
  long_name values (long_name was internal text in v1.2.2.2).
cli39 veg_land shards 2/3/4 failed with
  ValueError: Rule 'treeFracBdlEvg_yr' with
  compound_name='land.treeFracBdlEvg.tavg-u-hxy-u.yr.GLB' did not match any
  variables in the CMIP7 data request

CMIP7 only has the per-PFT treeFrac variants at MONTHLY cadence:
  Emon.treeFracBdlDcd / Emon.treeFracBdlEvg / Emon.treeFracNdlDcd / Emon.treeFracNdlEvg

No yearly Eyr.treeFrac{BdlDcd,BdlEvg,NdlDcd,NdlEvg} entries exist. The monthly
counterparts are already wired in cap7_land/cmip7_awiesm3-veg-hr_cap7_land.yaml
and cover the CMIP7 requirement.

Removes:
- treeFracBdlDcd_yr
- treeFracBdlEvg_yr
- treeFracNdlDcd_yr
- treeFracNdlEvg_yr

Keeps the aggregate treeFrac_yr (maps to valid Eyr.treeFrac).
cli39 extra_land (job 25003750_1) hit 3h walltime with 4 save_dataset
ops stuck at heartbeat #16:
  ⟳ save_dataset[tas]    still running (t=960s, #16)
  ⟳ save_dataset[mrsow]  still running (t=960s, #16)
  ⟳ save_dataset[dslw]   still running (t=960s, #16)
  ⟳ save_dataset[orog]   still running (t=960s, #16)

All four are long-running surface saves (1-hourly tas SH-subset, daily
mrsow/dslw via swvl1..swvl4 + temporal_diff, fx orog SH-subset). With
extra_land's 18 rules all in a single shard at SHARD_SIZE=20, these
four kicked off parallel saves that contended on the synchronous
netcdf write scheduler. 9 saves completed cleanly, then the 4 wedged.

Lower SHARD_SIZE specifically to 5 for extra_land — splits 18 rules
into ~4 shards, distributing the contended saves across them. Each
shard's individual save peak stays small enough to finish inside 3h.
Other tiers keep the global SHARD_SIZE=20.
Both tiers hit TIMEOUT in cli40 due to multiple save_dataset operations
contending on the global HDF5 write lock with no progress at heartbeat
#87+ over 3h. Sequence in lrcs_seaice_3:
  siarea 176×, siextent 89×, sisnmass 87×, sfdsi 87×, sihc 86×, sidmasstrany 86×
veg_land_1 wedged on vegHeight/dgw/tslsi/esn at heartbeat #21.

Wall-vs-Σsave analysis (cli40 successful shards) shows these two tiers
already run nearly serially under lock contention:
  lrcs_seaice_4: ratio 1.18× (12% wall-time penalty if forced serial)
  lrcs_seaice_2: 1.8×
Forcing strict serial eliminates the wedge with negligible cost.

3D-heavy tiers (cap7_atm 4.1×, core_ocean 3.7×, extra_atm 1.8×) genuinely
need parallel saves to stay inside 3h walltime, so they are NOT
modified.

Mechanism: every pipeline in the two yamls now declares
  throttle_group: <tier>_serial
and the pycmor: section adds
  throttle_caps:
    <tier>_serial: 1
which caps in-flight rules of that group at 1 per batch.

For lrcs_seaice this also subsumes the previous oifs_regrid cap=2 (which
was set for driver-RSS, not the lock); cap=1 trivially satisfies it.
cli41 logs showed "Throttle caps (per-group rule submission limit):
none" in both lrcs_seaice and veg_land — the yaml-level `throttle_caps`
block I added in 5a61ad3 got silently dropped by the Everett-based
PycmorConfigManager (only Options declared in PycmorConfig are
preserved through schema normalisation; unknown keys vanish).

Result: default per-group cap of 2 applied, lrcs_seaice still wedged on
9+ parallel save_dataset operations:
  cli41 _1: TIMEOUT (heartbeat #16, 9 saves stuck)
  cli41 _3: TIMEOUT (same pattern)

The cmorizer `_resolve_throttle_caps()` already reads PYCMOR_THROTTLE_CAPS
from the environment as a higher-priority source than the yaml. Move
the per-tier cap declaration to the launcher.

submit_hr_year_shards.sh:
- New per-tier `tier_throttle_caps` case:
    lrcs_seaice  → "lrcs_seaice_serial:1"
    veg_land     → "veg_land_serial:1"
    (others)     → ""
- Threaded into `--export=...,PYCMOR_THROTTLE_CAPS=$tier_throttle_caps`

Yaml cleanup:
- Removed the dead `throttle_caps:` block from both yamls
- Kept the explanatory comment so the next reader knows where the cap
  actually lives (the launcher)

Pipeline-level `throttle_group:` annotations from 5a61ad3 are unchanged
and are still read by Everett (declared on the Pipeline schema).
cli43 lrcs_seaice_3 TIMEOUT despite cap=1 being correctly applied:
  Throttle caps: {'lrcs_seaice_serial': 1}
  ... but 5 saves concurrent and stuck at heartbeat #101+

Root cause: `_rule_throttle_group()` only walked `rule.pipelines[*]`.
20 of 64 lrcs_seaice rules (siarea_north, siextent_north_day,
siflcondbot, sidconcth, sitimefrac, ...) have no `pipelines:` key —
they use pycmor's default pipeline, which carries no throttle_group,
so they landed in `_unthrottled` and ran in parallel through the
HDF5 global write lock.

Two-part fix:
1. cmorizer.py: read rule.throttle_group first, fall back to pipeline
   annotation. Rule-level wins (per-rule override), preserves existing
   pipeline-level behaviour.
2. lrcs_seaice + veg_land yamls: add `throttle_group: <tier>_serial`
   to the `inherit:` block. With validate.py:299 already accepting
   `throttle_group` on the rule schema, inherit propagates it to every
   rule — pipelined or not.

Effect on cli44+: all 64 lrcs_seaice rules (and all veg_land rules)
join their tier's serial group; PYCMOR_THROTTLE_CAPS env var caps it
at 1; the multi-rule HDF5-lock wedge can no longer form.
esm_tools commit f3c2a3f9 added a regridded-output rename script that
produces "<var>.fesom.gr.<year>.nc" alongside the legacy native
"<var>.fesom.<year>.nc" in outdata/fesom/ (e.g. Test_v342_1y_01).
The existing pycmor patterns

    pattern: <var>\.fesom\..*\.nc

would now silently match BOTH native and gr files. load_mfdataset
would then try to concat a 3.15M-nod2 unstructured dataset with a
360x720 regular lat/lon dataset and either error or, worse, produce
garbage.

Tighten the pattern to require a 4-digit year right after `.fesom.`:

    pattern: <var>\.fesom\.\d{4}\.nc

This matches the native filenames only (e.g. `sst.fesom.1586.nc`).
The regridded filenames (`sst.fesom.gr.1586.nc`) have an extra `.gr.`
segment between `fesom` and the year, so they are now unambiguously
excluded from these `gn` rules.

Mechanical change: replaced 221 occurrences of `\.fesom\..*\.nc` across
the 7 FESOM-ingesting tier yamls:

  lrcs_seaice  97
  lrcs_ocean   64
  core_ocean   23
  cap7_seaice  13
  cap7_ocean   13
  core_seaice  10
  veg_seaice    1

Behaviour for runs that do not have regridded output (pre-Test_21)
is unchanged: only native files exist, the regex still matches them.

The future `_gr` mirror tiers will use the parallel pattern
`\.fesom\.gr\.\d{4}\.nc`, which is disjoint from this one.
Add a launcher-side preprocessor that derives a `_gr` companion of every
FESOM-ingesting tier yaml at submit time. No parallel `_gr/` source tree
in the repo — the gn yaml stays the single source of truth and the gr
yaml is regenerated each run.

generate_gr_yaml.py (new):
  Filter:    keeps rules whose primary input pattern references `fesom`;
             drops mesh-derived fx rules and OIFS-source rules
             (no gr equivalent under this scheme).
  Rewrite:   replaces `\.fesom\.\d{4}\.nc` → `\.fesom\.gr\.\d{4}\.nc`
             recursively across every string value in the yaml tree
             (catches primary `pattern:` and secondary attrs like
             `aice_pattern:` uniformly).
  Inherit:   grid_label → gr
             grid       → "regular 0.5° lat/lon ..."
             nominal_resolution → "50km"
               (CMIP7 CV-bin for the 0.5° area-weighted √mean cell area
               of ~44 km. CV values are discrete: see
               CMIP7-CVs/project/native-nominal-resolution.json)
  Name:      general.name gets " (gr)" suffix for log clarity.

submit_hr_year_shards.sh: WITH_GR={yes,y,true,1,on} env var.
When set, after `repoint_hr_year.py` populates $YAMLS_DIR, scan for
yamls that reference `.fesom.` files and emit a sibling `<tier>_gr.yaml`
into the same dir. The existing per-yaml submit loop then picks them
up automatically as separate tiers (jobname suffix `_gr`), sharded and
sbatched identically to gn.

Smoke-tested across all 7 FESOM-ingesting tiers in the repo:
  core_ocean:  22 / 28 rules kept
  cap7_ocean:   7 /  7
  lrcs_ocean:  54 / 55
  core_seaice:  9 /  9
  cap7_seaice:  9 /  9
  lrcs_seaice: 53 / 64   (drops are OIFS-regrid rules with no gr source)
  veg_seaice:   1 /  1
cli45 core_seaice_1 TIMEOUTed on Test_v342_1y_01 with the now-familiar
HDF5-write-lock wedge pattern:
  1 save completed (simass 2s), then 7 in parallel all stuck @ heartbeat #16:
  siconc, ts, snd, siv, siu, sitimefrac, sithick

The three remaining FESOM-ingesting seaice tiers (core_seaice,
cap7_seaice, veg_seaice) never got the inherit-throttle treatment we
applied to lrcs_seaice / veg_land. cli42/44 happened to pass them
because the cluster was uncontended on those days; cli45 hit a busier
node and the latent vulnerability surfaced.

Mirror the lrcs_seaice/veg_land fix:
  yaml side — `inherit: throttle_group: <tier>_serial`. Rule-level
              fallback in `_rule_throttle_group()` (commit 9d4579c)
              propagates the group to every rule, pipelined or not.
  launcher — `PYCMOR_THROTTLE_CAPS=<tier>_serial:1` per tier, joining
              the existing lrcs_seaice/veg_land case in
              submit_hr_year_shards.sh.

veg_seaice cli45 actually died from a separate Prefect sqlite race
(`OperationalError: database is locked` in send_telemetry_heartbeat),
not the lock wedge — but the tier shares the same architecture, so
adding the throttle now is preventative.

Wall-time cost is small: empirical data from cli44 puts lrcs_seaice's
serial penalty at 12% for shard_4 and 0% for shard_3 (which had been
TIMEOUT before). Same shape expected here.
Two bugs in the WITH_GR=yes path discovered on cli46 first attempt:

1. The grep that decides which yamls get a gr derivative used
   `grep '\.fesom\.'` without -F. Regex semantics interpret `\.` as
   "literal dot", so the test matched files containing the substring
   `.fesom.` — which only happens in free-text comments like
   "# Input: daily SST from FESOM (sst.fesom.*.nc ...)". The actual
   regex patterns in `pattern:` lines are double-escaped (`\.fesom\.`)
   and have NO unescaped `.fesom.` anywhere.
   Result: only 3 of 7 expected gr tiers were generated
   (cap7_ocean_gr, lrcs_ocean_gr, veg_seaice_gr — the three with
   chatty comments). core_ocean_gr, core_seaice_gr, cap7_seaice_gr,
   lrcs_seaice_gr were silently skipped.
   Fix: add -F to grep so the literal `\.fesom\.` is matched.

2. tier_throttle_caps case statement matched on `$short_tier` (e.g.
   "lrcs_seaice") only. gr variants have `_gr` appended (e.g.
   "lrcs_seaice_gr"), so they fell through to the default (no env
   cap), reverting to the hardcoded default cap of 2. That would
   undo the serial protection for the gr-side runs.
   Fix: strip trailing _gr into short_tier_base before the case.
   The throttle_group name (e.g. "lrcs_seaice_serial") is preserved
   by generate_gr_yaml.py via the inherit: copy, so the same cap key
   applies to both gn and gr variants of a tier.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants