
Fix CMIP6 experiment global attribute to match the WCRP/esgvoc CV label#414

Open
rhaegar325 wants to merge 2 commits into main from fix_experiment_format_issue

@rhaegar325
Collaborator

Summary

Every CMORised CMIP6 file had its global experiment attribute written from the
legacy bundled CMIP6_CVs JSON, whose experiment field is a long
descriptive phrase (e.g. "all-forcing simulation of the recent past"). The
WCRP compliance checker (wcrp_cmip6:1.0, cc-plugin-wcrp + esgvoc) validates
that attribute against esgvoc's CMIP6 controlled vocabulary, whose label is
the short canonical name (e.g. "Historical simulation"). The two vocabularies
disagree, so the checker raised a MED-priority [ATTR007] cross-attribute
consistency failure on every affected file.

This PR resolves the experiment label from esgvoc — the same source the
checker uses — so the written value matches by construction, with a safe
fallback to the legacy value.

Problem

The bundled legacy CV and esgvoc's CMIP6 universe carry different values in the
experiment field:

| Source | experiment for historical |
| --- | --- |
| Bundled CMIP6_CVs/CMIP6_experiment_id.json (what we wrote) | all-forcing simulation of the recent past |
| esgvoc cmip6 CV (what the checker validates against) | Historical simulation |

The checker, checks/consistency_checks/check_experiment_consistency.py
([ATTR007]), does roughly:

reference_term = voc.get_term_in_collection("cmip6", "experiment_id", experiment_id)
expected_experiment = getattr(reference_term, "experiment", None)
if expected_experiment and actual_experiment is not None:
    if actual_experiment != str(expected_experiment).strip():
        failures.append(f"Inconsistency for 'experiment': CV expects "
                        f"'{expected_experiment}', file has '{actual_experiment}'.")

Two consequences matter:

  1. The authoritative value is esgvoc's term.experiment, not the bundled
    CV's.
  2. The comparison only runs when expected_experiment is non-empty. esgvoc
    returns None for most experiments (e.g. piControl, amip), so the check
    is skipped for those and the legacy value is accepted. Only experiments
    where esgvoc has a non-empty label (e.g. historical, esm-hist) failed.
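
For illustration, the skip behaviour in point 2 can be reproduced on plain strings. This is a minimal sketch, not the checker's real interface: the function name `experiment_failures` and its two string arguments are hypothetical stand-ins, and only the comparison logic mirrors the excerpt above.

```python
from typing import List, Optional


def experiment_failures(expected: Optional[str], actual: Optional[str]) -> List[str]:
    """Mimic the [ATTR007] comparison on plain values (hypothetical helper)."""
    failures: List[str] = []
    # The comparison only runs when the CV supplies a non-empty label.
    if expected and actual is not None:
        if actual != str(expected).strip():
            failures.append(
                f"Inconsistency for 'experiment': CV expects "
                f"'{expected}', file has '{actual}'."
            )
    return failures


legacy = "all-forcing simulation of the recent past"

# esgvoc carries a label for historical, so the legacy value is flagged.
historical_failures = experiment_failures("Historical simulation", legacy)

# No esgvoc label (None): the check is skipped and any value is accepted.
skipped_failures = experiment_failures(None, "some legacy description")
```

The second call shows why experiments without an esgvoc label never failed: with no expected value, nothing is compared.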

The attribute is written once, in CMIP6Vocabulary.get_required_global_attributes:

"experiment": self.experiment["experiment"],   # legacy CV description -> mismatch

Fix

Resolve the label from esgvoc and fall back to the bundled CV value. The write
site now calls a small helper:

"experiment": self._resolve_experiment_label(),

The esgvoc.api integration

The new helper CMIP6Vocabulary._resolve_experiment_label() is the core of this
change. Design points:

  • Same source as the checker. It calls
    esgvoc.api.get_term_in_collection(project_id="cmip6", collection_id="experiment_id", term_id=experiment_id) and reads
    term.experiment — the exact call and field [ATTR007] compares against. By
    construction the written value equals what the checker expects, for any
    experiment esgvoc has a label for (not special-cased for historical).

  • Lazy, optional import. import esgvoc.api as voc happens inside the
    method, not at module top. esgvoc is a checker/pixi dependency, not a hard
    runtime dependency of the core CMORiser, so importing lazily keeps it optional.

  • Safe fallback. Any failure — esgvoc not installed (ImportError), term not
    found, or an empty label — falls back to the bundled CV value
    (self.experiment["experiment"]). When esgvoc has no label the checker skips
    the comparison anyway, so the legacy value remains valid.

  • Cached per experiment. Results are memoised in a class-level
    _EXPERIMENT_LABEL_CACHE keyed by experiment_id, because esgvoc lookups
    touch a local database; each experiment is resolved at most once per process.

# Canonical CMIP6-CV ``experiment`` labels resolved via esgvoc, keyed by
# experiment_id. esgvoc lookups touch a database, so resolve each
# experiment at most once per process.
_EXPERIMENT_LABEL_CACHE: Dict[str, Optional[str]] = {}

def _resolve_experiment_label(self) -> str:
    """Return the canonical ``experiment`` global-attribute value.

    The WCRP compliance checker (cc-plugin-wcrp + esgvoc) compares the
    global ``experiment`` attribute against esgvoc's CMIP6 controlled
    vocabulary, whose label (e.g. ``"Historical simulation"``) differs from
    the descriptive phrase carried in the legacy CMIP6_CVs JSON bundled with
    this package (e.g. ``"all-forcing simulation of the recent past"``).

    Resolve the label from esgvoc so the written attribute matches what the
    checker validates. Fall back to the bundled CV value when esgvoc is
    unavailable or carries no label for this experiment -- in the latter
    case the checker skips the comparison, so the legacy value is accepted.
    """
    legacy_label = self.experiment.get("experiment", "")

    eid = self.experiment_id
    if eid not in CMIP6Vocabulary._EXPERIMENT_LABEL_CACHE:
        label: Optional[str] = None
        try:
            import esgvoc.api as voc

            term = voc.get_term_in_collection(
                project_id="cmip6",
                collection_id="experiment_id",
                term_id=eid,
            )
            if term is not None:
                label = getattr(term, "experiment", None)
        except Exception:
            label = None
        CMIP6Vocabulary._EXPERIMENT_LABEL_CACHE[eid] = label

    resolved = CMIP6Vocabulary._EXPERIMENT_LABEL_CACHE[eid]
    return resolved if resolved else legacy_label
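
The fallback and caching behaviour can be exercised without esgvoc installed. The sketch below is a simplified stand-in for the helper above, not the code in this PR: `resolve_label`, `fake_lookup`, and the module-level `_CACHE` are hypothetical names, and the esgvoc call is replaced by an injected `lookup` callable.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the class-level _EXPERIMENT_LABEL_CACHE.
_CACHE: Dict[str, Optional[str]] = {}


def resolve_label(eid: str, legacy: str,
                  lookup: Callable[[str], Optional[str]]) -> str:
    """Resolve-with-fallback pattern; `lookup` stands in for the esgvoc call."""
    if eid not in _CACHE:
        try:
            _CACHE[eid] = lookup(eid)
        except Exception:
            # Any lookup failure is cached as "no label".
            _CACHE[eid] = None
    # An empty or missing label falls back to the legacy value.
    return _CACHE[eid] or legacy


calls: List[str] = []


def fake_lookup(eid: str) -> Optional[str]:
    """Record each call and simulate three outcomes: label, no label, failure."""
    calls.append(eid)
    if eid == "broken":
        raise ImportError("esgvoc not installed")
    return {"historical": "Historical simulation"}.get(eid)


first = resolve_label("historical", "legacy text", fake_lookup)
second = resolve_label("historical", "legacy text", fake_lookup)  # served from cache
no_label = resolve_label("piControl", "legacy text", fake_lookup)
failed = resolve_label("broken", "legacy text", fake_lookup)
```

As in the helper above, a failed lookup is cached as None, so a failure is not retried within the same process.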

Why esgvoc instead of editing the bundled JSON

  • The bundled CMIP6_CVs directory is a git submodule tracking upstream
    WCRP-CMIP/CMIP6_CVs. Editing CMIP6_experiment_id.json in place diverges
    from upstream and can be wiped by git submodule update.
  • esgvoc is the checker's own vocabulary, so it stays correct as the CV evolves,
    with zero maintenance of a hand-copied label table.

Scope

  • CMIP6 only. CMIP7Vocabulary does not emit an experiment global attribute,
    so it is untouched.
  • Single attribute write path; general across all experiments, not special-cased
    for historical.

Tests

Four unit tests in tests/unit/test_vocabulary_processors.py mock esgvoc.api
via sys.modules, so they pass with or without esgvoc installed:

  • test_resolve_experiment_label_uses_esgvoc — esgvoc label overrides the legacy
    description.
  • test_resolve_experiment_label_falls_back_when_esgvoc_label_empty — empty
    esgvoc label keeps the legacy value.
  • test_resolve_experiment_label_falls_back_when_esgvoc_missing — missing esgvoc
    (ImportError) keeps the legacy value.
  • test_resolve_experiment_label_is_cached — the lookup runs at most once per
    experiment_id.
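
The sys.modules mocking approach can be sketched as follows. This is an illustration of the technique under stated assumptions, not the actual test code: `lookup_label` is a hypothetical stand-in for the helper's lazy esgvoc lookup, and the fake module only provides the single `get_term_in_collection` function the helper calls.

```python
import sys
import types
from unittest import mock


def lookup_label(eid):
    """Hypothetical stand-in for the helper's lazy esgvoc lookup."""
    try:
        import esgvoc.api as voc  # resolved through sys.modules at call time

        term = voc.get_term_in_collection(
            project_id="cmip6", collection_id="experiment_id", term_id=eid
        )
        return getattr(term, "experiment", None)
    except Exception:
        return None


# Build a fake esgvoc package whose api module returns a canned term.
fake_api = types.ModuleType("esgvoc.api")
fake_api.get_term_in_collection = mock.Mock(
    return_value=types.SimpleNamespace(experiment="Historical simulation")
)
fake_pkg = types.ModuleType("esgvoc")
fake_pkg.api = fake_api

# Case 1: esgvoc "installed" and carrying a label.
with mock.patch.dict(sys.modules, {"esgvoc": fake_pkg, "esgvoc.api": fake_api}):
    mocked = lookup_label("historical")

# Case 2: a None entry in sys.modules makes the import raise ImportError.
with mock.patch.dict(sys.modules, {"esgvoc": None, "esgvoc.api": None}):
    missing = lookup_label("historical")
```

Because `mock.patch.dict` restores sys.modules on exit, the same sketch behaves identically whether or not a real esgvoc is installed.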

Verification

  • esgvoc resolves historical -> "Historical simulation" and
    esm-hist -> "ESM historical simulation"; piControl -> None (checker skips).
  • Live end-to-end in conda/analysis3-26.06:
    CMIP6Vocabulary("Amon.tas", "historical", ...)._resolve_experiment_label()
    returns "Historical simulation".
  • A re-CMORised tas_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn file now carries
    :experiment = "Historical simulation", and compliance-checker --test wcrp_cmip6:1.0 reports no [ATTR007] experiment inconsistency.

@codecov

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.3%. Comparing base (34d58b9) to head (e91ee68).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main    #414     +/-   ##
=======================================
+ Coverage   75.1%   75.3%   +0.2%     
=======================================
  Files         28      28             
  Lines       5281    5298     +17     
  Branches     973     975      +2     
=======================================
+ Hits        3966    3988     +22     
+ Misses      1091    1089      -2     
+ Partials     224     221      -3     
Flag Coverage Δ
unit 75.3% <100.0%> (+0.2%) ⬆️


rhaegar325 requested a review from rbeucher on June 2, 2026 06:14
@rbeucher
Member

rbeucher commented Jun 3, 2026

Thanks — this explains why the current WCRP checker fails, but I think this should also be raised upstream. The value ACCESS-MOPPy currently writes for historical (all-forcing simulation of the recent past) matches the official CMIP6_CVs table and published CMIP6 files, while the esgvoc-formatted CMIP6 term currently exposes experiment = Historical simulation, which cc-plugin-wcrp then treats as the expected CMIP6 global attribute. That looks like an esgvoc/cc-plugin-wcrp inconsistency, not purely an ACCESS-MOPPy bug.

I’m okay with a local workaround if we need to pass the current checker, but I’d prefer we label it as temporary checker compatibility and open an upstream issue/PR. Also, can we make the helper mirror the checker’s lookup logic, including the drs_name fallback? Otherwise mixed-case IDs such as piControl may still diverge if the checker resolves them but ACCESS-MOPPy falls back to the bundled CV.

Could we add a note/test around the exact esgvoc version/data being targeted, and open an upstream issue asking whether CMIP6 experiment should match the legacy CMIP6_CVs value or the new esgvoc label?
