feat: Add CMIP7 variable mapping workflow with compound names #247

siligam · 2025-11-25T14:16:25Z

Summary

This PR adds a collaborative workflow for mapping CMIP7 variables to model-specific outputs: contributors fill in an Excel spreadsheet, and a script converts it to YAML for pycmor.

Key Features

Excel-based mapping: Pre-populated with 1,974 CMIP7 compound names from data request
Compound name identifiers: Uses full compound names (e.g., atmos.tas.tavg-h2m-hxy-u.day.GLB) as unique identifiers
Handles duplicates: 414 variables appear in multiple variants (different frequencies, regions, methods)
Color-coded columns: Gray (identifiers), Blue (CMIP7 metadata), Green (model mappings), Yellow (processing info)
Data validation: Dropdowns for status and priority fields
YAML conversion: Script to convert filled Excel to YAML for pycmor integration
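
To illustrate the "Handles duplicates" point, here is a minimal sketch (not the code in this PR) that groups compound names by variable and reports which variables occur in more than one variant. It assumes the variable name is the second dot-separated component of the compound name; the function names are hypothetical.

```python
from collections import defaultdict

# Sample compound names copied from this PR description. A real run would
# read all 1,974 entries from the data request instead.
compound_names = [
    "atmos.tas.tavg-h2m-hxy-u.day.GLB",
    "atmos.tas.tavg-h2m-hxy-u.mon.GLB",
    "atmos.tas.tmax-h2m-hxy-u.day.GLB",
    "atmos.tas.tmin-h2m-hxy-u.day.GLB",
]


def group_by_variable(names):
    """Group compound names by variable (assumed: second dotted component)."""
    groups = defaultdict(list)
    for name in names:
        variable = name.split(".")[1]
        groups[variable].append(name)
    return dict(groups)


def multi_variant_variables(names):
    """Return only the variables that appear under more than one compound name."""
    groups = group_by_variable(names)
    return {var: members for var, members in groups.items() if len(members) > 1}


duplicates = multi_variant_variables(compound_names)
for variable, members in duplicates.items():
    print(f"{variable}: {len(members)} variants")
```

Keying a mapping on the variable name alone would collapse all of these entries into one, which is why the full compound name serves as the identifier.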

Files Added

cmip7_variable_mapping.xlsx - Pre-populated Excel file (1,974 rows, 19 columns)
create_cmip7_variable_mapping.py - Script to generate Excel from data request
excel_to_yaml.py - Script to convert Excel to YAML
CMIP7_VARIABLE_MAPPING_README.md - Comprehensive documentation

Structure

Excel Columns (19 total)

Identifiers (3): compound_name, table, variable_id
CMIP7 Metadata (7): standard_name, long_name, units, frequency, modeling_realm, region, method_level_grid
Model Mappings (4): fesom, oifs, recom, lpj_guess
Processing Info (5): preprocess, formula, comment, status, priority
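
The column layout above could be captured in a small data structure, for example to check a sheet's header row. This is only a sketch of the grouping described in this PR; the names `COLUMN_GROUPS`, `all_columns`, and `group_of` are illustrative and not taken from the scripts.

```python
# Column groups and colors as listed in the PR description.
COLUMN_GROUPS = {
    "identifiers": {
        "color": "gray",
        "columns": ["compound_name", "table", "variable_id"],
    },
    "cmip7_metadata": {
        "color": "blue",
        "columns": [
            "standard_name", "long_name", "units", "frequency",
            "modeling_realm", "region", "method_level_grid",
        ],
    },
    "model_mappings": {
        "color": "green",
        "columns": ["fesom", "oifs", "recom", "lpj_guess"],
    },
    "processing_info": {
        "color": "yellow",
        "columns": ["preprocess", "formula", "comment", "status", "priority"],
    },
}


def all_columns(groups=COLUMN_GROUPS):
    """Flatten the groups into a single ordered list of column names."""
    return [col for group in groups.values() for col in group["columns"]]


def group_of(column, groups=COLUMN_GROUPS):
    """Return the name of the group a column belongs to."""
    for name, group in groups.items():
        if column in group["columns"]:
            return name
    raise KeyError(f"unknown column: {column}")


print(len(all_columns()))  # total number of columns across all groups
```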

Why Compound Names?

Example: tas (near-surface air temperature) has 20 variants:

atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)
atmos.tas.tmin-h2m-hxy-u.day.GLB (daily minimum)
... and 16 more

Each variant may require different preprocessing, so they need separate mappings.
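
As a rough illustration of how a compound name separates these variants, the sketch below splits one into its dot-separated parts. The field names are an assumption inferred from the examples and the column names in this PR (modeling_realm, variable_id, method_level_grid, frequency, region); `CompoundName` and `parse_compound_name` are hypothetical helpers, not part of the added scripts.

```python
from typing import NamedTuple


class CompoundName(NamedTuple):
    """Assumed layout: realm.variable.method_level_grid.frequency.region"""
    realm: str
    variable: str
    method_level_grid: str
    frequency: str
    region: str


def parse_compound_name(name: str) -> CompoundName:
    """Split a compound name into its five dot-separated components."""
    parts = name.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated parts, got {len(parts)}: {name!r}")
    return CompoundName(*parts)


daily_mean = parse_compound_name("atmos.tas.tavg-h2m-hxy-u.day.GLB")
daily_max = parse_compound_name("atmos.tas.tmax-h2m-hxy-u.day.GLB")

# Same variable and frequency, but a different method component,
# so the two entries can carry different preprocessing.
print(daily_mean.variable == daily_max.variable)
print(daily_mean.method_level_grid, daily_max.method_level_grid)
```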

Usage

Open cmip7_variable_mapping.xlsx
Fill in model-specific variable names (green columns)
Add preprocessing info (yellow columns)
Run python excel_to_yaml.py to generate YAML
Use YAML in pycmor configuration
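
The conversion step could look roughly like the sketch below, which turns spreadsheet rows into a nested mapping keyed by compound name. It is not the actual excel_to_yaml.py: the rows are assumed to have already been read into dictionaries, the output layout (a `models` sub-mapping plus processing fields) is a guess, and the cell values are placeholders.

```python
MODEL_COLUMNS = ("fesom", "oifs", "recom", "lpj_guess")
PROCESSING_COLUMNS = ("preprocess", "formula", "comment", "status", "priority")


def rows_to_mapping(rows):
    """Build {compound_name: {...}} from row dicts, skipping unmapped rows."""
    mapping = {}
    for row in rows:
        key = row.get("compound_name")
        if not key:
            continue  # ignore blank lines in the sheet
        if key in mapping:
            raise ValueError(f"duplicate compound name: {key}")
        # Keep only the model columns that a contributor has filled in.
        models = {m: row[m] for m in MODEL_COLUMNS if row.get(m)}
        if not models:
            continue  # nothing mapped yet for this variant
        entry = {"models": models}
        for col in PROCESSING_COLUMNS:
            if row.get(col):
                entry[col] = row[col]
        mapping[key] = entry
    return mapping


# Placeholder rows: the model variable name and comment are hypothetical.
rows = [
    {
        "compound_name": "atmos.tas.tavg-h2m-hxy-u.day.GLB",
        "oifs": "example_oifs_name",
        "comment": "placeholder entry",
    },
    {"compound_name": "atmos.tas.tavg-h2m-hxy-u.mon.GLB"},  # not yet mapped
]

mapping = rows_to_mapping(rows)
print(mapping)
```

If PyYAML is installed, `yaml.safe_dump(mapping, sort_keys=False)` would be one way to write the result out as YAML.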

Testing

✅ All linting checks pass (flake8, black, isort)
✅ Excel file generated successfully with 1,974 compound names
✅ Conversion script tested with sample data

- Create Excel file with 987 CMIP7 variables pre-populated from data request
- Add conversion script to generate YAML from Excel
- Include comprehensive README with usage instructions
- Excel has color-coded columns: blue (CMIP7 metadata), green (model mappings), yellow (processing info)
- Supports collaborative mapping for FESOM, OIFS, REcoM, LPJ-Guess models
- Includes dropdown validation for status and priority fields

- Update Excel to use compound_name as primary key (1,974 rows vs 987)
- Handle duplicate variable names across different contexts
  * 414 variables appear in multiple variants (e.g., tas has 20 variants)
  * Different frequencies (mon, day, yr, 6hr, etc.)
  * Different regions (GLB, ATA, GRL, NH, SH, etc.)
  * Different methods (tavg, tmax, tmin, tpt, etc.)
- Add columns: compound_name, table, region, method_level_grid
- Each variant can have different model mappings and preprocessing
- Update conversion script to use compound names as keys in YAML
- Addresses issue where variable_id alone was ambiguous

Example: tas variants now properly distinguished:
- atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
- atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
- atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)

- Remove unused Path import
- Fix f-strings without placeholders
- Sort imports according to isort/black profile
- Remove trailing whitespace
- All linting checks now pass (flake8, black, isort)

- Extract priority levels (Core, High, Medium, Low) from dreq_v1.2.2.2.json
- Add dreq_priority column showing CMIP7 Data Request priorities per compound name
- Remove empty long_name column (reduced from 20 to 19 columns)
- Remove Excel table/filter formatting for better compatibility with Numbers on Mac
- Keep simple frozen panes (header row and first 3 columns)
- Maintain color-coded columns and dropdown validation
- Total: 1,974 compound names with priority distribution:
  * High: 1,038 variables
  * Medium: 469 variables
  * Core: 131 variables
  * Low: 112 variables

- Add dreq_v1.2.2.2.json and dreq_v1.2.2.2_metadata.json (required by create script)
- Update README with:
  - Information about compound names (1,974 entries covering 987 unique variables)
  - CMIP7 Data Request priority levels (Core, High, Medium, Low)
  - Updated column structure reflecting actual Excel columns
  - Instructions for fetching data request files using CMIP7-data-request-api
  - Clarification that JSON files are required by create_cmip7_variable_mapping.py

pgierz · 2025-11-26T07:58:04Z

Why did you choose to include this using Excel instead of something more universal like plain text?

siligam · 2025-12-03T20:10:07Z

Why did you choose to include this using Excel instead of something more universal like plain text?

Based on the conversation in the Pycmor channel (the old SEAMORE channel), I was under the impression that more than one person would be contributing to fill in the data for FESOM/OIFS/REcoM/LPJ-Guess. An online service like Google Sheets, Airtable, or maybe HedgeDoc would be better suited for gathering contributions easily. As that takes a little more effort to set up, I settled for Excel as an example. CSV, JSON, or plain text were my initial thought, but I quickly got carried away with the online-service idea.

siligam and others added 7 commits November 25, 2025 14:24

chore: Regenerate Excel file with compound name structure

70f464e

style: Apply black, isort, and flake8 formatting

b703883

- Remove unused Path import
- Fix f-strings without placeholders
- Sort imports according to isort/black profile
- Remove trailing whitespace
- All linting checks now pass (flake8, black, isort)

Add CMIP7 variable mapping file for SPINUP experiment.

7abc54c

pgierz changed the base branch from main to prep-release November 26, 2025 08:15

Delete file.

0ca28e5
