@siligam
Contributor

@siligam siligam commented Nov 25, 2025

Summary

This PR adds a collaborative workflow for mapping CMIP7 variables to model-specific outputs using Excel and YAML conversion.

Key Features

  • Excel-based mapping: Pre-populated with 1,974 CMIP7 compound names from the data request
  • Compound name identifiers: Uses full compound names (e.g., atmos.tas.tavg-h2m-hxy-u.day.GLB) as unique identifiers
  • Handles duplicates: 414 variables appear in multiple variants (different frequencies, regions, methods)
  • Color-coded columns: Gray (identifiers), Blue (CMIP7 metadata), Green (model mappings), Yellow (processing info)
  • Data validation: Dropdowns for status and priority fields
  • YAML conversion: Script to convert filled Excel to YAML for pycmor integration

Files Added

  • cmip7_variable_mapping.xlsx - Pre-populated Excel file (1,974 rows, 19 columns)
  • create_cmip7_variable_mapping.py - Script to generate Excel from data request
  • excel_to_yaml.py - Script to convert Excel to YAML
  • CMIP7_VARIABLE_MAPPING_README.md - Comprehensive documentation

Structure

Excel Columns (19 total)

Identifiers (3): compound_name, table, variable_id
CMIP7 Metadata (7): standard_name, long_name, units, frequency, modeling_realm, region, method_level_grid
Model Mappings (4): fesom, oifs, recom, lpj_guess
Processing Info (5): preprocess, formula, comment, status, priority
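
As a quick cross-check of the layout above, the column groups can be written down as a plain data structure. This is only an illustrative sketch: the names `COLUMN_GROUPS`, `GROUP_COLORS`, and `all_columns` are hypothetical and are not taken from `create_cmip7_variable_mapping.py`; the column names and colors are copied from this description.

```python
# Hypothetical sketch of the column layout described in this PR.
# The dictionary and function names are illustrative assumptions.
COLUMN_GROUPS = {
    "identifiers": ["compound_name", "table", "variable_id"],
    "cmip7_metadata": [
        "standard_name",
        "long_name",
        "units",
        "frequency",
        "modeling_realm",
        "region",
        "method_level_grid",
    ],
    "model_mappings": ["fesom", "oifs", "recom", "lpj_guess"],
    "processing_info": ["preprocess", "formula", "comment", "status", "priority"],
}

# Header colors per group, as listed under "Key Features".
GROUP_COLORS = {
    "identifiers": "gray",
    "cmip7_metadata": "blue",
    "model_mappings": "green",
    "processing_info": "yellow",
}


def all_columns():
    """Return the flat, ordered list of column names across all groups."""
    return [col for cols in COLUMN_GROUPS.values() for col in cols]


if __name__ == "__main__":
    for group, cols in COLUMN_GROUPS.items():
        print(f"{group} ({GROUP_COLORS[group]}, {len(cols)}): {', '.join(cols)}")
    print(f"total columns: {len(all_columns())}")
```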

Why Compound Names?

Example: tas (near-surface air temperature) has 20 variants:

  • atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
  • atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
  • atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)
  • atmos.tas.tmin-h2m-hxy-u.day.GLB (daily minimum)
  • ... and 16 more

Each variant may require different preprocessing, so they need separate mappings.
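
To make the disambiguation concrete, here is a minimal sketch that splits compound names into parts and groups them by variable. It assumes, based only on the examples above, that a compound name has five dot-separated fields; the field names are borrowed from the Excel column names, and `CompoundName`, `parse_compound_name`, and `group_variants` are hypothetical helpers, not part of this PR.

```python
from collections import defaultdict
from typing import NamedTuple


class CompoundName(NamedTuple):
    """Parts of a compound name (field names are an assumption)."""

    realm: str
    variable_id: str
    method_level_grid: str
    frequency: str
    region: str


def parse_compound_name(name: str) -> CompoundName:
    """Split a compound name such as 'atmos.tas.tavg-h2m-hxy-u.day.GLB'."""
    parts = name.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {len(parts)}: {name!r}")
    return CompoundName(*parts)


def group_variants(names):
    """Group compound names by their variable_id."""
    groups = defaultdict(list)
    for name in names:
        groups[parse_compound_name(name).variable_id].append(name)
    return dict(groups)


# The four tas variants listed above.
EXAMPLES = [
    "atmos.tas.tavg-h2m-hxy-u.day.GLB",
    "atmos.tas.tavg-h2m-hxy-u.mon.GLB",
    "atmos.tas.tmax-h2m-hxy-u.day.GLB",
    "atmos.tas.tmin-h2m-hxy-u.day.GLB",
]

if __name__ == "__main__":
    for variable_id, variants in group_variants(EXAMPLES).items():
        print(variable_id, len(variants))
```

Grouping this way shows why `variable_id` alone cannot serve as a key: every entry in `EXAMPLES` shares the same `variable_id` but differs in method or frequency.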

Usage

  1. Open cmip7_variable_mapping.xlsx
  2. Fill in model-specific variable names (green columns)
  3. Add preprocessing info (yellow columns)
  4. Run python excel_to_yaml.py to generate YAML
  5. Use YAML in pycmor configuration
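
For readers who want a feel for step 4, the sketch below shows one way rows could be turned into a YAML document keyed by compound name. It is not the code in `excel_to_yaml.py`: it uses only the standard library, takes rows as plain dictionaries rather than reading the workbook, and the helper names, the output layout, and the example values (`example_var`, `mapped`) are all assumptions for illustration.

```python
import json

# Column names taken from the PR description.
MODEL_COLUMNS = ("fesom", "oifs", "recom", "lpj_guess")
PROCESSING_COLUMNS = ("preprocess", "formula", "comment", "status", "priority")


def rows_to_mapping(rows):
    """Build {compound_name: {column: value}} and skip rows with nothing filled in."""
    mapping = {}
    for row in rows:
        key = row["compound_name"]
        if key in mapping:
            raise ValueError(f"duplicate compound_name: {key}")
        entry = {}
        for col in MODEL_COLUMNS + PROCESSING_COLUMNS:
            value = row.get(col)
            if value not in (None, ""):
                entry[col] = value
        if entry:
            mapping[key] = entry
    return mapping


def to_yaml(mapping):
    """Emit a two-level YAML document; values are written as quoted strings."""
    lines = []
    for key, entry in mapping.items():
        lines.append(f"{key}:")
        for col, value in entry.items():
            # A JSON string literal is also a valid YAML double-quoted scalar.
            lines.append(f"  {col}: {json.dumps(str(value))}")
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Placeholder rows; "example_var" and "mapped" are made-up values.
    rows = [
        {
            "compound_name": "atmos.tas.tavg-h2m-hxy-u.day.GLB",
            "oifs": "example_var",
            "status": "mapped",
        },
        {"compound_name": "atmos.tas.tavg-h2m-hxy-u.mon.GLB", "oifs": ""},
    ]
    print(to_yaml(rows_to_mapping(rows)), end="")
```

Raising on a repeated `compound_name` reflects the point of this PR: the compound name is the primary key, so a duplicate most likely means a copy-paste error in the sheet.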

Testing

  • ✅ All linting checks pass (flake8, black, isort)
  • ✅ Excel file generated successfully with 1,974 compound names
  • ✅ Conversion script tested with sample data

Related

Addresses the need for collaborative CMIP7 variable mapping discussed in internal documentation.

siligam and others added 7 commits November 25, 2025 14:24
- Create Excel file with 987 CMIP7 variables pre-populated from data request
- Add conversion script to generate YAML from Excel
- Include comprehensive README with usage instructions
- Excel has color-coded columns: blue (CMIP7 metadata), green (model mappings), yellow (processing info)
- Supports collaborative mapping for FESOM, OIFS, REcoM, LPJ-Guess models
- Includes dropdown validation for status and priority fields
- Update Excel to use compound_name as primary key (1,974 rows vs 987)
- Handle duplicate variable names across different contexts
  * 414 variables appear in multiple variants (e.g., tas has 20 variants)
  * Different frequencies (mon, day, yr, 6hr, etc.)
  * Different regions (GLB, ATA, GRL, NH, SH, etc.)
  * Different methods (tavg, tmax, tmin, tpt, etc.)
- Add columns: compound_name, table, region, method_level_grid
- Each variant can have different model mappings and preprocessing
- Update conversion script to use compound names as keys in YAML
- Addresses issue where variable_id alone was ambiguous

Example: tas variants now properly distinguished:
  - atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
  - atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
  - atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)
- Remove unused Path import
- Fix f-strings without placeholders
- Sort imports according to isort/black profile
- Remove trailing whitespace
- All linting checks now pass (flake8, black, isort)
- Extract priority levels (Core, High, Medium, Low) from dreq_v1.2.2.2.json
- Add dreq_priority column showing CMIP7 Data Request priorities per compound name
- Remove empty long_name column (reduced from 20 to 19 columns)
- Remove Excel table/filter formatting for better compatibility with Numbers on Mac
- Keep simple frozen panes (header row and first 3 columns)
- Maintain color-coded columns and dropdown validation
- Total: 1,974 compound names with priority distribution:
  * High: 1,038 variables
  * Medium: 469 variables
  * Core: 131 variables
  * Low: 112 variables
- Add dreq_v1.2.2.2.json and dreq_v1.2.2.2_metadata.json (required by create script)
- Update README with:
  - Information about compound names (1,974 entries covering 987 unique variables)
  - CMIP7 Data Request priority levels (Core, High, Medium, Low)
  - Updated column structure reflecting actual Excel columns
  - Instructions for fetching data request files using CMIP7-data-request-api
  - Clarification that JSON files are required by create_cmip7_variable_mapping.py
@pgierz
Member

pgierz commented Nov 26, 2025

Why did you choose to include this using excel instead of something more universal like plain text?

@pgierz pgierz changed the base branch from main to prep-release November 26, 2025 08:15
@siligam
Contributor Author

siligam commented Dec 3, 2025

> Why did you choose to include this using excel instead of something more universal like plain text?

Based on the conversation on Pycmor (old SEAMORE channel), I was under the impression that more than one person could be contributing to fill in the data for fesom/oifs/recom/lpj_guess. An online service like Google Sheets or Airtable, or maybe HedgeDoc, would be more suitable for gathering contributions easily. As setting that up takes a little more effort, I settled for Excel as an exemplar. CSV, JSON, and plain text were my initial thought, but I quickly got carried away with the online service idea.
