feat: Add CMIP7 variable mapping workflow with compound names #247
base: prep-release
- Create Excel file with 987 CMIP7 variables pre-populated from data request
- Add conversion script to generate YAML from Excel
- Include comprehensive README with usage instructions
- Excel has color-coded columns: blue (CMIP7 metadata), green (model mappings), yellow (processing info)
- Supports collaborative mapping for FESOM, OIFS, REcoM, LPJ-Guess models
- Includes dropdown validation for status and priority fields

- Update Excel to use compound_name as primary key (1,974 rows vs 987)
- Handle duplicate variable names across different contexts
  * 414 variables appear in multiple variants (e.g., tas has 20 variants)
  * Different frequencies (mon, day, yr, 6hr, etc.)
  * Different regions (GLB, ATA, GRL, NH, SH, etc.)
  * Different methods (tavg, tmax, tmin, tpt, etc.)
- Add columns: compound_name, table, region, method_level_grid
- Each variant can have different model mappings and preprocessing
- Update conversion script to use compound names as keys in YAML
- Addresses issue where variable_id alone was ambiguous

Example: tas variants now properly distinguished:
- atmos.tas.tavg-h2m-hxy-u.day.GLB (daily mean)
- atmos.tas.tavg-h2m-hxy-u.mon.GLB (monthly mean)
- atmos.tas.tmax-h2m-hxy-u.day.GLB (daily maximum)

- Remove unused Path import
- Fix f-strings without placeholders
- Sort imports according to isort/black profile
- Remove trailing whitespace
- All linting checks now pass (flake8, black, isort)

- Extract priority levels (Core, High, Medium, Low) from dreq_v1.2.2.2.json
- Add dreq_priority column showing CMIP7 Data Request priorities per compound name
- Remove empty long_name column (reduced from 20 to 19 columns)
- Remove Excel table/filter formatting for better compatibility with Numbers on Mac
- Keep simple frozen panes (header row and first 3 columns)
- Maintain color-coded columns and dropdown validation
- Total: 1,974 compound names with priority distribution:
  * High: 1,038 variables
  * Medium: 469 variables
  * Core: 131 variables
  * Low: 112 variables

- Add dreq_v1.2.2.2.json and dreq_v1.2.2.2_metadata.json (required by create script)
- Update README with:
  - Information about compound names (1,974 entries covering 987 unique variables)
  - CMIP7 Data Request priority levels (Core, High, Medium, Low)
  - Updated column structure reflecting actual Excel columns
  - Instructions for fetching data request files using CMIP7-data-request-api
  - Clarification that JSON files are required by create_cmip7_variable_mapping.py
Why did you choose to include this using Excel instead of something more universal like plain text?
Based on the conversation on Pycmor (old SEAMORE channel), I was under the impression that more than one person could be contributing mappings for fesom/oifs/recom/lpj. An online service like Google Sheets, Airtable, or maybe HedgeDoc would be more ideal for gathering contributions easily. As that takes a little more effort to set up, I settled for Excel as an exemplar. CSV, JSON, and plain text were my initial thoughts, but I quickly got carried away with the online service idea.
Summary
This PR adds a collaborative workflow for mapping CMIP7 variables to model-specific outputs using Excel and YAML conversion.
Key Features
- Uses compound names (e.g. `atmos.tas.tavg-h2m-hxy-u.day.GLB`) as unique identifiers

Files Added
- `cmip7_variable_mapping.xlsx` - Pre-populated Excel file (1,974 rows, 19 columns)
- `create_cmip7_variable_mapping.py` - Script to generate Excel from data request
- `excel_to_yaml.py` - Script to convert Excel to YAML
- `CMIP7_VARIABLE_MAPPING_README.md` - Comprehensive documentation

Structure
Excel Columns (19 total)
Identifiers (3): compound_name, table, variable_id
CMIP7 Metadata (7): standard_name, long_name, units, frequency, modeling_realm, region, method_level_grid
Model Mappings (4): fesom, oifs, recom, lpj_guess
Processing Info (5): preprocess, formula, comment, status, priority
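To illustrate how rows with these columns could end up keyed by compound name, here is a minimal sketch. It is not the code in `excel_to_yaml.py`: the function name `rows_to_mapping`, the nested `models` key, and the example cell values are assumptions made for illustration, and rows are taken as plain dicts rather than read from the Excel file.

```python
# Column groups as listed in the PR description; how the real script
# lays them out in YAML is an assumption of this sketch.
MODEL_COLUMNS = ("fesom", "oifs", "recom", "lpj_guess")
PROCESSING_COLUMNS = ("preprocess", "formula", "comment", "status", "priority")


def rows_to_mapping(rows):
    """Build a {compound_name: entry} dict from spreadsheet rows given as dicts."""
    mapping = {}
    for row in rows:
        key = row["compound_name"]
        if key in mapping:
            # compound_name is the primary key, so duplicates are an error.
            raise ValueError(f"duplicate compound_name: {key}")
        # Keep only non-empty cells so unmapped models do not clutter the output.
        models = {m: row[m] for m in MODEL_COLUMNS if row.get(m)}
        processing = {p: row[p] for p in PROCESSING_COLUMNS if row.get(p)}
        mapping[key] = {
            "variable_id": row["variable_id"],
            "table": row["table"],
            "models": models,
            **processing,
        }
    return mapping


example_rows = [
    {
        "compound_name": "atmos.tas.tavg-h2m-hxy-u.mon.GLB",
        "table": "atmos",      # hypothetical table value
        "variable_id": "tas",
        "oifs": "2t",          # hypothetical model variable name
        "fesom": None,         # empty cell
        "status": "mapped",    # hypothetical status value
    },
]
mapping = rows_to_mapping(example_rows)
```

The resulting nested dict could then be handed to a YAML library for serialization; skipping empty cells keeps entries short while most variables are still unmapped.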
Why Compound Names?
Example:
`tas` (temperature) has 20 variants:
- `atmos.tas.tavg-h2m-hxy-u.day.GLB` (daily mean)
- `atmos.tas.tavg-h2m-hxy-u.mon.GLB` (monthly mean)
- `atmos.tas.tmax-h2m-hxy-u.day.GLB` (daily maximum)
- `atmos.tas.tmin-h2m-hxy-u.day.GLB` (daily minimum)

Each variant may require different preprocessing, so they need separate mappings.
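The ambiguity of `variable_id` can be made concrete with a short sketch that splits compound names into their parts and groups them by variable. The five-field order (realm, variable, method/level/grid, frequency, region) is inferred from the examples in this PR rather than from a specification, and the helper names `parse_compound_name` and `group_variants` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class CompoundName(NamedTuple):
    # Field order inferred from the examples in this PR (an assumption).
    realm: str
    variable_id: str
    method_level_grid: str
    frequency: str
    region: str


def parse_compound_name(name):
    """Split a dot-separated compound name into its five components."""
    parts = name.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {len(parts)}: {name}")
    return CompoundName(*parts)


def group_variants(names):
    """Group compound names by variable_id to show which ids are ambiguous."""
    variants = defaultdict(list)
    for name in names:
        variants[parse_compound_name(name).variable_id].append(name)
    return dict(variants)


names = [
    "atmos.tas.tavg-h2m-hxy-u.day.GLB",
    "atmos.tas.tavg-h2m-hxy-u.mon.GLB",
    "atmos.tas.tmax-h2m-hxy-u.day.GLB",
    "atmos.tas.tmin-h2m-hxy-u.day.GLB",
]
variants = group_variants(names)
```

Here all four names collapse onto the single key `tas`, which is why a mapping keyed on `variable_id` alone cannot hold different preprocessing per variant.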
Usage
- Edit `cmip7_variable_mapping.xlsx`
- Run `python excel_to_yaml.py` to generate YAML

Testing
Related
Addresses the need for collaborative CMIP7 variable mapping discussed in internal documentation.