Coarse resolution grid for fast testing #641

@kdraeder

Description

This issue stems from cime #4933, which is about developing a large ensemble test
motivated by DART applications.

Because of the large ensemble, the testing will be more manageable
if it uses a coarse resolution grid. An ne3 grid is available for CAM and CTSM,
and now a ~10-degree resolution is available in MOM6 (MOM_interface #311).
These have been combined into a new CESM grid and used in ERI and MCC tests,
which also use a new testmod tailored to DART needs.
I'm open to suggestions for a shorter testmod name,
but @billsacks and I feel that it will be helpful to have DART in it.
This grid (especially the MOM6 grid) limits the tasks/instance to 12
(6 for MOM, 6 for the other components).

The CESM version I'm using is copied from what @alperaltuntas used
for developing the coarse resolution MOM6 grid (cesm3_0_alpha08d):
/glade/work/raeder/Models/cesm3_0_alpha08d_mar13

An MCC test for a small ensemble passes all test stages
(/glade/work/raeder/Exp/CESM+DART_testing/MCC_cG.ne3pg3_10deg.B_DART.lowres)
but ensembles which require more than 1 node mostly fail
with an error in cmeps/cesm/driver/ensemble_driver.F90.
This seems to arise because smaller ensembles fit on a single (develop queue) node,
where the job is assigned exactly the number of processors it needs,
while larger ensembles span multiple (cpu/main) nodes
and the job is assigned more processors than it needs.
For example, 40 instances request 12 x 40 = 480 processors.
This means that 4 nodes x 128 = 512 processors are assigned.
This difference causes an error:
PetCount ( 512) - Async IOtasks ( 0) must be evenly divisable by number of members ( 40).

When the check for this error is commented out, the job goes farther,
but hangs just before the time stepping in CAM.
One final data point is that a 32 instance ensemble fits exactly into 3 nodes
and it does not fail in ensemble_driver.F90, but it also hangs later.
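The node-allocation arithmetic behind the error above can be sketched as follows. This is an illustrative Python model, not the actual Fortran check in ensemble_driver.F90; the function and constant names are hypothetical, but the numbers (128 cores/node, 12 tasks/instance) come from the cases described here.

```python
CORES_PER_NODE = 128      # cpu/main queue nodes (illustrative constant name)
TASKS_PER_INSTANCE = 12   # limited by the ~10-degree MOM6 grid

def pet_count(n_instances, cores_per_node=CORES_PER_NODE):
    """Processors actually assigned: whole nodes, rounded up."""
    needed = n_instances * TASKS_PER_INSTANCE
    nodes = -(-needed // cores_per_node)  # ceiling division
    return nodes * cores_per_node

def passes_check(n_instances, async_io_tasks=0):
    """Sketch of the failing check: (PetCount - IOtasks) % members == 0."""
    return (pet_count(n_instances) - async_io_tasks) % n_instances == 0

# 40 instances: need 480, get 4 nodes x 128 = 512; 512 % 40 != 0 -> check fails
# 32 instances: need 384, get 3 nodes x 128 = 384; 384 % 32 == 0 -> check passes
```

This reproduces both observations: 40 instances trigger the divisibility error because the whole-node allocation (512) exceeds the requested 480, while 32 instances fit 3 nodes exactly and pass the check (but still hang later).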
This should maybe be a separate issue, or at least a separate PR.

These may not be issues for standard-resolution grids, which require one or more whole nodes per instance.

The hanging happens between the reading of
fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
in CAM and the start of the time stepping.
All of the processors on one of the cpu nodes were used at 100%, but nothing was happening in the $RUNDIR.
This could be related to @alperaltuntas's comment in MOM_interface #314:
"Instead of modifying RUN_STARTDATE (which causes slow stream reads)"

I can't assign people or labels, so please choose some.
