Coarse resolution grid for fast testing #641
This issue stems from cime #4933, which is about developing a large-ensemble test
motivated by DART applications.
Because of the large ensemble, the testing will be more manageable
if it uses a coarse-resolution grid. An ne3 grid is available for CAM and CTSM,
and a ~10-degree resolution is now available in MOM6 (MOM_interface #311).
These have been combined into a new CESM grid and used in ERI and MCC tests,
which also use a new testmod tailored to DART needs.
I'm open to suggestions for a shorter testmod name,
but @billsacks and I feel that it will be helpful to have DART in it.
This grid (especially the MOM6 grid) limits the tasks/instance to 12
(6 for MOM, 6 for the other components).
The CESM version I'm using is copied from what @alperaltuntas used
for developing the coarse resolution MOM6 grid (cesm3_0_alpha08d):
/glade/work/raeder/Models/cesm3_0_alpha08d_mar13
An MCC test for a small ensemble passes all test stages
(/glade/work/raeder/Exp/CESM+DART_testing/MCC_cG.ne3pg3_10deg.B_DART.lowres)
but ensembles that require more than one node mostly fail
with an error in cmeps/cesm/driver/ensemble_driver.F90.
This seems to arise because smaller ensembles fit into a single (develop queue) node,
where exactly the number of processors needed is assigned to them,
while larger ensembles need multiple (cpu/main) nodes,
so more processors are assigned to the job than are needed.
For example, 40 instances request 12 x 40 = 480 processors,
which means that 4 whole nodes x 128 = 512 processors are assigned.
The difference causes an error:
PetCount ( 512) - Async IOtasks ( 0) must be evenly divisable by number of members ( 40).
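The arithmetic behind that error can be sketched as follows. This is not the actual CMEPS code; the function name and signature are illustrative, reconstructed only from the error message above:

```python
# Hypothetical sketch of the divisibility check that ensemble_driver.F90
# appears to perform: the compute PETs (total PETs minus async I/O tasks)
# must divide evenly among the ensemble members.

def pets_divide_evenly(pet_count: int, async_io_tasks: int, n_members: int) -> bool:
    """Return True if (pet_count - async_io_tasks) splits evenly across members."""
    return (pet_count - async_io_tasks) % n_members == 0

# 40 instances x 12 tasks = 480 PETs requested, but whole nodes are assigned:
# 4 nodes x 128 = 512 PETs, and 512 is not divisible by 40 -> the reported error.
print(pets_divide_evenly(512, 0, 40))  # False

# 32 instances x 12 tasks = 384 PETs = exactly 3 nodes x 128,
# so the check passes (though the job still hangs later).
print(pets_divide_evenly(384, 0, 32))  # True
```

This illustrates why the failure only shows up once the job spills onto multiple whole nodes: on a single develop-queue node the assigned count equals the requested count, so the remainder is always zero.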
When the check for this error is commented out, the job goes farther,
but hangs just before the time stepping in CAM.
One final data point: a 32-instance ensemble fits exactly into 3 nodes,
so it does not fail in ensemble_driver.F90, but it also hangs later.
The hang should maybe be a separate issue, or at least a separate PR.
These may not be problems for standard-resolution grids, which require one or more whole nodes per instance.
The hang happens between CAM's reading of
fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
and the start of the time stepping.
All of the processors on one of the cpu nodes were running at 100%, but nothing was happening in the $RUNDIR.
This could be related to @alperaltuntas's comment in MOM_interface #314:
"Instead of modifying RUN_STARTDATE (which causes slow stream reads)"
I can't assign people or labels, so please choose some.