Coarse resolution grid for fast testing #641
This issue stems from cime #4933, which is about developing a large-ensemble test
motivated by DART applications.
Because of the large ensemble, the testing will be more manageable
if it uses a coarse-resolution grid. An ne3 grid is available for CAM and CTSM,
and a ~10-degree resolution is now available in MOM6 (MOM_interface #311).
These have been combined into a new CESM grid and used in ERI and MCC tests,
which also use a new testmod tailored to DART needs.
I'm open to suggestions for a shorter testmod name,
but @billsacks and I feel that it will be helpful to have DART in it.
This grid (especially the MOM6 grid) limits the tasks/instance to 12
(6 for MOM, 6 for the other components).
The CESM version I'm using is copied from what @alperaltuntas used
for developing the coarse resolution MOM6 grid (cesm3_0_alpha08d):
/glade/work/raeder/Models/cesm3_0_alpha08d_mar13
An MCC test for a small ensemble passes all test stages
(/glade/work/raeder/Exp/CESM+DART_testing/MCC_cG.ne3pg3_10deg.B_DART.lowres)
but ensembles that require more than one node mostly fail
with an error in cmeps/cesm/driver/ensemble_driver.F90.
This seems to arise because smaller ensembles fit into a single (develop queue) node,
where exactly the number of processors needed is assigned to them,
while larger ensembles need multiple (cpu/main) nodes,
so more processors are assigned to the job than are needed.
For example, 40 instances request 12 x 40 = 480 processors,
which means that 4 whole nodes x 128 = 512 processors are assigned.
The difference causes an error:
PetCount ( 512) - Async IOtasks ( 0) must be evenly divisable by number of members ( 40).
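The arithmetic behind that error can be sketched as follows. This is not the actual CMEPS code; the function name and signature are illustrative, reconstructed only from the error message above:

```python
# Hypothetical sketch of the divisibility check that ensemble_driver.F90
# appears to perform: the compute PETs (total PETs minus async I/O tasks)
# must divide evenly among the ensemble members.

def pets_divide_evenly(pet_count: int, async_io_tasks: int, n_members: int) -> bool:
    """Return True if (pet_count - async_io_tasks) splits evenly across members."""
    return (pet_count - async_io_tasks) % n_members == 0

# 40 instances x 12 tasks = 480 PETs requested, but whole nodes are assigned:
# 4 nodes x 128 = 512 PETs, and 512 is not divisible by 40 -> the reported error.
print(pets_divide_evenly(512, 0, 40))  # False

# 32 instances x 12 tasks = 384 PETs = exactly 3 nodes x 128,
# so the check passes (though the job still hangs later).
print(pets_divide_evenly(384, 0, 32))  # True
```

This illustrates why the failure only shows up once the job spills onto multiple whole nodes: on a single develop-queue node the assigned count equals the requested count, so the remainder is always zero.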
When the check for this error is commented out, the job goes farther,
but hangs just before the time stepping in CAM.
One final data point: a 32-instance ensemble fits exactly into 3 nodes,
so it does not fail in ensemble_driver.F90, but it also hangs later.
The hang should maybe be a separate issue, or at least a separate PR.
These may not be problems for standard-resolution grids, which require one or more whole nodes per instance.
The hang happens between CAM's reading of
fndep_clm_hist_b.e21.BWHIST.f09_g17.CMIP6-historical-WACCM.ensmean_1849-2015_monthly_0.9x1.25_c180926.nc
and the start of the time stepping.
All of the processors on one of the cpu nodes were running at 100%, but nothing was happening in the $RUNDIR.
This could be related to @alperaltuntas's comment in MOM_interface #314:
"Instead of modifying RUN_STARTDATE (which causes slow stream reads)"
I can't assign people or labels, so please choose some.