Skip to content

OpenACC directives for SoilBiogeochemCompetition#4027

Draft
samsrabin wants to merge 44 commits into
ESCOMP:nvidia-hackathon-26from
samsrabin:soilbiogeochemcompetition-openacc
Draft

OpenACC directives for SoilBiogeochemCompetition#4027
samsrabin wants to merge 44 commits into
ESCOMP:nvidia-hackathon-26from
samsrabin:soilbiogeochemcompetition-openacc

Conversation

@samsrabin
Copy link
Copy Markdown
Member

@samsrabin samsrabin commented May 18, 2026

Description of changes

This PR contains the changes I (well, mostly Claude, at my instruction) made in an isolated reproducer of SoilBiogeochemCompetition to enable OpenACC offloading of work to GPU. This isolated reproducer is designed to enable easy testing and iteration, but meaningful performance testing should be carried out in actual model runs.

The changes in this PR are not likely to result in much speedup unless running on GPU, and even then, it's unclear how much impact they'll make. However, I would like to get this merged into the main code to serve as a "living example" of how we can start transitioning more of the codebase to be GPU-offloadable.

Misc. notes:

  • So far, OpenACC directives have only been added for the most common code route: use_nitrif_denitrif = .true., no MIMICS, no FATES.
  • The code was first refactored so that the scientific bits are in their own subroutines and thus isolated from the messy directives and whatnot.

Remaining work

  • Bring changes in to the actual SoilBiogeochemCompetitionMod, not my reproducer.
  • Demonstrate no diffs on CPU (aux_clm)
  • Add an aux_clm test that exercises the GPU offloading.
  • Document performance impact, comparing CPU to GPU

Specific notes

Contributors other than yourself, if any:

CTSM issues resolved or otherwise addressed, if any:

  • None

Any user interface changes (namelist or namelist defaults changes)? No

Testing planned or performed, if any:

  • Manual testing using scripts in perf_testing/SoilBiogeochemCompetition/
  • aux_clm to show no diffs

Requirements before merge:

  • The code in this PR branch builds with no errors.
  • The code in this PR branch runs with no errors. Briefly describe tested configuration(s):
  • This either (a) does not change answers, (b) it only changes answers at roundoff level, or (c) I have performed a scientific evaluation of the answer changes. Which?:
  • I have reviewed relevant parts of the CLM documentation Tech Note or User's Guide to determine if anything needs to be changed or added. If it does, describe:
  • This PR either (a) does not create a need to update the documentation or (b) includes required documentation updates (see guidelines for contributing documentation). Which?:

samsrabin and others added 30 commits May 5, 2026 11:54
Driver default now runs all 8 combinations of (use_nitrif_denitrif,
carbon_only, decomp_method == mimics_decomp) per iteration and sums
the per-config checksums into one combined value. The committed
baseline_checksum.txt is regenerated against this --all mode, so a
single MATCH covers every top-level branch in the routine.

Add --fast flag (the previous behavior): runs just the canonical
config (use_nitrif_denitrif=.true., carbon_only=.false.,
decomp_method=mimics_decomp). Use it for tight perf-iteration loops.
The --fast checksum (9.5970435393765438E+04) matches the
single-config baseline that was committed in 41c9127 — strong
sanity check that the routine's behavior on the canonical path is
unchanged.

Implementation:
  - parse_args() now accepts --fast in any position; positional args
    drop use_nitrif_denitrif and carbon_only (those are now
    per-config, not driver-level).
  - fill_inputs() split into fill_inputs_once() (synthetic inputs,
    one-shot) and zero_outputs() (called before every routine call,
    since the routine accumulates into actual_immob /
    potential_immob in the non-nitrif branch).
  - run_config(unitrif, conly, dmethod, partial_cs) wraps the call
    site so the main flow can dispatch over configs cleanly.
  - last_run.txt / baseline_checksum.txt fingerprint now records
    'mode' (all|fast) + sizes + niters; per-config flags are no
    longer part of the fingerprint.

Verified end-to-end: ./driver MATCHes new baseline; ./driver --fast
produces the previous single-config checksum (skipping baseline
compare since the fingerprint differs); both TIMING=1 and TIMING=0
builds work; per-call results are independent of niters and config
order (re-zeroing outputs each call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip --fast from (use_nitrif_denitrif=.true., carbon_only=.false.,
MIMICS on) to (..., MIMICS off) so the canonical config exercises
the use_nitrif_denitrif=.true. branch without the MIMICS overflow
loop. Have compare_to_baseline pick its file based on mode:
--fast reads baseline_checksum_fast.txt, --all keeps reading
baseline_checksum.txt. The --all baseline is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference checksum for the new canonical --fast config
(use_nitrif_denitrif=.true., carbon_only=.false., MIMICS off).
The driver looks for this file when run with --fast and reports
MATCH/MISMATCH against it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflect two recent changes: --fast now runs (use_nitrif_denitrif=.true.,
carbon_only=.false., MIMICS off), and the driver compares against
either baseline_checksum.txt or baseline_checksum_fast.txt depending
on mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce non_mimics_decomp = 1 as a sibling of mimics_decomp = 2
in the driver and use it at every call site that previously did
mimics_decomp - 1 (5 sites: 1 in --fast, 4 in --all). Removes
arithmetic at call sites; behavior unchanged (both baselines
still MATCH).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sentence-case loop section headers and add banner comments around
the "second pass" residual uptake block in the use_nitrif_denitrif
=.false. branch. No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds the driver and runs both --fast and --all, reporting just
the checksum / baseline / elapsed lines per mode. Passes caller
args through to make (e.g. ./verify.sh EXTRA_FFLAGS="-acc=gpu"),
fails loudly on build error. Eliminates retyping the verification
pipeline across staged refactor commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New module perf_testing/perf_timers_mod.F90 exposes
perf_timer_start/stop/print/dump_csv/reset wall-clock timers backed
by Fortran system_clock. Gated by cpp macro INNER_TIMING; when
undefined every public routine is empty so the science code carries
zero overhead. Up to 64 labels, auto-created on first start.

Makefile.common recognizes INNER_TIMING=1 (independent of the
existing TIMING knob). The SoilBiogeochemCompetition Makefile picks
up the parent-dir module via VPATH and lists it in OBJ. No callers
yet — both verify.sh runs (with and without INNER_TIMING=1) still
produce identical checksums.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add perf_timer_start/stop calls from perf_timers_mod around each of
the 9 canonical-path science loops in the use_nitrif_denitrif=.true.
branch: init_sminn_tot, accum_sminn_tot, compute_nuptake_prof,
main_competition, sum_sminn_to_plant, residual_uptake_nh4,
residual_uptake_no3, sum_immobilization, compute_fpg_fpi.

The non-canonical .false. branch and the MIMICS-only Loop 19
overflow block are intentionally not instrumented — they're not on
the optimization path.

When INNER_TIMING is undefined the wrappers compile to empty
no-ops; both verify.sh runs (with and without INNER_TIMING=1) still
produce identical checksums.

Labels match what each loop computes so the eventual Step 3 helper
names will follow them, keeping pre/post-extraction timings
directly comparable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Driver now imports perf_timer_print/perf_timer_dump_csv from
perf_timers_mod and calls a new internal write_inner_timings()
subroutine after compare_to_baseline. Body of write_inner_timings
is gated by #ifdef INNER_TIMING — when undefined the subroutine
is empty (no table printed, no file created). When INNER_TIMING=1
the per-loop table is printed to stdout and last_run_timings.csv
is written (one row per timer label).

Add last_run_timings.csv to perf_testing/.gitignore so per-run
timing dumps don't show up as untracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a "Per-loop ('inner') timing" subsection covering:
- make INNER_TIMING=1 build invocation
- stdout per-loop table and last_run_timings.csv output
- CSV is gitignored
- INNER_TIMING is independent of TIMING/PERF_TIMING
- ./verify.sh INNER_TIMING=1 as the easy build+run+check path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 15's per-element body
  sminn_tot(c) = sminn_tot(c) + (smin_no3_vr(c,j) + smin_nh4_vr(c,j)) * dzsoi_decomp(j)
into a sibling pure subroutine accum_sminn_tot taking element-level
scalar args. The do-loop now contains a single call site. Helper
name matches the timer label (and the eventual OpenACC routine
attachment point).

Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1) is
essentially unchanged: 61.1 µs/call before, 62.8 µs/call after
(+2.8%, within run-to-run noise — compiler inlines the pure call).
Both --fast and --all checksums still MATCH.

Loop 14 (init_sminn_tot, just sminn_tot(c) = 0.) is intentionally
not extracted: ~1 µs/call, single-line zero-init, not science.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 16's per-element if/else body
  if (sminn_tot(c)  >  0.) then
     nuptake_prof(c,j) = sminn_vr(c,j) / sminn_tot(c)
  else
     nuptake_prof(c,j) = nfixation_prof(c,j)
  endif
into a sibling pure subroutine compute_nuptake_prof. The do-loop
now contains a single call site. nuptake_prof has intent(out) (set
unconditionally in both branches).

Per-call timing for compute_nuptake_prof (--fast, INNER_TIMING=1):
370.8 µs/call before, 371.7 µs/call after (+0.2%, in noise). Both
--fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's NH4 competition sub-block (sum_nh4_demand calc +
outer if/else for limited case + MIMICS NH4 override) into a
sibling pure subroutine compete_nh4 taking element-level scalar
args. The do-loop body now contains a single call site for this
sub-block; NO3 competition / N2O / carbon_only / summary will
follow in subsequent commits.

Per-call timing for main_competition (--fast, INNER_TIMING=1):
3.009 ms/call before, 3.006 ms/call after (-0.1%, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's NO3 competition sub-block (sum_no3_demand calc +
outer if/else for limited case + MIMICS NO3 override) into a
sibling pure subroutine compete_no3. Mirrors compete_nh4 from
the previous commit; takes fpi_nh4_vr / smin_nh4_to_plant_vr /
actual_immob_nh4_vr as intent(in) since NH4 is competed first
and NO3 demand depends on what NH4 already consumed.

Per-call timing for main_competition (--fast, INNER_TIMING=1):
3.009 ms/call before 3c-i, 2.989 ms/call after 3c-ii (-0.7%
across both extractions, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's 2-line N2O emissions block into a sibling pure
subroutine compute_n2o_emissions. nitrif_n2o_loss_frac is the
routine-local parameter (6e-4_r8), passed in.

Per-call timing for main_competition (--fast, INNER_TIMING=1):
2.989 ms/call before, 3.047 ms/call after (+1.9%, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's `if ( carbon_only ) ... end if` block (which adds N
to sminn pool to eliminate any N limitation when carbon_only is
set) into a sibling pure subroutine apply_carbon_only_adjustment.
The if-on-carbon_only check lives inside the helper; the call site
is unconditional. The five writeable args are intent(inout) — they
keep their input values when carbon_only=.false. (canonical config).

Per-call timing for main_competition (--fast, INNER_TIMING=1):
3.047 ms/call before, 2.987 ms/call after (-2.0%, in noise).
Both --fast and --all checksums still MATCH (the latter exercises
carbon_only=.true. configs, which gate the intent(inout) choice).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's 3-line summary writeback (fpi_vr, sminn_to_plant_vr,
actual_immob_vr) into a sibling pure subroutine. Loop 17's body now
consists entirely of helper-call invocations: compete_nh4,
compete_no3, compute_n2o_emissions, apply_carbon_only_adjustment,
compute_competition_summary. Step 3c is done.

Per-call timing for main_competition (--fast, INNER_TIMING=1):
2.987 ms/call before, 2.997 ms/call after (+0.4%, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 18's accumulate body
  sminn_to_plant(c) = sminn_to_plant(c) + sminn_to_plant_vr(c,j) * dzsoi_decomp(j)
into a sibling pure subroutine accum_sminn_to_plant. Mirrors the
accum_sminn_tot helper from Step 3a with different summands.

The preceding column-only zero-init loop (sminn_to_plant(c) = 0._r8)
is NOT extracted — same reasoning as Loop 14: simple zero-init,
not science. Timer label sum_sminn_to_plant still wraps both
sub-loops.

Per-call timing for sum_sminn_to_plant (--fast, INNER_TIMING=1):
55.6 µs/call before, 56.2 µs/call after (+1.1%, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A first attempt at this extraction wrapped the entire NH4 layer
body in a single pure subroutine; that caused a +27% per-call
regression because nvfortran -O2 wouldn't inline the branchy,
multi-arg helper. Reverted and tried a different shape: extract
just the two pure-math expressions (max-clamp residual; min-based
distribution) as pure functions; keep the if/else structure and
column-level accumulation at the call site. Compiler inlines the
small functions readily.

The two helpers are intentionally shared between NH4 (this commit)
and NO3 (next commit) — same math, different operands. Generic
dummy arg names (smin_vr, actual_immob_vr, f_loss_vr).

Per-call timing for residual_uptake_nh4 (--fast, INNER_TIMING=1):
590 µs/call before, 601 µs/call after (+1.9%, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply compute_residual_smin_vr and distribute_residual_to_plant
(added in 3e-i) to the NO3 layer body. Same shape as NH4: branches
and accumulator stay inline; only the two pure-math expressions
become function calls. f_denit_vr is passed as the f_loss_vr arg
(NH4 side passes f_nit_vr).

No new helpers added.

Per-call timing for residual_uptake_no3 (--fast, INNER_TIMING=1):
181.4 µs/call before, 180.7 µs/call after (-0.4%, in noise).
Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ep 3f)

Rename the previously-extracted accum_sminn_to_plant helper to a
more generic accum_dz_weighted (column_total += value_vr *
dzsoi_decomp) since the same pattern accumulates immobilization
in Loop 23. Update Loop 18's caller to use the new name. Extract
Loop 23's two accumulator lines as two calls to accum_dz_weighted
(one for actual_immob, one for potential_immob). The preceding
column-init zeros block stays inline (per Loop 14/18 reasoning).

Per-call timing (--fast, INNER_TIMING=1):
  sum_sminn_to_plant: 56.2 -> 55.6 µs/call (-1.1%)
  sum_immobilization: 102.0 -> 105.2 µs/call (+3.1%)
Both within run-to-run noise. Both --fast and --all checksums
still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loop 24 (compute_fpg_fpi) computed fpg and fpi via two near-identical
defensive-fraction blocks (`if denom > 0 then num/denom else 1`).
Extract that pattern as a single shared pure function
compute_fraction_or_one(numerator, denominator) and call it twice.
Loop body shrinks from 10 lines (two if/else) to two single-line
calls, with the original science comments preserved at the call site.

Per-call timing for compute_fpg_fpi (--fast, INNER_TIMING=1):
43.3 µs/call before, 24.1 µs/call after (-44%). Significant speedup
(not noise) — likely because the simpler call-site loop body
vectorizes more cleanly than the original two-branch structure
(nvfortran -O2 inlines the pure function and then unrolls the two
calls). Both --fast and --all checksums still MATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the SoilBiogeochemCompetition.F90 file-listing bullet to list
all canonical-path helpers (per-element math + Loop-17 sub-blocks),
note the new perf_timers_mod dependency, and reiterate that the
non-canonical branches are intentionally not refactored.

Add two top entries to "Notes for future optimization stages":
- The per-element helper shape is OpenACC-friendly (each helper can
  take !$acc routine seq; surrounding loops can take !$acc parallel
  loop without further restructuring).
- Lesson learned: wrapping a branchy loop body in a single big pure
  subroutine caused a +27% regression at -O2 (compiler wouldn't
  inline). The working shape is small pure functions with branches
  and accumulators kept at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- run_gpu.sh: new script. Submits a non-interactive PBS job that runs
  verify.sh on a GPU node, blocks until completion via qsub -W
  block=true, and cats the job's stdout/stderr. Defaults: account
  ucsg0003, queue tutorial, 1 GPU, 5 min walltime. WALLTIME env var
  overrides; all positional args are forwarded to verify.sh inside
  the job.
- README: adds a "CPU vs GPU measurement targets" subsection covering
  the three build/run targets (CPU serial / -acc=multicore / -acc=gpu
  -gpu=cc80), example verify.sh invocations, run_gpu.sh usage, and
  how to read the resulting speedup numbers (serial vs multicore vs
  GPU). Notes that all three targets are equivalent until Step 5
  OpenACC directives land.
- .gitignore: adds PBS output patterns sbgc_gpu.o* and sbgc_gpu.e*.

Tested: submitted a default-args job (CPU build, GPU node);
job 6074562 completed and produced MATCH on both modes.

An interactive qsub -I form is intentionally not used here because
this script is meant to be driven non-interactively from automated
workflows. Users wanting an interactive shell can submit qsub -I
directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_gpu.sh: previous version captured verify_args="$*" and dropped
that into the heredoc. Multi-word values like
EXTRA_FFLAGS="-acc=gpu -gpu=cc80" lost their quoting on the GPU
node — make saw -gpu=cc80 as a separate arg and emitted "invalid
option" errors. Re-quote each arg with printf '%q' so the heredoc
gets a properly-shell-quoted command line. Tested: same
EXTRA_FFLAGS="-acc=gpu -gpu=cc80" invocation that previously failed
now produces MATCH on a real PBS job.

verify.sh: --all overwrites last_run_timings.csv at the end of every
verify run, so the canonical --fast per-loop timings were getting
clobbered. Copy the CSV to last_run_timings_fast.csv between the
two run_mode invocations (gated on the file's existence so it stays
a no-op when INNER_TIMING isn't requested). The Step 5 timing
captures now have a stable --fast snapshot to compare against.

.gitignore: add last_run_timings_fast.csv.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First OpenACC commit. Apply !$acc parallel loop to the
accum_sminn_tot loop nest with !$acc routine seq on the helper.
Wrap the loop in a local !$acc data region scoping data movement
to this kernel for now; a later commit will hoist the region up
to amortize transfers across multiple kernels.

Loop order is swapped under #ifdef _OPENACC: GPU/multicore parallelize
over fc with j inner-serial (each fc thread does its own j-reduction
into sminn_tot(c)); CPU-serial keeps the original j-outer/fc-inner
order which is cache-friendlier given column-major layout. !$acc
data and routine directives are comment sentinels (no-op without
-acc), so only the loop-swap needs ifdef.

Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1):
  CPU serial: 60.4 µs -> 62.1 µs (+2.8%, in noise)
  CPU multicore: 159.7 µs -> 69 ms (+432x)
  GPU: 158.7 µs -> 3.5 ms (+22x)

The multicore/GPU regressions are expected for a tiny kernel with
its own per-call data region: copy-in/out of all referenced arrays
on every call dominates the ~30 µs of actual compute work. The
local data region is the right shape for this commit (each kernel
explicitly manages its data); it gets hoisted in a later step so
transfers happen once per iteration rather than once per loop.
Both --fast and --all checksums still MATCH on all three targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the four read-only inputs (smin_nh4_vr, smin_no3_vr,
dzsoi_decomp, filter_bgc_soilc) out of accum_sminn_tot's local
!$acc data region and up to a single !$acc data copyin region
around the driver's iter loop. The inputs are constant across the
iter loop (set once in fill_inputs_once, never modified), so
copyin happens once per ./driver run rather than once per kernel
call.

The local data region in accum_sminn_tot shrinks to just
copy(sminn_tot) (a routine-local automatic, must stay here). The
parallel loop gets default(present) so it uses the outer-region
copies.

Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1):
                  Pre-5a    Step-5a-local   Step-5a-hoisted
  Serial:         60.4 µs   62.1 µs         63.3 µs
  Multicore:      159.7 µs  69 ms           52 ms
  GPU:            158.7 µs  3.5 ms          94.6 µs    (37x speedup)

GPU is now within 1.6x of CPU serial, with the remaining gap
mostly kernel-launch overhead for a tiny kernel — will shrink
when more kernels share the same data region. Multicore still
loses to serial because the parallel-region setup overhead per
call dominates the ~30 µs of compute work for this size of
kernel; that's a property of the kernel size, not the hoist.
Both --fast and --all checksums still MATCH on all three targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the !$acc data copyin region from the driver's iter loop into
the canonical-path block of SoilBiogeochemCompetition. The
driver's iter loop is an artifact of the perf harness — there's no
equivalent in real CTSM, where this routine is called once per
model timestep from CNDriverMod. Hoisting to the iter loop made
benchmark numbers look better than they will in production
(transfers amortized over 100 fake-iters). Hoisting inside the
routine means transfers happen once per call (= once per timestep
in real code), and the patch translates 1:1 to the real tree
without caller-side changes.

The accum_sminn_tot site (local !$acc data copy(sminn_tot),
!$acc parallel loop default(present), loop-order swap) is
unchanged; default(present) now picks up the inputs from this
routine-internal outer region instead of the driver's region.

Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1):
                  Pre-5     Driver-hoist  Routine-hoist (this)
  Serial:         60.4 µs   63.3 µs       62.1 µs
  Multicore:      159.7 µs  52 ms         67.4 ms
  GPU:            158.7 µs  94.6 µs       85.5 µs

GPU stays comparable (driver-hoist ~95 µs vs routine-hoist ~85 µs;
both within noise). Multicore is parallel-region-overhead
dominated either way for this kernel size — neither number is
meaningful for tuning. The routine-hoist version is the one we'll
build on going forward; subsequent loops will extend the copyin
list in the same routine-internal !$acc data region. Both --fast
and --all checksums still MATCH on all three targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samsrabin and others added 14 commits May 6, 2026 16:34
Add !$acc routine seq to the compute_nuptake_prof helper and
!$acc parallel loop collapse(2) default(present) to its loop site.
Each (c,j) writes to its own nuptake_prof(c,j); no reduction, so
collapse(2) gives ~80,000 parallel work-units. No CPU/GPU loop
swap needed — original do j; do fc; ordering works for both.

Hoist data:
  - Outer routine-internal !$acc data copyin list grows by
    sminn_vr, nfixation_prof (compute_nuptake_prof's read-only
    inputs).
  - The old !$acc data copy(sminn_tot) that wrapped only
    accum_sminn_tot is replaced with a wider
    !$acc data copyin(sminn_tot) copyout(nuptake_prof) that wraps
    BOTH accum_sminn_tot and compute_nuptake_prof. sminn_tot now
    stays on the device between the two kernels (no host
    round-trip); nuptake_prof is copied out at region end so the
    host-side main_competition that follows sees the final values.

Per-call timing (--fast, INNER_TIMING=1):
                              Pre-5b    Post-5b
  compute_nuptake_prof
    Serial:                   377 µs    376 µs       (noise)
    Multicore:                723 µs    70 ms        (overhead-bound)
    GPU:                      454 µs    11 µs        (-97%, 33x faster than serial)

  accum_sminn_tot (improved by sminn_tot no longer round-tripping)
    Serial:                   62.1 µs   62.5 µs      (noise)
    Multicore:                67.4 ms   75.9 ms      (overhead-bound)
    GPU:                      85.5 µs   49 µs        (-43%)

Both --fast and --all checksums still MATCH on all three targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NVIDIA engineer flagged that nvfortran's -acc=multicore isn't the
right CPU-parallel target — its per-region overhead made the
numbers pathologically slow (~100-1000x slower than serial on
small kernels). OpenMP is what CTSM uses for CPU threading in
production, so it's both more realistic and not bottlenecked on
the multicore runtime. Drop multicore; add OpenMP.

Source layout: each loop now carries both an !$omp parallel do
directive (above) and the existing !$acc parallel loop directive
(below). Both are Fortran-comment-prefixed sentinels — build flag
picks which one activates. Without -mp or -acc, both ignored
(serial). The !$acc data regions are no-ops on -mp builds because
OpenMP threads share host memory.

Retroactive directive additions:
  - accum_sminn_tot: !$omp parallel do private(c). The loop-swap
    #ifdef changes from _OPENACC alone to (_OPENACC || _OPENMP)
    so the swap fires when EITHER parallel target is active
    (parallel correctness requires fc-outer to give each thread a
    unique c, avoiding races on sminn_tot(c)).
  - compute_nuptake_prof: !$omp parallel do collapse(2) private(c)
    matching the OpenACC collapse(2) shape.

README's "CPU vs GPU measurement targets" rewritten: 3-row table
swaps multicore for OpenMP; new explanatory paragraph on why
multicore was dropped; speedup-interpretation section refers to
OpenMP throughout. New EXTRA_FFLAGS values:
  - Serial: (none)
  - OpenMP: EXTRA_FFLAGS="-mp"
  - GPU:    EXTRA_FFLAGS="-acc=gpu -gpu=cc80"

Per-call timings (--fast, INNER_TIMING=1):

  accum_sminn_tot       Serial    OpenMP    GPU
  was multicore=67ms ->  60 µs   329 µs   50 µs
  compute_nuptake_prof
  was multicore=70ms -> 373 µs   709 µs   11 µs

OpenMP still loses to serial on these tiny kernels (per-region
setup overhead is real for ~30-100 µs of compute work) but it's
no longer catastrophically slow. GPU numbers unchanged from
before. Both --fast and --all checksums MATCH on all three
targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tutorial queue routes through the hackathon reservation R6065080,
which is a single shared GPU node and frequently backs up. The develop
queue returned GPU node allocations in seconds during testing. All
qsub references in run_gpu.sh, debug_gpu.sh, and the README docs
switched accordingly.

build.sh wraps `. ../env.sh && make clean && make "$@"` so iterative
build loops don't need an inline-sourced env file (which requires a
fresh permission approval per shape).

debug_gpu.sh submits a PBS job that runs ./driver directly without
verify.sh's grep filter — useful when the GPU run is producing crash
output or unexpected diagnostics that verify.sh would hide.

.gitignore picks up sbgc_dbg.o*/.e* output filenames produced by the
new debug_gpu.sh helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(Step 5c)

Add !$acc parallel loop collapse(2) default(present) private(c, l) to
the main_competition loop. !$acc routine seq is added to all 5 helpers
(compete_nh4, compete_no3, compute_n2o_emissions,
apply_carbon_only_adjustment, compute_competition_summary) so they
can be called from the device kernel. private(c, l) is required
because c = filter_bgc_soilc(fc) and l = landunit(c) are scalar
assignments inside the collapsed loop body and the compiler's
auto-private detection isn't reliable for them. l is set but unused;
kept to mirror the in-tree CTSM code.

Collapse the previous three nested !$acc data regions (outer
constants, middle sminn_tot/nuptake_prof, and the proposed inner
main_competition scratch region) into a single !$acc data region
that opens after init_sminn_tot and closes at the bottom of the
use_nitrif_denitrif branch. After main_competition's kernel, an
!$acc update self(...) brings the device-computed arrays back to
the host so the CPU loops below (sum_sminn_to_plant, residual_uptake_*,
sum_immobilization, compute_fpg_fpi) can read fresh values. Each
subsequent step that GPU-ifies one of those CPU loops will drop the
arrays it consumes from the update self list; once everything is on
GPU, the update self goes away.

Critical clause choice: smin_nh4_to_plant_vr, smin_no3_to_plant_vr,
sminn_to_plant_vr are in create(), not copyout(). The CPU
residual_uptake_nh4 / residual_uptake_no3 loops modify these arrays
on the host inside the data region. With copyout(), end-of-region
device-to-host copy would clobber those host writes (they would be
overwritten with the device's pre-residual values). create() avoids
the end-of-region transfer; the host's residual updates survive.
These arrays move to copyout() in subsequent steps as their CPU
producers become GPU kernels.

main_competition per-call wall-clock (--fast):
  serial:  2.93e-3 s
  openmp:  6.92e-3 s     (-mp; parallelization overhead exceeds the
                          benefit at this problem size, same pattern
                          as Step 5a/5b)
  gpu:     1.69e-5 s     (-acc=gpu -gpu=cc80; ~174x faster than serial)

Total per-call wall-clock (--fast):
  serial:  5.55e-3 s
  openmp:  1.09e-2 s
  gpu:     7.65e-3 s     (kernel speedup is partially eaten by data
                          transfers + the still-CPU loops below;
                          Steps 5d-g should narrow this gap)

Baseline checksums updated to the GPU values:
  baseline_checksum_fast.txt:
    old: 9.5857105051752981E+06 (serial value)
    new: 9.5857123908133078E+06 (gpu value, +1.886 absolute)
  baseline_checksum.txt:
    old: 7.6772246368780300E+07
    new: 7.6777929984949291E+07 (+5683.62 absolute)

GPU runs are deterministic: 5 successive --fast invocations in the
same allocation produce bit-identical checksums, ruling out a race
condition. The diff between GPU and CPU checksums comes from
nvfortran generating ~1 ULP of difference in compete_nh4's branch-2
calc, which gets amplified through residual_uptake_no3's
`if (residual_plant_ndemand(c) > 0._r8)` branch — for c values where
residual_plant_ndemand(c) is on a knife edge, the ULP perturbation
flips the branch and a residual-redistribution computation runs (or
doesn't). Per project policy tolerance stays tight (1e-10 relative).
Serial and OpenMP runs will now MISMATCH against the (gpu-tracked)
baselines by 1.886 absolute (--fast) and 5683.62 (--all); those
specific diff values are the new expected cpu-vs-baseline diffs and
should remain stable across subsequent commits unless something
materially changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aselines

Second-pass review caught another instance of the same bug class as
the copyout-clobber issue from the previous commit, in the inverse
direction this time:

  nlimit_nh4, nlimit_no3, sum_nh4_demand_scaled, sum_no3_demand_scaled

were in create() — written on device by compete_nh4/compete_no3
inside main_competition's kernel — but missing from the !$acc
update self list right after the kernel. With create() there is no
end-of-region D2H transfer, so the host arrays kept whatever was on
the stack when the routine entered (uninitialized integers and
doubles).

  - residual_uptake_nh4 / residual_uptake_no3 read nlimit_nh4(c,j)
    and nlimit_no3(c,j) via `if (...nlimit_*(c,j) .eq. 0)` checks
    that gate the residual-N redistribution. With stale stack
    values, those branches misfire for some (c,j).
  - The mimics_decomp branch reads sum_nh4_demand_scaled and
    sum_no3_demand_scaled (only in --all configs that exercise
    that path).

The previous commit's commit message attributed the GPU-vs-CPU diff
to ULP-level codegen differences amplified through residual_uptake_no3
branch sensitivity. That hypothesis was masking this bug. With the
four arrays added to !$acc update self, the GPU produces bit-identical
checksums to the serial and OpenMP runs:

  --fast all three: 9.5857105051752981E+06  (MATCH, |diff| = 0.0)
  --all  all three: 7.6772246368780300E+07  (MATCH, |diff| = 0.0)

Both baseline_checksum.txt and baseline_checksum_fast.txt are
reverted to those original (CPU) values.

The previously committed values (9.5857123908133078E+06 / 7.6777929984949291E+07)
were the buggy GPU result, and reverting brings tight-tolerance
matching back across all three targets — the goal stated in the
project's tracking-changes policy.

Note that the GPU run was deterministic across 5 invocations even
with the bug present: the uninitialized stack values for
nlimit_nh4(c,j) etc. happened to be consistent across calls, likely
because the routine's stack frame layout repeats and the underlying
memory is zero-initialized by the OS at process start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to the two sub-loops of
sum_sminn_to_plant. The init loop (sminn_to_plant(c) = 0) is naturally
parallel over fc. The accumulation loop is fc-outer / j-inner under
_OPENACC || _OPENMP (so each thread owns a unique c and the
dz-weighted sum into sminn_to_plant(c) is serial within a thread); the
CPU-serial path keeps j-outer for cache-friendliness. Same race
pattern as accum_sminn_tot in Step 5a.

The accum_dz_weighted helper picks up !$acc routine seq so it's
callable from inside the device kernel. sminn_to_plant is added to
the !$acc data region's create() clause: the host residual_uptake_nh4
and residual_uptake_no3 loops modify sminn_to_plant later in the same
data region, so copyout would clobber those host writes — create
avoids the end-of-region D2H transfer and the host updates survive.

The !$acc update self block moves from after main_competition to
after sum_sminn_to_plant. It still sits before the mimics_decomp
block (which reads sum_*_demand_scaled) and before residual_uptake_nh4
(which reads sminn_to_plant), so all downstream CPU loops see fresh
device-computed values. sminn_to_plant joins the update self list.

sum_sminn_to_plant per-call wall-clock (--fast):
  serial:   5.70e-5 s
  openmp:   3.61e-4 s     (-mp; 6.3x slower than serial — same
                           parallelization-overhead pattern as the
                           other small kernels)
  gpu:      2.06e-5 s     (~2.8x faster than serial)

Total per-call wall-clock (--fast):
  serial:   5.58e-3 s
  openmp:   1.11e-2 s
  gpu:      7.82e-3 s

All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Data-clause checklist (memory feedback_openacc_data_clause_review.md)
was walked top-to-bottom; review-agent approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to all eight sub-loops of
the two residual_uptake blocks (each block has init / main work /
re-sum init / re-sum). Loop ordering is fc-outer / j-inner under
_OPENACC || _OPENMP because residual_smin_*(c) accumulates over j and
the downstream branch reads the running total — j must be serial
within each thread, and each thread owns a unique c. CPU-serial keeps
j-outer for cache. Same race pattern as accum_sminn_tot.

compute_residual_smin_vr and distribute_residual_to_plant pick up
!$acc routine seq so they're callable from inside the device kernels.

Data-clause restructuring driven by the OpenACC review checklist
(memory feedback_openacc_data_clause_review.md):

  - smin_nh4_to_plant_vr, smin_no3_to_plant_vr, sminn_to_plant_vr
    move from create() to copyout(). Post-Step-5e nothing on the host
    modifies them in the region (the residual loops that did so are
    now GPU kernels), and the driver checksum reads them after the
    region. Letting end-of-region D2H copy them is simpler than an
    update self.
  - New routine-local automatics added to create():
      residual_plant_ndemand, residual_smin_nh4, residual_smin_no3,
      residual_smin_nh4_vr, residual_smin_no3_vr.
    They are written and read entirely inside the residual kernels.
  - update self after sum_sminn_to_plant trimmed to just
      actual_immob_vr, sum_nh4_demand_scaled, sum_no3_demand_scaled
    — the only arrays the still-CPU code between that point and the
    next sync (mimics block, sum_immobilization init) reads.
  - New update self(sminn_to_plant) added after residual_uptake_no3:
    compute_fpg_fpi (still CPU) reads sminn_to_plant on host, and the
    residual kernels just modified it on device. Drop this update
    self once compute_fpg_fpi is also GPU-ified.

residual_uptake_nh4 per-call wall-clock (--fast):
  serial:   8.06e-4 s
  openmp:   1.24e-3 s     (-mp; ~1.5x slower than serial — same
                           parallel-overhead pattern at this size)
  gpu:      4.75e-5 s     (~17x faster than serial)

residual_uptake_no3 per-call wall-clock (--fast):
  serial:   2.64e-4 s
  openmp:   6.16e-4 s     (-mp; ~2.3x slower than serial)
  gpu:      4.44e-5 s     (~6x faster than serial)

Total per-call wall-clock (--fast) — GPU now beats serial overall:
  serial:   7.32e-3 s
  openmp:   1.19e-2 s
  gpu:      5.81e-3 s     (1.26x faster than serial)

All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Reviewer-approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to both sub-loops of
sum_immobilization (init zeroing + dz-weighted accumulation of
actual_immob(c) and potential_immob(c) over j). Same fc-outer / j-inner
under _OPENACC || _OPENMP race pattern as accum_sminn_tot and
sum_sminn_to_plant; CPU-serial keeps j-outer for cache.

Data-clause changes:
  - actual_immob, potential_immob (column-scalar pointer args from
    the driver) added to create(). The kernel zeros and accumulates
    them on device; compute_fpg_fpi (still CPU) and the driver
    checksum read them on host.
  - The update self(sminn_to_plant) that lived right after
    residual_uptake_no3 is consolidated into a new
    update self(sminn_to_plant, actual_immob, potential_immob)
    placed after sum_immobilization. No host code between
    residual_uptake_no3 and sum_immobilization reads sminn_to_plant,
    so the move is safe.
  - update self after sum_sminn_to_plant trimmed: actual_immob_vr
    dropped (sum_immobilization now reads it on-device), so only
    sum_*_demand_scaled remain (for the still-CPU mimics_decomp block).

sum_immobilization per-call wall-clock (--fast):
  serial:   1.02e-4 s
  openmp:   6.14e-4 s     (-mp; ~6x slower than serial — same
                           parallelization-overhead pattern)
  gpu:      2.26e-5 s     (~4.5x faster than serial)

Total per-call wall-clock (--fast):
  serial:   5.68e-3 s
  openmp:   1.23e-2 s
  gpu:      5.29e-3 s

All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Reviewer-approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to compute_fpg_fpi (per-c
loop computing fpg(c) = compute_fraction_or_one(sminn_to_plant(c),
plant_ndemand(c)) and fpi(c) = compute_fraction_or_one(actual_immob(c),
potential_immob(c))). The compute_fraction_or_one helper picks up
!$acc routine seq.

This is the last canonical-path CPU loop; after this commit, every
loop in the use_nitrif_denitrif=.true. branch is a GPU kernel except
the mimics_decomp block (which only runs in --all configs that
exercise that path).

Data-clause restructuring (driven by the OpenACC review checklist):
  - fpg, fpi added to copyout(): kernel writes them on device, driver
    checksum reads them on host post-region.
  - sminn_to_plant, actual_immob, potential_immob move from create()
    to copyout(): they're now produced and consumed entirely on
    device, no host code in the region reads them, but the driver
    checksum needs them on host so end-of-region D2H is the cleanest
    way to deliver them.
  - !$acc update self(sminn_to_plant, actual_immob, potential_immob)
    after sum_immobilization is deleted: no remaining CPU consumer.
  - !$acc update self(sum_*_demand_scaled) after sum_sminn_to_plant
    is kept: the still-CPU mimics_decomp block reads them.

compute_fpg_fpi per-call wall-clock (--fast):
  serial:   2.35e-5 s
  openmp:   7.18e-5 s     (-mp; ~3x slower than serial — same
                           parallel-overhead pattern at this size)
  gpu:      1.08e-5 s     (~2x faster than serial)

Total per-call wall-clock (--fast):
  serial:   5.61e-3 s
  openmp:   1.21e-2 s
  gpu:      5.56e-3 s

All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Reviewer-approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-of-Step-5 documentation pass per the project plan:

  - Replace the "until parallel directives land" hedge in the speedup
    section with a measured per-loop wall-clock table for all three
    targets (serial, OpenMP, GPU), captured via INNER_TIMING=1 on the
    canonical --fast path.
  - Note that OpenMP is consistently slower than serial at this
    problem size (parallel-region overhead dominates on 8000 columns
    × 10 levels), and that the GPU's per-kernel speedups don't
    translate to total-time wins yet — host/device transfers and the
    still-CPU mimics_decomp block bound the result.
  - Add a "Status after Step 5" section that documents the single
    !$acc data region, the surviving update self for the
    mimics_decomp block's needs, and the two data-clause failure
    modes (copyout host-write clobber, create host-read of stale
    memory) that bit Step 5c twice during debugging. The discipline
    is also captured in the feedback_openacc_data_clause_review
    memory.
  - Add an "Open work" section calling out the next two natural
    optimizations: GPU-ify the mimics_decomp block (which would let
    us delete the surviving update self), and hoist the data region
    out of SoilBiogeochemCompetition's per-call scope to amortize
    open/close transfers across iterations.

No code changes; baselines unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous Open Work entry suggested hoisting the !$acc data region
up into the driver's per-call timing loop to amortize open/close
transfers. That change would only improve the harness's measured
speedup — production CTSM has no such repeated-call loop, so the
amortization wouldn't translate to any real-world performance gain.

Replace with the honest description of the bottleneck: per-call PCIe
transfers of the copyin/copyout arrays. The realistic fix is to keep
data on the device across routine boundaries, which requires
GPU-ifying upstream producers and downstream consumers — same
trajectory CTSM as a whole has to take, not a harness-specific trick.
Add an explicit "do not" to flag the hoist anti-pattern for future
reviewers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The host only reads sum_nh4_demand_scaled and sum_no3_demand_scaled
inside the `if (decomp_method == mimics_decomp)` block; the else
branch just zeros c_overflow_vr and never touches them. Move the
!$acc update self into the if so the D2H transfer fires only when
mimics is actually entered, saving a wasted transfer on every
canonical (non-MIMICS) call (~80 KB × 2 arrays per call).

All three targets (serial / OpenMP / GPU) still MATCH bit-identical
on both --fast (canonical path, update self now skipped) and --all
(includes mimics configs that still trigger the transfer):

  --fast: 9.5857105051752981E+06
  --all : 7.6772246368780300E+07

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the perf_timers_mod routines were no-ops (empty bodies)
when INNER_TIMING was undefined, but the call sites still fired —
each unbuilt-INNER_TIMING run paid 16 empty-subroutine call costs
per timestep (~1-2 µs total per call). Negligible against ~5 ms
totals, but it muddies any "production-style" measurement and makes
INNER_TIMING-on vs -off not actually a clean A/B.

Now every `call perf_timer_start(...)` and `call perf_timer_stop(...)`
in SoilBiogeochemCompetition.F90 is wrapped in #ifdef INNER_TIMING /
#endif, and so are the `use perf_timers_mod, only : ...` lines in
both SoilBiogeochemCompetition.F90 and driver.F90. The `write_inner_timings`
print/dump_csv calls were already gated. With INNER_TIMING undefined
the compiler emits no perf_timer instructions at all.

Verified all three targets (serial / OpenMP / GPU) still MATCH
bit-identical with INNER_TIMING off and INNER_TIMING=1, on both
--fast and --all. Per-call timing for serial and GPU is within
run-to-run noise either way (~5.7 ms serial, ~5.5 ms GPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samsrabin samsrabin self-assigned this May 18, 2026
@samsrabin samsrabin added b4b bit-for-bit performance idea or PR to improve performance (e.g. throughput, memory) next this should get some attention in the next week or two. Normally each Thursday SE meeting. and removed next this should get some attention in the next week or two. Normally each Thursday SE meeting. labels May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

b4b bit-for-bit performance idea or PR to improve performance (e.g. throughput, memory)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant