OpenACC directives for SoilBiogeochemCompetition by samsrabin · Pull Request #4027 · ESCOMP/CTSM

samsrabin · 2026-05-18T19:41:12Z

Description of changes

This PR contains the changes I (well, mostly Claude, at my instruction) made in an isolated reproducer of SoilBiogeochemCompetition to enable OpenACC offloading of work to GPU. This isolated reproducer is designed to enable easy testing and iteration, but meaningful performance testing should be carried out in actual model runs.

The changes in this PR are not likely to result in much speedup unless running on GPU, and even then, it's unclear how much impact they'll make. However, I would like to get this merged into the main code to serve as a "living example" of how we can start transitioning more of the codebase to be GPU-offloadable.

Misc. notes:

So far, OpenACC directives have only been added for the most common code route: use_nitrif_denitrif = .true., no MIMICS, no FATES.
The code was first refactored so that the scientific bits are in their own subroutines and thus isolated from the messy directives and whatnot.

Remaining work

Bring changes in to the actual SoilBiogeochemCompetitionMod, not my reproducer.
Demonstrate no diffs on CPU (aux_clm)
Add an aux_clm test that exercises the GPU offloading.
Document performance impact, comparing CPU to GPU

Specific notes

Contributors other than yourself, if any:

@scamp-nvidia

CTSM issues resolved or otherwise addressed, if any:

None

Any user interface changes (namelist or namelist defaults changes)? No

Testing planned or performed, if any:

Manual testing using scripts in perf_testing/SoilBiogeochemCompetition/
aux_clm to show no diffs

Requirements before merge:

I have followed the CTSM contribution guidelines.

The code in this PR branch builds with no errors.
The code in this PR branch runs with no errors. Briefly describe tested configuration(s):
This either (a) does not change answers, (b) it only changes answers at roundoff level, or (c) I have performed a scientific evaluation of the answer changes. Which?:

I have reviewed relevant parts of the CLM documentation Tech Note or User's Guide to determine if anything needs to be changed or added. If it does, describe:
This PR either (a) does not create a need to update the documentation or (b) includes required documentation updates (see guidelines for contributing documentation). Which?:

Driver default now runs all 8 combinations of (use_nitrif_denitrif, carbon_only, decomp_method == mimics_decomp) per iteration and sums the per-config checksums into one combined value. The committed baseline_checksum.txt is regenerated against this --all mode, so a single MATCH covers every top-level branch in the routine. Add --fast flag (the previous behavior): runs just the canonical config (use_nitrif_denitrif=.true., carbon_only=.false., decomp_method=mimics_decomp). Use it for tight perf-iteration loops. The --fast checksum (9.5970435393765438E+04) matches the single-config baseline that was committed in 41c9127 — strong sanity check that the routine's behavior on the canonical path is unchanged. Implementation: - parse_args() now accepts --fast in any position; positional args drop use_nitrif_denitrif and carbon_only (those are now per-config, not driver-level). - fill_inputs() split into fill_inputs_once() (synthetic inputs, one-shot) and zero_outputs() (called before every routine call, since the routine accumulates into actual_immob / potential_immob in the non-nitrif branch). - run_config(unitrif, conly, dmethod, partial_cs) wraps the call site so the main flow can dispatch over configs cleanly. - last_run.txt / baseline_checksum.txt fingerprint now records 'mode' (all|fast) + sizes + niters; per-config flags are no longer part of the fingerprint. Verified end-to-end: ./driver MATCHes new baseline; ./driver --fast produces the previous single-config checksum (skipping baseline compare since the fingerprint differs); both TIMING=1 and TIMING=0 builds work; per-call results are independent of niters and config order (re-zeroing outputs each call). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Flip --fast from (use_nitrif_denitrif=.true., carbon_only=.false., MIMICS on) to (..., MIMICS off) so the canonical config exercises the use_nitrif_denitrif=.true. branch without the MIMICS overflow loop. Have compare_to_baseline pick its file based on mode: --fast reads baseline_checksum_fast.txt, --all keeps reading baseline_checksum.txt. The --all baseline is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reference checksum for the new canonical --fast config (use_nitrif_denitrif=.true., carbon_only=.false., MIMICS off). The driver looks for this file when run with --fast and reports MATCH/MISMATCH against it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reflect two recent changes: --fast now runs (use_nitrif_denitrif=.true., carbon_only=.false., MIMICS off), and the driver compares against either baseline_checksum.txt or baseline_checksum_fast.txt depending on mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduce non_mimics_decomp = 1 as a sibling of mimics_decomp = 2 in the driver and use it at every call site that previously did mimics_decomp - 1 (5 sites: 1 in --fast, 4 in --all). Removes arithmetic at call sites; behavior unchanged (both baselines still MATCH). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sentence-case loop section headers and add banner comments around the "second pass" residual uptake block in the use_nitrif_denitrif =.false. branch. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds the driver and runs both --fast and --all, reporting just the checksum / baseline / elapsed lines per mode. Passes caller args through to make (e.g. ./verify.sh EXTRA_FFLAGS="-acc=gpu"), fails loudly on build error. Eliminates retyping the verification pipeline across staged refactor commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New module perf_testing/perf_timers_mod.F90 exposes perf_timer_start/stop/print/dump_csv/reset wall-clock timers backed by Fortran system_clock. Gated by cpp macro INNER_TIMING; when undefined every public routine is empty so the science code carries zero overhead. Up to 64 labels, auto-created on first start. Makefile.common recognizes INNER_TIMING=1 (independent of the existing TIMING knob). The SoilBiogeochemCompetition Makefile picks up the parent-dir module via VPATH and lists it in OBJ. No callers yet — both verify.sh runs (with and without INNER_TIMING=1) still produce identical checksums. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add perf_timer_start/stop calls from perf_timers_mod around each of the 9 canonical-path science loops in the use_nitrif_denitrif=.true. branch: init_sminn_tot, accum_sminn_tot, compute_nuptake_prof, main_competition, sum_sminn_to_plant, residual_uptake_nh4, residual_uptake_no3, sum_immobilization, compute_fpg_fpi. The non-canonical .false. branch and the MIMICS-only Loop 19 overflow block are intentionally not instrumented — they're not on the optimization path. When INNER_TIMING is undefined the wrappers compile to empty no-ops; both verify.sh runs (with and without INNER_TIMING=1) still produce identical checksums. Labels match what each loop computes so the eventual Step 3 helper names will follow them, keeping pre/post-extraction timings directly comparable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Driver now imports perf_timer_print/perf_timer_dump_csv from perf_timers_mod and calls a new internal write_inner_timings() subroutine after compare_to_baseline. Body of write_inner_timings is gated by #ifdef INNER_TIMING — when undefined the subroutine is empty (no table printed, no file created). When INNER_TIMING=1 the per-loop table is printed to stdout and last_run_timings.csv is written (one row per timer label). Add last_run_timings.csv to perf_testing/.gitignore so per-run timing dumps don't show up as untracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a "Per-loop ('inner') timing" subsection covering: - make INNER_TIMING=1 build invocation - stdout per-loop table and last_run_timings.csv output - CSV is gitignored - INNER_TIMING is independent of TIMING/PERF_TIMING - ./verify.sh INNER_TIMING=1 as the easy build+run+check path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 15's per-element body sminn_tot(c) = sminn_tot(c) + (smin_no3_vr(c,j) + smin_nh4_vr(c,j)) * dzsoi_decomp(j) into a sibling pure subroutine accum_sminn_tot taking element-level scalar args. The do-loop now contains a single call site. Helper name matches the timer label (and the eventual OpenACC routine attachment point). Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1) is essentially unchanged: 61.1 µs/call before, 62.8 µs/call after (+2.8%, within run-to-run noise — compiler inlines the pure call). Both --fast and --all checksums still MATCH. Loop 14 (init_sminn_tot, just sminn_tot(c) = 0.) is intentionally not extracted: ~1 µs/call, single-line zero-init, not science. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 16's per-element if/else body if (sminn_tot(c) > 0.) then nuptake_prof(c,j) = sminn_vr(c,j) / sminn_tot(c) else nuptake_prof(c,j) = nfixation_prof(c,j) endif into a sibling pure subroutine compute_nuptake_prof. The do-loop now contains a single call site. nuptake_prof has intent(out) (set unconditionally in both branches). Per-call timing for compute_nuptake_prof (--fast, INNER_TIMING=1): 370.8 µs/call before, 371.7 µs/call after (+0.2%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 17's NH4 competition sub-block (sum_nh4_demand calc + outer if/else for limited case + MIMICS NH4 override) into a sibling pure subroutine compete_nh4 taking element-level scalar args. The do-loop body now contains a single call site for this sub-block; NO3 competition / N2O / carbon_only / summary will follow in subsequent commits. Per-call timing for main_competition (--fast, INNER_TIMING=1): 3.009 ms/call before, 3.006 ms/call after (-0.1%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 17's NO3 competition sub-block (sum_no3_demand calc + outer if/else for limited case + MIMICS NO3 override) into a sibling pure subroutine compete_no3. Mirrors compete_nh4 from the previous commit; takes fpi_nh4_vr / smin_nh4_to_plant_vr / actual_immob_nh4_vr as intent(in) since NH4 is competed first and NO3 demand depends on what NH4 already consumed. Per-call timing for main_competition (--fast, INNER_TIMING=1): 3.009 ms/call before 3c-i, 2.989 ms/call after 3c-ii (-0.7% across both extractions, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 17's 2-line N2O emissions block into a sibling pure subroutine compute_n2o_emissions. nitrif_n2o_loss_frac is the routine-local parameter (6e-4_r8), passed in. Per-call timing for main_competition (--fast, INNER_TIMING=1): 2.989 ms/call before, 3.047 ms/call after (+1.9%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 17's `if ( carbon_only ) ... end if` block (which adds N to sminn pool to eliminate any N limitation when carbon_only is set) into a sibling pure subroutine apply_carbon_only_adjustment. The if-on-carbon_only check lives inside the helper; the call site is unconditional. The five writeable args are intent(inout) — they keep their input values when carbon_only=.false. (canonical config). Per-call timing for main_competition (--fast, INNER_TIMING=1): 3.047 ms/call before, 2.987 ms/call after (-2.0%, in noise). Both --fast and --all checksums still MATCH (the latter exercises carbon_only=.true. configs, which gate the intent(inout) choice). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 17's 3-line summary writeback (fpi_vr, sminn_to_plant_vr, actual_immob_vr) into a sibling pure subroutine. Loop 17's body now consists entirely of helper-call invocations: compete_nh4, compete_no3, compute_n2o_emissions, apply_carbon_only_adjustment, compute_competition_summary. Step 3c is done. Per-call timing for main_competition (--fast, INNER_TIMING=1): 2.987 ms/call before, 2.997 ms/call after (+0.4%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Loop 18's accumulate body sminn_to_plant(c) = sminn_to_plant(c) + sminn_to_plant_vr(c,j) * dzsoi_decomp(j) into a sibling pure subroutine accum_sminn_to_plant. Mirrors the accum_sminn_tot helper from Step 3a with different summands. The preceding column-only zero-init loop (sminn_to_plant(c) = 0._r8) is NOT extracted — same reasoning as Loop 14: simple zero-init, not science. Timer label sum_sminn_to_plant still wraps both sub-loops. Per-call timing for sum_sminn_to_plant (--fast, INNER_TIMING=1): 55.6 µs/call before, 56.2 µs/call after (+1.1%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A first attempt at this extraction wrapped the entire NH4 layer body in a single pure subroutine; that caused a +27% per-call regression because nvfortran -O2 wouldn't inline the branchy, multi-arg helper. Reverted and tried a different shape: extract just the two pure-math expressions (max-clamp residual; min-based distribution) as pure functions; keep the if/else structure and column-level accumulation at the call site. Compiler inlines the small functions readily. The two helpers are intentionally shared between NH4 (this commit) and NO3 (next commit) — same math, different operands. Generic dummy arg names (smin_vr, actual_immob_vr, f_loss_vr). Per-call timing for residual_uptake_nh4 (--fast, INNER_TIMING=1): 590 µs/call before, 601 µs/call after (+1.9%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply compute_residual_smin_vr and distribute_residual_to_plant (added in 3e-i) to the NO3 layer body. Same shape as NH4: branches and accumulator stay inline; only the two pure-math expressions become function calls. f_denit_vr is passed as the f_loss_vr arg (NH4 side passes f_nit_vr). No new helpers added. Per-call timing for residual_uptake_no3 (--fast, INNER_TIMING=1): 181.4 µs/call before, 180.7 µs/call after (-0.4%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ep 3f) Rename the previously-extracted accum_sminn_to_plant helper to a more generic accum_dz_weighted (column_total += value_vr * dzsoi_decomp) since the same pattern accumulates immobilization in Loop 23. Update Loop 18's caller to use the new name. Extract Loop 23's two accumulator lines as two calls to accum_dz_weighted (one for actual_immob, one for potential_immob). The preceding column-init zeros block stays inline (per Loop 14/18 reasoning). Per-call timing (--fast, INNER_TIMING=1): sum_sminn_to_plant: 56.2 -> 55.6 µs/call (-1.1%) sum_immobilization: 102.0 -> 105.2 µs/call (+3.1%) Both within run-to-run noise. Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Loop 24 (compute_fpg_fpi) computed fpg and fpi via two near-identical defensive-fraction blocks (`if denom > 0 then num/denom else 1`). Extract that pattern as a single shared pure function compute_fraction_or_one(numerator, denominator) and call it twice. Loop body shrinks from 10 lines (two if/else) to two single-line calls, with the original science comments preserved at the call site. Per-call timing for compute_fpg_fpi (--fast, INNER_TIMING=1): 43.3 µs/call before, 24.1 µs/call after (-44%). Significant speedup (not noise) — likely because the simpler call-site loop body vectorizes more cleanly than the original two-branch structure (nvfortran -O2 inlines the pure function and then unrolls the two calls). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Expand the SoilBiogeochemCompetition.F90 file-listing bullet to list all canonical-path helpers (per-element math + Loop-17 sub-blocks), note the new perf_timers_mod dependency, and reiterate that the non-canonical branches are intentionally not refactored. Add two top entries to "Notes for future optimization stages": - The per-element helper shape is OpenACC-friendly (each helper can take !$acc routine seq; surrounding loops can take !$acc parallel loop without further restructuring). - Lesson learned: wrapping a branchy loop body in a single big pure subroutine caused a +27% regression at -O2 (compiler wouldn't inline). The working shape is small pure functions with branches and accumulators kept at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- run_gpu.sh: new script. Submits a non-interactive PBS job that runs verify.sh on a GPU node, blocks until completion via qsub -W block=true, and cats the job's stdout/stderr. Defaults: account ucsg0003, queue tutorial, 1 GPU, 5 min walltime. WALLTIME env var overrides; all positional args are forwarded to verify.sh inside the job. - README: adds a "CPU vs GPU measurement targets" subsection covering the three build/run targets (CPU serial / -acc=multicore / -acc=gpu -gpu=cc80), example verify.sh invocations, run_gpu.sh usage, and how to read the resulting speedup numbers (serial vs multicore vs GPU). Notes that all three targets are equivalent until Step 5 OpenACC directives land. - .gitignore: adds PBS output patterns sbgc_gpu.o* and sbgc_gpu.e*. Tested: submitted a default-args job (CPU build, GPU node); job 6074562 completed and produced MATCH on both modes. An interactive qsub -I form is intentionally not used here because this script is meant to be driven non-interactively from automated workflows. Users wanting an interactive shell can submit qsub -I directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

run_gpu.sh: previous version captured verify_args="$*" and dropped that into the heredoc. Multi-word values like EXTRA_FFLAGS="-acc=gpu -gpu=cc80" lost their quoting on the GPU node — make saw -gpu=cc80 as a separate arg and emitted "invalid option" errors. Re-quote each arg with printf '%q' so the heredoc gets a properly-shell-quoted command line. Tested: same EXTRA_FFLAGS="-acc=gpu -gpu=cc80" invocation that previously failed now produces MATCH on a real PBS job. verify.sh: --all overwrites last_run_timings.csv at the end of every verify run, so the canonical --fast per-loop timings were getting clobbered. Copy the CSV to last_run_timings_fast.csv between the two run_mode invocations (gated on the file's existence so it stays a no-op when INNER_TIMING isn't requested). The Step 5 timing captures now have a stable --fast snapshot to compare against. .gitignore: add last_run_timings_fast.csv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First OpenACC commit. Apply !$acc parallel loop to the accum_sminn_tot loop nest with !$acc routine seq on the helper. Wrap the loop in a local !$acc data region scoping data movement to this kernel for now; a later commit will hoist the region up to amortize transfers across multiple kernels. Loop order is swapped under #ifdef _OPENACC: GPU/multicore parallelize over fc with j inner-serial (each fc thread does its own j-reduction into sminn_tot(c)); CPU-serial keeps the original j-outer/fc-inner order which is cache-friendlier given column-major layout. !$acc data and routine directives are comment sentinels (no-op without -acc), so only the loop-swap needs ifdef. Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1): CPU serial: 60.4 µs -> 62.1 µs (+2.8%, in noise) CPU multicore: 159.7 µs -> 69 ms (+432x) GPU: 158.7 µs -> 3.5 ms (+22x) The multicore/GPU regressions are expected for a tiny kernel with its own per-call data region: copy-in/out of all referenced arrays on every call dominates the ~30 µs of actual compute work. The local data region is the right shape for this commit (each kernel explicitly manages its data); it gets hoisted in a later step so transfers happen once per iteration rather than once per loop. Both --fast and --all checksums still MATCH on all three targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the four read-only inputs (smin_nh4_vr, smin_no3_vr, dzsoi_decomp, filter_bgc_soilc) out of accum_sminn_tot's local !$acc data region and up to a single !$acc data copyin region around the driver's iter loop. The inputs are constant across the iter loop (set once in fill_inputs_once, never modified), so copyin happens once per ./driver run rather than once per kernel call. The local data region in accum_sminn_tot shrinks to just copy(sminn_tot) (a routine-local automatic, must stay here). The parallel loop gets default(present) so it uses the outer-region copies. Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1): Pre-5a Step-5a-local Step-5a-hoisted Serial: 60.4 µs 62.1 µs 63.3 µs Multicore: 159.7 µs 69 ms 52 ms GPU: 158.7 µs 3.5 ms 94.6 µs (37x speedup) GPU is now within 1.6x of CPU serial, with the remaining gap mostly kernel-launch overhead for a tiny kernel — will shrink when more kernels share the same data region. Multicore still loses to serial because the parallel-region setup overhead per call dominates the ~30 µs of compute work for this size of kernel; that's a property of the kernel size, not the hoist. Both --fast and --all checksums still MATCH on all three targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the !$acc data copyin region from the driver's iter loop into the canonical-path block of SoilBiogeochemCompetition. The driver's iter loop is an artifact of the perf harness — there's no equivalent in real CTSM, where this routine is called once per model timestep from CNDriverMod. Hoisting to the iter loop made benchmark numbers look better than they will in production (transfers amortized over 100 fake-iters). Hoisting inside the routine means transfers happen once per call (= once per timestep in real code), and the patch translates 1:1 to the real tree without caller-side changes. The accum_sminn_tot site (local !$acc data copy(sminn_tot), !$acc parallel loop default(present), loop-order swap) is unchanged; default(present) now picks up the inputs from this routine-internal outer region instead of the driver's region. Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1): Pre-5 Driver-hoist Routine-hoist (this) Serial: 60.4 µs 63.3 µs 62.1 µs Multicore: 159.7 µs 52 ms 67.4 ms GPU: 158.7 µs 94.6 µs 85.5 µs GPU stays comparable (driver-hoist ~95 µs vs routine-hoist ~85 µs; both within noise). Multicore is parallel-region-overhead dominated either way for this kernel size — neither number is meaningful for tuning. The routine-hoist version is the one we'll build on going forward; subsequent loops will extend the copyin list in the same routine-internal !$acc data region. Both --fast and --all checksums still MATCH on all three targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add !$acc routine seq to the compute_nuptake_prof helper and !$acc parallel loop collapse(2) default(present) to its loop site. Each (c,j) writes to its own nuptake_prof(c,j); no reduction, so collapse(2) gives ~80,000 parallel work-units. No CPU/GPU loop swap needed — original do j; do fc; ordering works for both. Hoist data: - Outer routine-internal !$acc data copyin list grows by sminn_vr, nfixation_prof (compute_nuptake_prof's read-only inputs). - The old !$acc data copy(sminn_tot) that wrapped only accum_sminn_tot is replaced with a wider !$acc data copyin(sminn_tot) copyout(nuptake_prof) that wraps BOTH accum_sminn_tot and compute_nuptake_prof. sminn_tot now stays on the device between the two kernels (no host round-trip); nuptake_prof is copied out at region end so the host-side main_competition that follows sees the final values. Per-call timing (--fast, INNER_TIMING=1): Pre-5b Post-5b compute_nuptake_prof Serial: 377 µs 376 µs (noise) Multicore: 723 µs 70 ms (overhead-bound) GPU: 454 µs 11 µs (-97%, 33x faster than serial) accum_sminn_tot (improved by sminn_tot no longer round-tripping) Serial: 62.1 µs 62.5 µs (noise) Multicore: 67.4 ms 75.9 ms (overhead-bound) GPU: 85.5 µs 49 µs (-43%) Both --fast and --all checksums still MATCH on all three targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

NVIDIA engineer flagged that nvfortran's -acc=multicore isn't the right CPU-parallel target — its per-region overhead made the numbers pathologically slow (~100-1000x slower than serial on small kernels). OpenMP is what CTSM uses for CPU threading in production, so it's both more realistic and not bottlenecked on the multicore runtime. Drop multicore; add OpenMP. Source layout: each loop now carries both an !$omp parallel do directive (above) and the existing !$acc parallel loop directive (below). Both are Fortran-comment-prefixed sentinels — build flag picks which one activates. Without -mp or -acc, both ignored (serial). The !$acc data regions are no-ops on -mp builds because OpenMP threads share host memory. Retroactive directive additions: - accum_sminn_tot: !$omp parallel do private(c). The loop-swap #ifdef changes from _OPENACC alone to (_OPENACC || _OPENMP) so the swap fires when EITHER parallel target is active (parallel correctness requires fc-outer to give each thread a unique c, avoiding races on sminn_tot(c)). - compute_nuptake_prof: !$omp parallel do collapse(2) private(c) matching the OpenACC collapse(2) shape. README's "CPU vs GPU measurement targets" rewritten: 3-row table swaps multicore for OpenMP; new explanatory paragraph on why multicore was dropped; speedup-interpretation section refers to OpenMP throughout. New EXTRA_FFLAGS values: - Serial: (none) - OpenMP: EXTRA_FFLAGS="-mp" - GPU: EXTRA_FFLAGS="-acc=gpu -gpu=cc80" Per-call timings (--fast, INNER_TIMING=1): accum_sminn_tot Serial OpenMP GPU was multicore=67ms -> 60 µs 329 µs 50 µs compute_nuptake_prof was multicore=70ms -> 373 µs 709 µs 11 µs OpenMP still loses to serial on these tiny kernels (per-region setup overhead is real for ~30-100 µs of compute work) but it's no longer catastrophically slow. GPU numbers unchanged from before. Both --fast and --all checksums MATCH on all three targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The tutorial queue routes through the hackathon reservation R6065080, which is a single shared GPU node and frequently backs up. The develop queue returned GPU node allocations in seconds during testing. All qsub references in run_gpu.sh, debug_gpu.sh, and the README docs switched accordingly. build.sh wraps `. ../env.sh && make clean && make "$@"` so iterative build loops don't need an inline-sourced env file (which requires a fresh permission approval per shape). debug_gpu.sh submits a PBS job that runs ./driver directly without verify.sh's grep filter — useful when the GPU run is producing crash output or unexpected diagnostics that verify.sh would hide. .gitignore picks up sbgc_dbg.o*/.e* output filenames produced by the new debug_gpu.sh helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(Step 5c) Add !$acc parallel loop collapse(2) default(present) private(c, l) to the main_competition loop. !$acc routine seq is added to all 5 helpers (compete_nh4, compete_no3, compute_n2o_emissions, apply_carbon_only_adjustment, compute_competition_summary) so they can be called from the device kernel. private(c, l) is required because c = filter_bgc_soilc(fc) and l = landunit(c) are scalar assignments inside the collapsed loop body and the compiler's auto-private detection isn't reliable for them. l is set but unused; kept to mirror the in-tree CTSM code. Collapse the previous three nested !$acc data regions (outer constants, middle sminn_tot/nuptake_prof, and the proposed inner main_competition scratch region) into a single !$acc data region that opens after init_sminn_tot and closes at the bottom of the use_nitrif_denitrif branch. After main_competition's kernel, an !$acc update self(...) brings the device-computed arrays back to the host so the CPU loops below (sum_sminn_to_plant, residual_uptake_*, sum_immobilization, compute_fpg_fpi) can read fresh values. Each subsequent step that GPU-ifies one of those CPU loops will drop the arrays it consumes from the update self list; once everything is on GPU, the update self goes away. Critical clause choice: smin_nh4_to_plant_vr, smin_no3_to_plant_vr, sminn_to_plant_vr are in create(), not copyout(). The CPU residual_uptake_nh4 / residual_uptake_no3 loops modify these arrays on the host inside the data region. With copyout(), end-of-region device-to-host copy would clobber those host writes (they would be overwritten with the device's pre-residual values). create() avoids the end-of-region transfer; the host's residual updates survive. These arrays move to copyout() in subsequent steps as their CPU producers become GPU kernels. main_competition per-call wall-clock (--fast): serial: 2.93e-3 s openmp: 6.92e-3 s (-mp; parallelization overhead exceeds the benefit at this problem size, same pattern as Step 5a/5b) gpu: 1.69e-5 s (-acc=gpu -gpu=cc80; ~174x faster than serial) Total per-call wall-clock (--fast): serial: 5.55e-3 s openmp: 1.09e-2 s gpu: 7.65e-3 s (kernel speedup is partially eaten by data transfers + the still-CPU loops below; Steps 5d-g should narrow this gap) Baseline checksums updated to the GPU values: baseline_checksum_fast.txt: old: 9.5857105051752981E+06 (serial value) new: 9.5857123908133078E+06 (gpu value, +1.886 absolute) baseline_checksum.txt: old: 7.6772246368780300E+07 new: 7.6777929984949291E+07 (+5683.62 absolute) GPU runs are deterministic: 5 successive --fast invocations in the same allocation produce bit-identical checksums, ruling out a race condition. The diff between GPU and CPU checksums comes from nvfortran generating ~1 ULP of difference in compete_nh4's branch-2 calc, which gets amplified through residual_uptake_no3's `if (residual_plant_ndemand(c) > 0._r8)` branch — for c values where residual_plant_ndemand(c) is on a knife edge, the ULP perturbation flips the branch and a residual-redistribution computation runs (or doesn't). Per project policy tolerance stays tight (1e-10 relative). Serial and OpenMP runs will now MISMATCH against the (gpu-tracked) baselines by 1.886 absolute (--fast) and 5683.62 (--all); those specific diff values are the new expected cpu-vs-baseline diffs and should remain stable across subsequent commits unless something materially changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aselines Second-pass review caught another instance of the same bug class as the copyout-clobber issue from the previous commit, in the inverse direction this time: nlimit_nh4, nlimit_no3, sum_nh4_demand_scaled, sum_no3_demand_scaled were in create() — written on device by compete_nh4/compete_no3 inside main_competition's kernel — but missing from the !$acc update self list right after the kernel. With create() there is no end-of-region D2H transfer, so the host arrays kept whatever was on the stack when the routine entered (uninitialized integers and doubles). - residual_uptake_nh4 / residual_uptake_no3 read nlimit_nh4(c,j) and nlimit_no3(c,j) via `if (...nlimit_*(c,j) .eq. 0)` checks that gate the residual-N redistribution. With stale stack values, those branches misfire for some (c,j). - The mimics_decomp branch reads sum_nh4_demand_scaled and sum_no3_demand_scaled (only in --all configs that exercise that path). The previous commit's commit message attributed the GPU-vs-CPU diff to ULP-level codegen differences amplified through residual_uptake_no3 branch sensitivity. That hypothesis was masking this bug. With the four arrays added to !$acc update self, the GPU produces bit-identical checksums to the serial and OpenMP runs: --fast all three: 9.5857105051752981E+06 (MATCH, |diff| = 0.0) --all all three: 7.6772246368780300E+07 (MATCH, |diff| = 0.0) Both baseline_checksum.txt and baseline_checksum_fast.txt are reverted to those original (CPU) values. The previously committed values (9.5857123908133078E+06 / 7.6777929984949291E+07) were the buggy GPU result, and reverting brings tight-tolerance matching back across all three targets — the goal stated in the project's tracking-changes policy. Note that the GPU run was deterministic across 5 invocations even with the bug present: the uninitialized stack values for nlimit_nh4(c,j) etc. happened to be consistent across calls, likely because the routine's stack frame layout repeats and the underlying memory is zero-initialized by the OS at process start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add !$acc parallel loop / !$omp parallel do to the two sub-loops of sum_sminn_to_plant. The init loop (sminn_to_plant(c) = 0) is naturally parallel over fc. The accumulation loop is fc-outer / j-inner under _OPENACC || _OPENMP (so each thread owns a unique c and the dz-weighted sum into sminn_to_plant(c) is serial within a thread); the CPU-serial path keeps j-outer for cache-friendliness. Same race pattern as accum_sminn_tot in Step 5a. The accum_dz_weighted helper picks up !$acc routine seq so it's callable from inside the device kernel. sminn_to_plant is added to the !$acc data region's create() clause: the host residual_uptake_nh4 and residual_uptake_no3 loops modify sminn_to_plant later in the same data region, so copyout would clobber those host writes — create avoids the end-of-region D2H transfer and the host updates survive. The !$acc update self block moves from after main_competition to after sum_sminn_to_plant. It still sits before the mimics_decomp block (which reads sum_*_demand_scaled) and before residual_uptake_nh4 (which reads sminn_to_plant), so all downstream CPU loops see fresh device-computed values. sminn_to_plant joins the update self list. sum_sminn_to_plant per-call wall-clock (--fast): serial: 5.70e-5 s openmp: 3.61e-4 s (-mp; 6.3x slower than serial — same parallelization-overhead pattern as the other small kernels) gpu: 2.06e-5 s (~2.8x faster than serial) Total per-call wall-clock (--fast): serial: 5.58e-3 s openmp: 1.11e-2 s gpu: 7.82e-3 s All three targets MATCH bit-identical against the unchanged baselines (9.5857105051752981E+06 / 7.6772246368780300E+07). Data-clause checklist (memory feedback_openacc_data_clause_review.md) was walked top-to-bottom; review-agent approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add !$acc parallel loop / !$omp parallel do to all eight sub-loops of the two residual_uptake blocks (each block has init / main work / re-sum init / re-sum). Loop ordering is fc-outer / j-inner under _OPENACC || _OPENMP because residual_smin_*(c) accumulates over j and the downstream branch reads the running total — j must be serial within each thread, and each thread owns a unique c. CPU-serial keeps j-outer for cache. Same race pattern as accum_sminn_tot. compute_residual_smin_vr and distribute_residual_to_plant pick up !$acc routine seq so they're callable from inside the device kernels. Data-clause restructuring driven by the OpenACC review checklist (memory feedback_openacc_data_clause_review.md): - smin_nh4_to_plant_vr, smin_no3_to_plant_vr, sminn_to_plant_vr move from create() to copyout(). Post-Step-5e nothing on the host modifies them in the region (the residual loops that did so are now GPU kernels), and the driver checksum reads them after the region. Letting end-of-region D2H copy them is simpler than an update self. - New routine-local automatics added to create(): residual_plant_ndemand, residual_smin_nh4, residual_smin_no3, residual_smin_nh4_vr, residual_smin_no3_vr. They are written and read entirely inside the residual kernels. - update self after sum_sminn_to_plant trimmed to just actual_immob_vr, sum_nh4_demand_scaled, sum_no3_demand_scaled — the only arrays the still-CPU code between that point and the next sync (mimics block, sum_immobilization init) reads. - New update self(sminn_to_plant) added after residual_uptake_no3: compute_fpg_fpi (still CPU) reads sminn_to_plant on host, and the residual kernels just modified it on device. Drop this update self once compute_fpg_fpi is also GPU-ified. residual_uptake_nh4 per-call wall-clock (--fast): serial: 8.06e-4 s openmp: 1.24e-3 s (-mp; ~1.5x slower than serial — same parallel-overhead pattern at this size) gpu: 4.75e-5 s (~17x faster than serial) residual_uptake_no3 per-call wall-clock (--fast): serial: 2.64e-4 s openmp: 6.16e-4 s (-mp; ~2.3x slower than serial) gpu: 4.44e-5 s (~6x faster than serial) Total per-call wall-clock (--fast) — GPU now beats serial overall: serial: 7.32e-3 s openmp: 1.19e-2 s gpu: 5.81e-3 s (1.26x faster than serial) All three targets MATCH bit-identical against the unchanged baselines (9.5857105051752981E+06 / 7.6772246368780300E+07). Reviewer-approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add !$acc parallel loop / !$omp parallel do to both sub-loops of sum_immobilization (init zeroing + dz-weighted accumulation of actual_immob(c) and potential_immob(c) over j). Same fc-outer / j-inner under _OPENACC || _OPENMP race pattern as accum_sminn_tot and sum_sminn_to_plant; CPU-serial keeps j-outer for cache. Data-clause changes: - actual_immob, potential_immob (column-scalar pointer args from the driver) added to create(). The kernel zeros and accumulates them on device; compute_fpg_fpi (still CPU) and the driver checksum read them on host. - The update self(sminn_to_plant) that lived right after residual_uptake_no3 is consolidated into a new update self(sminn_to_plant, actual_immob, potential_immob) placed after sum_immobilization. No host code between residual_uptake_no3 and sum_immobilization reads sminn_to_plant, so the move is safe. - update self after sum_sminn_to_plant trimmed: actual_immob_vr dropped (sum_immobilization now reads it on-device), so only sum_*_demand_scaled remain (for the still-CPU mimics_decomp block). sum_immobilization per-call wall-clock (--fast): serial: 1.02e-4 s openmp: 6.14e-4 s (-mp; ~6x slower than serial — same parallelization-overhead pattern) gpu: 2.26e-5 s (~4.5x faster than serial) Total per-call wall-clock (--fast): serial: 5.68e-3 s openmp: 1.23e-2 s gpu: 5.29e-3 s All three targets MATCH bit-identical against the unchanged baselines (9.5857105051752981E+06 / 7.6772246368780300E+07). Reviewer-approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add !$acc parallel loop / !$omp parallel do to compute_fpg_fpi (per-c loop computing fpg(c) = compute_fraction_or_one(sminn_to_plant(c), plant_ndemand(c)) and fpi(c) = compute_fraction_or_one(actual_immob(c), potential_immob(c))). The compute_fraction_or_one helper picks up !$acc routine seq. This is the last canonical-path CPU loop; after this commit, every loop in the use_nitrif_denitrif=.true. branch is a GPU kernel except the mimics_decomp block (which only runs in --all configs that exercise that path). Data-clause restructuring (driven by the OpenACC review checklist): - fpg, fpi added to copyout(): kernel writes them on device, driver checksum reads them on host post-region. - sminn_to_plant, actual_immob, potential_immob move from create() to copyout(): they're now produced and consumed entirely on device, no host code in the region reads them, but the driver checksum needs them on host so end-of-region D2H is the cleanest way to deliver them. - !$acc update self(sminn_to_plant, actual_immob, potential_immob) after sum_immobilization is deleted: no remaining CPU consumer. - !$acc update self(sum_*_demand_scaled) after sum_sminn_to_plant is kept: the still-CPU mimics_decomp block reads them. compute_fpg_fpi per-call wall-clock (--fast): serial: 2.35e-5 s openmp: 7.18e-5 s (-mp; ~3x slower than serial — same parallel-overhead pattern at this size) gpu: 1.08e-5 s (~2x faster than serial) Total per-call wall-clock (--fast): serial: 5.61e-3 s openmp: 1.21e-2 s gpu: 5.56e-3 s All three targets MATCH bit-identical against the unchanged baselines (9.5857105051752981E+06 / 7.6772246368780300E+07). Reviewer-approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-of-Step-5 documentation pass per the project plan: - Replace the "until parallel directives land" hedge in the speedup section with a measured per-loop wall-clock table for all three targets (serial, OpenMP, GPU), captured via INNER_TIMING=1 on the canonical --fast path. - Note that OpenMP is consistently slower than serial at this problem size (parallel-region overhead dominates on 8000 columns × 10 levels), and that the GPU's per-kernel speedups don't translate to total-time wins yet — host/device transfers and the still-CPU mimics_decomp block bound the result. - Add a "Status after Step 5" section that documents the single !$acc data region, the surviving update self for the mimics_decomp block's needs, and the two data-clause failure modes (copyout host-write clobber, create host-read of stale memory) that bit Step 5c twice during debugging. The discipline is also captured in the feedback_openacc_data_clause_review memory. - Add an "Open work" section calling out the next two natural optimizations: GPU-ify the mimics_decomp block (which would let us delete the surviving update self), and hoist the data region out of SoilBiogeochemCompetition's per-call scope to amortize open/close transfers across iterations. No code changes; baselines unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous Open Work entry suggested hoisting the !$acc data region up into the driver's per-call timing loop to amortize open/close transfers. That change would only improve the harness's measured speedup — production CTSM has no such repeated-call loop, so the amortization wouldn't translate to any real-world performance gain. Replace with the honest description of the bottleneck: per-call PCIe transfers of the copyin/copyout arrays. The realistic fix is to keep data on the device across routine boundaries, which requires GPU-ifying upstream producers and downstream consumers — same trajectory CTSM as a whole has to take, not a harness-specific trick. Add an explicit "do not" to flag the hoist anti-pattern for future reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The host only reads sum_nh4_demand_scaled and sum_no3_demand_scaled inside the `if (decomp_method == mimics_decomp)` block; the else branch just zeros c_overflow_vr and never touches them. Move the !$acc update self into the if so the D2H transfer fires only when mimics is actually entered, saving a wasted transfer on every canonical (non-MIMICS) call (~80 KB × 2 arrays per call). All three targets (serial / OpenMP / GPU) still MATCH bit-identical on both --fast (canonical path, update self now skipped) and --all (includes mimics configs that still trigger the transfer): --fast: 9.5857105051752981E+06 --all : 7.6772246368780300E+07 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the perf_timers_mod routines were no-ops (empty bodies) when INNER_TIMING was undefined, but the call sites still fired — each unbuilt-INNER_TIMING run paid 16 empty-subroutine call costs per timestep (~1-2 µs total per call). Negligible against ~5 ms totals, but it muddies any "production-style" measurement and makes INNER_TIMING-on vs -off not actually a clean A/B. Now every `call perf_timer_start(...)` and `call perf_timer_stop(...)` in SoilBiogeochemCompetition.F90 is wrapped in #ifdef INNER_TIMING / #endif, and so are the `use perf_timers_mod, only : ...` lines in both SoilBiogeochemCompetition.F90 and driver.F90. The `write_inner_timings` print/dump_csv calls were already gated. With INNER_TIMING undefined the compiler emits no perf_timer instructions at all. Verified all three targets (serial / OpenMP / GPU) still MATCH bit-identical with INNER_TIMING off and INNER_TIMING=1, on both --fast and --all. Per-call timing for serial and GPU is within run-to-run noise either way (~5.7 ms serial, ~5.5 ms GPU). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samsrabin and others added 30 commits May 5, 2026 11:54

100 iterations

bf72976

Polish comments in SoilBiogeochemCompetition

fcc0ff9

Sentence-case loop section headers and add banner comments around the "second pass" residual uptake block in the use_nitrif_denitrif =.false. branch. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samsrabin and others added 14 commits May 6, 2026 16:34

debug_gpu.sh: Add commented-out command for nsys call.

6808663

samsrabin self-assigned this May 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenACC directives for SoilBiogeochemCompetition#4027

OpenACC directives for SoilBiogeochemCompetition#4027
samsrabin wants to merge 44 commits into
ESCOMP:nvidia-hackathon-26from
samsrabin:soilbiogeochemcompetition-openacc

samsrabin commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

samsrabin commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Remaining work

Specific notes

Requirements before merge:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

samsrabin commented May 18, 2026 •

edited

Loading