Improve / Add GPU documentation

davidrohr · davidrohr · commit 2a11afc3af82 · 2025-04-25T11:55:17.000+02:00
diff --git a/GPU/documentation/README.md b/GPU/documentation/README.md
@@ -0,0 +1,13 @@
+[build-O2.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-O2.md) :
+- Instructions on how to build O2 with GPU support.
+- Description of the CMake variables used.
+
+[build-standalone.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-standalone.md) :
+- Instructions on how to build and run the standalone benchmark.
+- Instructions on how to extract data sets for the standalone benchmark from real data or using simulation.
+
+[deterministic-mode.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/deterministic-mode.md) :
+- Instructions on how to use the deterministic mode for both the standalone benchmark and O2.
+
+[run-time-compilation.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/run-time-compilation.md) :
+- Instructions on how to use run time compilation (RTC) for the GPU code.
diff --git a/GPU/documentation/build-O2.md b/GPU/documentation/build-O2.md
@@ -12,17 +12,17 @@ If you just want to reproduce the GPU build locally without running it, it might
 The provisioning script of the container also demonstrates which patches need to be applied such that everything works correctly.
 
 *GPU Tracking with CUDA*
- * The CMake option -DENABLE_CUDA=ON/OFF/AUTO steers whether CUDA is forced enabled / unconditionally disabled / auto-detected.
- * The CMake option -DCUDA_COMPUTETARGET= fixes a GPU target, e.g. 61 for PASCAL or 75 for Turing (if unset, it compiles for the lowest supported architecture)
+ * The CMake option `-DENABLE_CUDA=ON/OFF/AUTO` steers whether CUDA is forced enabled / unconditionally disabled / auto-detected.
+ * The CMake option `-DCUDA_COMPUTETARGET=...` fixes a GPU target, e.g. 61 for Pascal or 75 for Turing (if unset, it compiles for the lowest supported architecture).
  * CUDA is detected via the CMake language feature, so essentially nvcc must be in the Path.
- * We require CUDA version >= 11.2
+ * We require CUDA version >= 12.8
  * CMake will report "Building GPUTracking with CUDA support" when enabled.
 
 *GPU Tracking with HIP*
  * HIP and HCC must be installed, and CMake must be able to detect HIP via find_package(hip).
- * If HIP and HCC are not installed to /opt/rocm, the environment variables $HIP_PATH and $HCC_HOME must point to the installation directories.
+ * If HIP and HCC are not installed to /opt/rocm, the environment variables `$HIP_PATH` and `$HCC_HOME` must point to the installation directories.
  * HIP from ROCm >= 4.0 is required.
- * The CMake option -DHIP_AMDGPUTARGET= forces a GPU target, e.g. gfx906 for Radeon VII (if unset, it auto-detects the GPU).
+ * The CMake option `-DHIP_AMDGPUTARGET=...` forces a GPU target, e.g. gfx906 for Radeon VII (if unset, it auto-detects the GPU).
  * CMake will report "Building GPUTracking with HIP support" when enabled.
  * It may be that some patches must be applied to ROCm after the installation. You find the details in the provisioning script of the GPU CI container below.
 
@@ -49,14 +49,14 @@ The provisioning script of the container also demonstrates which patches need to
  * The docker images is `alisw/slc8-gpu-builder`.
  * The container exports the `ALIBUILD_O2_FORCE_GPU` env variable, which force-enables all GPU builds.
  * Note that it might not be possible out-of-the-box to run the GPU version from within the container. In case of HIP it should work when you forwards the necessary GPU devices in the container. For CUDA however, you would either need to (in addition to device forwarding) match the system CUDA driver and toolkit installation to the files present in the container, or you need to use the CUDA docker runtime, which is currently not installed in the container.
- * There are currently some patches needed to install all the GPU backends in a proper way and together. Please refer to the container provisioning script https://github.com/alisw/docks/blob/master/slc9-gpu-builder/provision.sh. If you want to reproduce the installation locally, it is recommended to follow the steps from the script.
+ * There are currently some patches needed to install all the GPU backends in a proper way and together. Please refer to the container provisioning script [provision.sh](https://github.com/alisw/docks/blob/master/slc9-gpu-builder/provision.sh). If you want to reproduce the installation locally, it is recommended to follow the steps from the script.
 
 *Summary*
 
 If you want to enforce the GPU builds on a system without GPU, please set the following CMake settings:
- * ENABLE_CUDA=ON
- * ENABLE_HIP=ON
- * ENABLE_OPENCL=ON
- * HIP_AMDGPUTARGET=gfx906;gfx908
- * CUDA_COMPUTETARGET=86 89
-Alternatively you can set the environment variables ALIBUILD_ENABLE_CUDA and ALIBUILD_ENABLE_HIP to enforce building CUDA or HIP without modifying the alidist scripts.
+ * `ENABLE_CUDA=ON`
+ * `ENABLE_HIP=ON`
+ * `ENABLE_OPENCL=ON`
+ * `HIP_AMDGPUTARGET=default`
+ * `CUDA_COMPUTETARGET=default`
+Alternatively you can set the environment variables `ALIBUILD_ENABLE_CUDA=1` and `ALIBUILD_ENABLE_HIP=1` to enforce building CUDA or HIP without modifying the alidist scripts.
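
As an illustration of the summary settings in build-O2.md, the following shell sketch assembles them into a CMake command line and sets the aliBuild environment variables. The `-D` spelling follows the CMake options quoted earlier in that file. The `O2_SOURCE_DIR` variable and its placeholder value are assumptions made for illustration only, and the script prints the command instead of running it.

```shell
# Hypothetical placeholder, not from the documentation: path to an O2 checkout.
O2_SOURCE_DIR="${O2_SOURCE_DIR:-/path/to/O2}"

# Settings copied from the summary list in build-O2.md.
GPU_FLAGS="-DENABLE_CUDA=ON -DENABLE_HIP=ON -DENABLE_OPENCL=ON"
GPU_FLAGS="$GPU_FLAGS -DHIP_AMDGPUTARGET=default -DCUDA_COMPUTETARGET=default"

# Print the resulting command line for inspection; it is not executed here.
echo "cmake $GPU_FLAGS $O2_SOURCE_DIR"

# Alternative when building through aliBuild: enforce the CUDA and HIP builds
# via environment variables instead of editing the alidist scripts.
export ALIBUILD_ENABLE_CUDA=1
export ALIBUILD_ENABLE_HIP=1
```

Printing the command first makes it easy to check the flag list before starting a long configure step.
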
diff --git a/GPU/documentation/build-standalone.md b/GPU/documentation/build-standalone.md
@@ -30,7 +30,7 @@ nano config.cmake # edit config file to enable / disable dependencies as needed.
 make install -j32
 ```
 
-You can edit certain build settings in `config.cmake`. Some of them are identical to the GPU build settings for O2, as described in O2-786.
+You can edit certain build settings in `config.cmake`. Some of them are identical to the GPU build settings for O2, as described in [build-O2.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-O2.md).
 And there are plenty of additional settings to enable/disable event display, qa, usage of ROOT, FMT, etc. libraries.
 
 This will create the `ca` binary in `~/standalone`, which is basically the same as the `o2-gpu-standalone-benchmark`, but built outside of O2.
@@ -68,7 +68,7 @@ This will dump the event data to the local folder, all dumped files have a `.dum
 
 Data can be dumped from raw data, or from MC data, e.g. generated by the Full System Test. In case of MC data, also MC labels are dumped, such that they are used in the `./ca --qa` mode.
 
-To get a dump from simulated data, please run e.g. the FST simulation as described in O2-2633.
+To get a dump from simulated data, please run e.g. the FST simulation as described in [full-system-test-setup.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md).
 A simple run as
 ```
 DISABLE_PROCESSING=1 NEvents=5 NEventsQED=100 SHMSIZE=16000000000 $O2_ROOT/prodtests/full_system_test.sh
diff --git a/GPU/documentation/deterministic-mode.md b/GPU/documentation/deterministic-mode.md
@@ -0,0 +1,31 @@
+The TPC tracking code is not fully deterministic, i.e. running multiple times on the same data set might yield a slightly different number of tracks on the O(per mille) level.
+- This comes from concurrency, i.e. when tracks are processed in parallel, the output order might change, which might have small effects on the subsequent steps.
+- Compile options and optimizations also play a role, e.g. using `-ffast-math` or fused multiply-add might slightly change floating-point rounding, and in rare cases lead to the acceptance or rejection of a track, and thus to a different number of tracks.
+
+For debugging, testing, and validation, a deterministic mode is implemented, which should yield 100% reproducible results, on CPU and on GPU and when running multiple times.
+It uses a combination of
+- Compile time options, e.g. disabling all optimizations that change floating point rounding.
+- Run time options, e.g. to use deterministic sorting and to add sorting steps after kernels, so that the output, including intermediate outputs, is deterministic.
+
+This is steered by 3 options:
+- The `-DGPUCA_DETERMINISTIC_MODE` CMake setting : Compile-time setting.
+- The `--PROCdeterministicGPUReconstruction` command line option / `GPU_proc.deterministicGPUReconstruction` `--configKeyValue` setting : Run time setting.
+- The `--RTCdeterministic` command line option / `GPU_proc_rtc.deterministic` `--configKeyValue` setting. (Auto-enabled by the `deterministicGPUReconstruction` setting.) : Compile-time setting for RTC code.
+
+In order to be fully deterministic, all settings must be enabled, where the RTC setting is automatically enabled if not explicitly disabled.
+
+`GPUCA_DETERMINISTIC_MODE` has multiple levels, which are described here: [FindO2GPU.cmake](https://github.com/AliceO2Group/AliceO2/blob/80a80a17f5a1d9cb77743e2a39b15b653fe1a4f9/dependencies/FindO2GPU.cmake#L72).
+- In order to have fully deterministic GPUReconstruction (i.e. all algorithms that come with the GPUTracking library, like TPC tracking), the level `GPUCA_DETERMINISTIC_MODE=GPU` is needed.
+- In order to apply it to all of O2, e.g. for ITS tracking, please use `GPUCA_DETERMINISTIC_MODE=WHOLEO2`
+
+Enabling the options is a bit different for O2 and for the standalone benchmark:
+- For enabling it in the standalone benchmark, please set `GPUCA_DETERMINISTIC_MODE=GPU` in [config.cmake](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/GPUTracking/Standalone/cmake/config.cmake) and use the command line argument `--PROCdeterministicGPUReconstruction 1`.
+- For O2, either add `set(GPUCA_DETERMINISTIC_MODE GPU)` to the beginning of the [GPU CMakeLists.txt](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/CMakeLists.txt) or add `set(GPUCA_DETERMINISTIC_MODE WHOLEO2)` to the beginning of the [Global CMakeLists.txt](https://github.com/AliceO2Group/AliceO2/blob/dev/CMakeLists.txt), and use the `configKeyValue` `GPU_proc.deterministicGPUReconstruction`. In order to enable this for the Full-System-Test or with [dpl-workflow.sh](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/dpl-workflow.sh), please export `CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow=GPU_proc.deterministicGPUReconstruction=1;`.
+
+With these settings, the number of clusters and the number of tracks should always be identical across multiple runs.
+Note that this incurs a significant performance penalty during the processing. Therefore, the deterministic mode is not compiled in by default: it must be enabled explicitly and the code must be recompiled.
+
+Beyond comparing only the number of clusters and number of tracks, it is also possible to compare intermediate results. To do so, please use the standalone benchmark (either `./ca` or `o2-gpu-standalone-benchmark` binary) with the `--debug 6` option.
+It will create a dump containing all (most) intermediate results in text form, which can be compared. The output file is called `CPU.out` when using the CPU backend, and `GPU.out` for the GPU backend.
+Note that with `--debug 6` the dump files will be huge, and the processing will be slow and consume much more memory than normal. It has been tested with datasets containing up to 50 Pb-Pb collisions, and might fail for larger data.
+If the deterministic mode is used with both compile-time and run-time activation, the dump files should be 100% identical and can simply be compared with `diff`.
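
The final comparison step described in deterministic-mode.md can be wrapped in a small helper. This is a minimal sketch, assuming the two dump files (`CPU.out` and `GPU.out`, as named in that file) have already been produced by separate runs. `compare_dumps` is a hypothetical helper name and not part of O2.

```shell
# compare_dumps FILE_A FILE_B
# Hypothetical helper: prints "identical" and succeeds if both dump files
# match byte for byte; otherwise prints "different" plus the first lines of
# the diff and returns a non-zero status.
compare_dumps() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
    # The dumps can be huge, so only show the beginning of the diff.
    diff "$1" "$2" | head -n 20
    return 1
  fi
}
```

A typical call would be `compare_dumps CPU.out GPU.out`. Using `cmp -s` first avoids producing a large diff when the files match.
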
diff --git a/GPU/documentation/run-time-compilation.md b/GPU/documentation/run-time-compilation.md
@@ -0,0 +1,21 @@
+Run time compilation is a feature of the GPUReconstruction library, which can recompile the GPU code for HIP and for CUDA at runtime, and apply some optimizations and changes. It is planned to add support for CPU code and OpenCL code in the future.
+
+The changes that can be applied are:
+- `constexpr` optimization: configuration values that are constant during the processing are replaced by `constexpr` expressions, which allows the compiler to optimize the code better. Benchmarks in 2024 have shown a 5% performance improvement with CUDA and a 2% improvement with HIP.
+- Disabling of unused code: this is currently used in particular to remove the TPC code for V/M shape correction during online processing, which simplifies the code and yields better compiler optimization, for a 20%-30% speedup on the MI50 GPUs.
+- Use different GPU constant parameters / launch bounds: These are tuning parameters, which are architecture-dependent. The default values are taken from the first architecture the GPU code is compiled for in the normal compilation phase. If the architecture we are running on is different, different parameters can be loaded for RTC.
+- Compiling for different target architectures: this allows running on hardware for which the code was not compiled in the original compilation.
+
+Generally, RTC is enabled via the `--RTCenable` flag for the standalone benchmark, or via the `GPU_proc_rtc.enable=1` `configKeyValue` for O2.
+For a list of RTC options, please see [GPUSettingsList.h](https://github.com/AliceO2Group/AliceO2/blob/80a80a17f5a1d9cb77743e2a39b15b653fe1a4f9/GPU/GPUTracking/Definitions/GPUSettingsList.h#L215).
+
+Caching the output:
+- The RTC output can be cached and reused, so that when running multiple times, compilation is not repeated. This is enabled via the `--RTCcacheOutput` setting. The folder to store the cache files can be selected via `--RTCTECHcacheFolder`, and with `--RTCTECHcacheMutex` (default: enabled), a file-lock mutex can be used to synchronize access to the cache folder. The cached code is checked against the to-be-compiled source code with SHA1 hashes. The cache is used only if the code has not changed; otherwise the code is recompiled and the cache is updated. It is possible to force using outdated cache files via the `--RTCTECHignoreCacheValid` option.
+
+For changing the launch bounds and other parameters, please consider `--RTCTECHloadLaunchBoundsFromFile` (and `--RTCTECHprintLaunchBounds`), which can load a parameter set that can be created via [dumpGPUDefParam.C](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/GPUTracking/Standalone/tools/dumpGPUDefParam.C). A set of default parameters is stored in `[INSTALL_FOLDER]/share/GPU`.
+
+It is possible to select a different target architecture for the compilation via `--RTCTECHoverrideArchitecture`, and the compilation can be prepended by a command with `--RTCTECHprependCommand`, e.g. for CPU pinning. See for example [dpl-workflow.sh](https://github.com/AliceO2Group/AliceO2/blob/80a80a17f5a1d9cb77743e2a39b15b653fe1a4f9/prodtests/full-system-test/dpl-workflow.sh#L335).
+
+`--RTCdeterministic` enables the [Deterministic Mode](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/deterministic-mode.md) (compile-time setting) for RTC. Usually you don't need to bother, as for the deterministic mode it is autoenabled from `--PROCdeterministicGPUReconstruction`, but the explicit `--RTCdeterministic` is available for tests.
+
+Finally, `--RTCoptConstexpr` and `--RTCoptSpecialCode` enable the constexpr and code removal optimizations. For an example of how the TPC V/M shape corrections are removed, see [TPCFastTransform.h](https://github.com/AliceO2Group/AliceO2/blob/fc3ace17eca580c338751163ef4528e3ec47f9d6/GPU/TPCFastTransformation/TPCFastTransform.h#L445).
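
The cache validation described in run-time-compilation.md (reuse the cached output only if the SHA1 hash of the source is unchanged) can be illustrated with a generic shell sketch. This is not the GPUReconstruction implementation: the function names, the separate hash file, and the use of `sha1sum` from GNU coreutils are all assumptions chosen for illustration.

```shell
# Hypothetical helpers, not part of O2: illustrate a SHA1-guarded cache.

# cache_is_valid SOURCE HASH_FILE
# Succeeds only if HASH_FILE exists and stores the SHA1 of SOURCE.
cache_is_valid() {
  [ -f "$2" ] || return 1
  current=$(sha1sum "$1" | cut -d ' ' -f 1)
  stored=$(cat "$2")
  [ "$current" = "$stored" ]
}

# cache_update SOURCE HASH_FILE
# Records the SHA1 of SOURCE, to be called after a (re)compilation.
cache_update() {
  sha1sum "$1" | cut -d ' ' -f 1 > "$2"
}
```

A caller would recompile and then call `cache_update` whenever `cache_is_valid` fails, which mirrors the "recompile and update the cache" behaviour described in the document.
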
diff --git a/prodtests/full-system-test/documentation/README.md b/prodtests/full-system-test/documentation/README.md
@@ -0,0 +1,17 @@
+[full-system-test.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test.md) :
+- Full system test quick start guide
+
+[full-system-test-setup.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md) :
+- More detailed description of full-system-test scripts, simulation of data set, and script to run the workflow
+
+[full-system-test-as-stress-test.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md) :
+- Details on how to use the full system test as stress test and for validation of an EPN online compute node
+
+[dpl-workflow-options.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/dpl-workflow-options.md) :
+- Description of the main workflow script [dpl-workflow.sh](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/dpl-workflow.sh) and its options.
+
+[env-variables.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/env-variables.md) :
+- List of common environment variables used by the workflow scripts (defaults set by https://github.com/davidrohr/O2DPG/blob/master/DATA/common/setenv.sh)
+
+[raw-tf-conversion.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/raw-tf-conversion.md) :
+- This is now automated in a script. For reference, this document details how readout files are converted to a .tf file for use in the full system test with replay from DataDistribution.
diff --git a/prodtests/full-system-test/documentation/env-variables.md b/prodtests/full-system-test/documentation/env-variables.md
@@ -1,4 +1,4 @@
-The `setenv-sh` script sets the following environment options
+The [setenv.sh](https://github.com/davidrohr/O2DPG/blob/master/DATA/common/setenv.sh) script sets the following environment options:
 * `NTIMEFRAMES`: Number of time frames to process.
 * `TFDELAY`: Delay in seconds between publishing time frames (1 / rate).
 * `NGPUS`: Number of GPUs to use, data distributed round-robin.
@@ -25,7 +25,7 @@ The `setenv-sh` script sets the following environment options
 * `EXTINPUT`: Receive input from raw FMQ channel instead of running o2-raw-file-reader.
   * 0: `dpl-workflow.sh` can run as standalone benchmark, and will read the input itself.
   * 1: To be used in combination with either `datadistribution.sh` or `raw-reader.sh` or with another DataDistribution instance.
-* `CTFINPUT`: Read input from CTF ROOT file. This option is incompatible to EXTINPUT=1. The CTF ROOT file can be stored via SAVECTF=1.
+* `CTFINPUT`: Read input from CTF ROOT file. This option is incompatible with `EXTINPUT=1`. The CTF ROOT file can be stored via `SAVECTF=1`.
 * `NHBPERTF`: Time frame length (in HBF)
 * `GLOBALDPLOPT`: Global DPL workflow options appended to o2-dpl-run.
 * `EPNPIPELINES`: Set default EPN pipeline multiplicities.
diff --git a/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md b/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md
@@ -7,7 +7,7 @@ This is a quick summary how to run the full system test (FST) as stress test on
   - Enter the O2PDPSuite environment either vie `alienv enter O2PDPSuite/latest Readout/latest`.
   - Go to an empty directory.
   - Run the FST simulation via: `NEvents=650 NEventsQED=10000 SHMSIZE=128000000000 TPCTRACKERSCRATCHMEMORY=40000000000 SPLITTRDDIGI=0 GENERATE_ITSMFT_DICTIONARIES=1 $O2_ROOT/prodtests/full_system_test.sh`
-  - Get a current matbud.root (e.g. from here https://alice.its.cern.ch/jira/browse/O2-2288) and place it in that folder.
+  - The material budget table (previously obtained e.g. from https://alice.its.cern.ch/jira/browse/O2-2288) now comes from CCDB, so there is no longer any need to fetch it manually.
   - Create a timeframe file from the raw files: `$O2_ROOT/prodtests/full-system-test/convert-raw-to-tf-file.sh`.
   - Prepare the ramdisk folder: `mv raw/timeframe raw/timeframe-org; mkdir raw/timeframe-tmpfs; ln -s timeframe-tmpfs raw/timeframe`
 
diff --git a/prodtests/full-system-test/documentation/full-system-test-setup.md b/prodtests/full-system-test/documentation/full-system-test-setup.md
diff --git a/prodtests/full-system-test/documentation/raw-tf-conversion.md b/prodtests/full-system-test/documentation/raw-tf-conversion.md