
Commit 8ed4d10

GPU: Add documentation
1 parent 3231971 commit 8ed4d10

File tree

3 files changed: +148 −0 lines changed

GPU/documentation/README.md

Whitespace-only changes.

GPU/documentation/build-O2.md

Lines changed: 62 additions & 0 deletions

This ticket serves as documentation for how to enable the individual GPU features, and collects related issues.

So far, the following features exist:

* GPU Tracking with CUDA
* GPU Tracking with HIP
* GPU Tracking with OpenCL (>= 2.1)
* OpenGL visualization of the tracking
* ITS GPU tracking

GPU support should be detected and enabled automatically.
If you just want to reproduce the GPU build locally without running it, the easiest way is to use the GPU CI container (see below).
The provisioning script of the container also demonstrates which patches need to be applied so that everything works correctly.

*GPU Tracking with CUDA*
* The CMake option -DENABLE_CUDA=ON/OFF/AUTO steers whether CUDA is force-enabled / unconditionally disabled / auto-detected.
* The CMake option -DCUDA_COMPUTETARGET= fixes a GPU target, e.g. 61 for Pascal or 75 for Turing (if unset, it compiles for the lowest supported architecture).
* CUDA is detected via the CMake language feature, so essentially nvcc must be in the PATH.
* We require CUDA version >= 11.2.
* CMake will report "Building GPUTracking with CUDA support" when enabled.

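For illustration, a configure step that force-enables CUDA for a Turing GPU might look as follows (a minimal sketch; the source path is a placeholder for your O2 checkout):
```
# Sketch: out-of-source build with CUDA forced on for Turing (compute capability 7.5).
mkdir -p build && cd build
cmake -DENABLE_CUDA=ON -DCUDA_COMPUTETARGET=75 ~/alice/O2
make -j$(nproc)
```
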
*GPU Tracking with HIP*
* HIP and HCC must be installed, and CMake must be able to detect HIP via find_package(hip).
* If HIP and HCC are not installed to /opt/rocm, the environment variables $HIP_PATH and $HCC_HOME must point to the installation directories.
* HIP from ROCm >= 4.0 is required.
* The CMake option -DHIP_AMDGPUTARGET= forces a GPU target, e.g. gfx906 for Radeon VII (if unset, it auto-detects the GPU).
* CMake will report "Building GPUTracking with HIP support" when enabled.
* Some patches may need to be applied to ROCm after the installation; you find the details in the provisioning script of the GPU CI container below.

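Analogously, a sketch of a configure step forcing HIP for a Radeon VII, with ROCm in a non-default location (the ROCm path is an assumption for illustration):
```
# Sketch: point HIP detection at a non-default ROCm install and force the HIP build.
export HIP_PATH=/opt/rocm-4.0.0/hip   # assumption: example install path
export HCC_HOME=/opt/rocm-4.0.0/hcc
cmake -DENABLE_HIP=ON -DHIP_AMDGPUTARGET=gfx906 ~/alice/O2
```
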
*GPU Tracking with OpenCL (Needs Clang >= 18 for compilation)*
* Needs an OpenCL library with version >= 2.1, detectable via CMake find_package(OpenCL).
* Needs the SPIR-V LLVM translator together with LLVM to create the SPIR-V binaries, also detectable via CMake.

*OpenGL visualization of TPC tracking*
* Needs the following libraries (all detectable via CMake find_package): libOpenGL, libGLEW, libGLFW, libGLU.
* OpenGL must be at least version 4.5, but this is not detectable at CMake time. If the supported OpenGL version is lower, the display is not built or only partially built, and is not available at runtime. (Whether it is not built or only partially built depends on whether the maximum OpenGL version supported by GLEW or that of the system runtime is insufficient.)
* Note: If ROOT does not detect the system GLEW library, ROOT will install its own, very outdated GLEW library, which is insufficient for the display. Since the ROOT include path comes first in the include order, this will prevent the display from being built.
* CMake will report "Building GPU Event Display" when enabled.

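Since the OpenGL version cannot be checked at CMake time, one way to verify it manually on the target system (assuming the common glxinfo utility is installed):
```
# Should report a version >= 4.5 for the event display to be fully usable at runtime.
glxinfo | grep "OpenGL core profile version"
```
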
*Vulkan visualization*
* Similar to the OpenGL visualization, but using Vulkan.

*ITS GPU Tracking*
* So far supports only CUDA and HIP; support for OpenCL might come.
* The build is enabled when "GPU Tracking with CUDA" (as explained above) detects CUDA; the same holds for HIP.
* CMake will report "Building ITS CUDA tracker" when enabled, and likewise for HIP.

*Using the GPU CI container*
* Setting up everything locally might be somewhat time-consuming; instead you can use the GPU CI Docker container.
* The Docker image is `alisw/slc8-gpu-builder`.
* The container exports the `ALIBUILD_O2_FORCE_GPU` environment variable, which force-enables all GPU builds.
* Note that it might not be possible out-of-the-box to run the GPU version from within the container. For HIP it should work if you forward the necessary GPU devices into the container. For CUDA, however, you would either need to (in addition to device forwarding) match the system CUDA driver and toolkit installation to the files present in the container, or use the CUDA Docker runtime, which is currently not installed in the container.
* There are currently some patches needed to install all the GPU backends properly and together. Please refer to the container provisioning script https://github.com/alisw/docks/blob/master/slc9-gpu-builder/provision.sh. If you want to reproduce the installation locally, it is recommended to follow the steps from that script.

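As an illustration, starting the container with the AMD GPU devices forwarded for a HIP run could look like this (a sketch; the device nodes and image tag depend on your system):
```
# Sketch: forward the ROCm device nodes (/dev/kfd, /dev/dri) into the container.
docker run -it --device=/dev/kfd --device=/dev/dri alisw/slc8-gpu-builder:latest /bin/bash
```
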
*Summary*

If you want to enforce the GPU builds on a system without a GPU, please set the following CMake settings:
* ENABLE_CUDA=ON
* ENABLE_HIP=ON
* ENABLE_OPENCL=ON
* HIP_AMDGPUTARGET=gfx906;gfx908
* CUDA_COMPUTETARGET=86 89

Alternatively, you can set the environment variables ALIBUILD_ENABLE_CUDA and ALIBUILD_ENABLE_HIP to enforce building CUDA or HIP without modifying the alidist scripts.

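Put together, a configure line that force-enables all GPU backends on a machine without a GPU could look like this (a sketch; the quotes protect the semicolon-separated target list):
```
cmake -DENABLE_CUDA=ON -DENABLE_HIP=ON -DENABLE_OPENCL=ON \
      -DHIP_AMDGPUTARGET="gfx906;gfx908" \
      -DCUDA_COMPUTETARGET="86 89" \
      ~/alice/O2
```
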
Lines changed: 86 additions & 0 deletions

This ticket describes how to build the O2 GPU TPC Standalone benchmark (in its two build types), and how to run it.

The purpose of the standalone benchmark is to make the O2 GPU TPC reconstruction code available standalone. It provides
- external tests for people who do not have or do not want to build O2, have no access to alien for CCDB, etc.
- fast standalone tests without running O2 workflows and the overhead from CCDB.
- faster build times than rebuilding O2 for development.

# Compiling

The standalone benchmark is built as part of O2, and it can also be built standalone.

As part of O2, it is available from the normal O2 build as the executable `o2-gpu-standalone-benchmark`; GPU support is available for all GPU types supported by the O2 build.

Building it as a standalone benchmark requires several dependencies and provides more control over which features to enable / disable.
The dependencies can be taken from the system, or we can use alidist to build O2 and take the dependencies from there.

In order to do the latter, please execute:
```
cd ~/alice # or your alice folder
aliBuild build --defaults o2 O2
source O2/GPU/GPUTracking/Standalone/cmake/prepare.sh
```

Then, in order to compile the standalone tool, assuming it is installed to ~/standalone and built in ~/standalone/build, please run:
```
mkdir -p ~/standalone/build
cd ~/standalone/build
cmake -DCMAKE_INSTALL_PREFIX=../ ~/alice/O2/GPU/GPUTracking/Standalone/
nano config.cmake # edit the config file to enable / disable dependencies as needed; if cmake failed and you disabled the offending dependency, just rerun the cmake command above
make install -j32
```

You can edit certain build settings in `config.cmake`. Some of them are identical to the GPU build settings for O2, as described in O2-786, and there are plenty of additional settings to enable / disable the event display, QA, and the usage of the ROOT, FMT, etc. libraries.

This will create the `ca` binary in `~/standalone`, which is basically the same as `o2-gpu-standalone-benchmark`, but built outside of O2.

# Running

The following command lines use `./ca`; in case you use the executable from the O2 build, please replace it with `o2-gpu-standalone-benchmark`.

You can get a list of command line options via `./ca --help` and `./ca --helpall`.

In order to run, you need a dataset. See the next section for how to create one. Datasets are stored in `~/standalone/events` and are identified by their folder names. The following commands assume a test dataset named `o2-pbpb-100`.

To run on that data, the simplest command is `./ca -e o2-pbpb-100`. This will automatically use a GPU if available, trying all backends, and otherwise fall back to the CPU.
You can force using the GPU or the CPU with `-g` and `-c`.
You can select the backend via `--gpuType CUDA|HIP|OCL|OCL2`, and within a backend you can select the device number, if multiple devices exist, via `--gpuDevice i`.

The flag `--debug` (-2 to 6) enables increasingly extensive debug output, and `--debug 6` stores full data dumps of all intermediate steps to files.
`--debug 1` and above have a performance impact, since they add serialization points for debugging. For timing individual kernels, `--debug 1` prints timing information for all kernels.
An example command line would be, e.g.:
```
./ca -e o2-pbpb-100 -g --gpuType CUDA --gpuDevice 0 --debug 1
```

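For comparison, the same dataset can be processed entirely on the CPU by forcing the CPU backend (`-c`, as introduced above):
```
./ca -e o2-pbpb-100 -c
```
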
Some other noteworthy options are:
* `--display` to run the GPU event display.
* `--qa` to run a QA task on MC data.
* `--runs` and `--runs2` to run multiple iterations of the benchmark.
* `--printSettings` to print all the settings that were used.
* `--memoryStat` to print memory statistics.
* `--sync` to run with settings for online reco, and `--syncAsync` to run online reco first, and then offline reco on the produced TPC CTF data.
* `--setO2Settings` to use some defaults as they are in O2 rather than in the standalone version.
* `--PROCdoublePipeline` to enable the double-threaded pipeline for best performance (works only with multiple iterations, and not in async mode).
* `--RTCenable` to enable the run-time compilation improvements (check also `--RTCcacheOutput`).

An example for a benchmark in online mode would be:
```
./ca -e o2-pbpb-100 -g --sync --setO2Settings --PROCdoublePipeline --RTCenable --runs 10
```

# Generating a dataset

The standalone benchmark supports running on Run 2 data exported from AliRoot and on Run 3 data from O2. This document covers only the O2 case.
In O2, `o2-tpc-reco-workflow` and `o2-gpu-reco-workflow` can dump event data with the `configKeyValue` `GPU_global.dump=1;`.
This will dump the event data to the local folder; all dumped files have a `.dump` file extension. If multiple TFs/events are processed, there will be multiple `event.i.dump` files. In order to create a standalone dataset out of these, just copy all the `.dump` files to a subfolder `~/standalone/events/[FOLDERNAME]`.

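For example, collecting the dumps into a dataset folder (using the dataset name `o2-pbpb-100` from the running example above):
```
# Sketch: turn the dumped files in the current folder into a standalone dataset.
mkdir -p ~/standalone/events/o2-pbpb-100
cp *.dump ~/standalone/events/o2-pbpb-100/
```
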
Data can be dumped from raw data or from MC data, e.g. generated by the Full System Test. In the case of MC data, the MC labels are also dumped, so that they can be used in the `./ca --qa` mode.

To get a dump from simulated data, please run e.g. the FST simulation as described in O2-2633.
A simple run such as
```
DISABLE_PROCESSING=1 NEvents=5 NEventsQED=100 SHMSIZE=16000000000 $O2_ROOT/prodtests/full_system_test.sh
```
should be enough.

Afterwards, run the following command to dump the data:
```
SYNCMODE=1 CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow="GPU_global.dump=1;" WORKFLOW_DETECTORS=TPC SHMSIZE=16000000000 $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh
```

To dump standalone data from CTF raw data in `myctf.root`, you can use the same script, e.g.:
```
CTFINPUT=1 INPUT_FILE_LIST=myctf.root CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow="GPU_global.dump=1;" WORKFLOW_DETECTORS=TPC SHMSIZE=16000000000 $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh
```
