GitHub - Minerest/llama.cpp_RDNA2_FlashAttnEnabled: Found the happy path :)

Title RDNA2: cudaOccupancyMaxActiveBlocksPerMultiprocessor returns 0 for 256-thread fattn tile (ncols1=4), causing assertion failure during prefill

Post Hi friends, it looks like I hit the happy assertion path referenced in #16643 on my dual RDNA2 GPU setup. After bypassing this case I got a massive speed boost compared to current Vulkan binaries: generation went from less than 30 tok/s to over 50 tok/s for Qwen3.6-35B-A3B Q4_K_M! I had it writing code edits straight through the Qwen Code CLI with no problems. Super excited that this finally feels usable. I thought I would try to answer the open-ended question about the compilation details in #16633 and provide some additional data points related to this issue.

Hardware
Radeon RX 6800 (gfx1030, nsm=30)
RX 6700 XT (gfx1031 reporting as gfx1030, nsm=20)
Fedora 44, kernel 7.0.6 with CONFIG_HSA_AMD=y
ROCm 7.1.1 (distro package, /usr prefix, not TheRock)
HIP 7.1.52802-9999
HIP compiler /usr/lib64/rocm/llvm/bin/clang++ (clang 20.0.0.rocm)
llama.cpp commit cc7200bf1 (version 9166), upstream master with and without patch

Crash output

For this run, fattn-common.cuh had the upstream GGML_ASSERT(max_blocks_per_sm > 0) left intact; only the diagnostic prints were added, under #ifdef GGML_FATTN_TRACE.

[fattn-path] tile (DKQ=256, DV=256, ncols2=8, Q->ne[1]=512)
[fattn-trace] void launch_fattn(...) [DV = 256, ncols1 = 4, ncols2 = 8]
[fattn-trace]   device=0 cc=16781360 nsm=30  warp_size=32 nwarps=8  threads/block=256
[fattn-trace]   nbytes_shared=0  nbatch_fa=64  stream_k=0  need_f16_K=1 need_f16_V=1
[fattn-trace]   cudaOccupancyMaxActiveBlocksPerMultiprocessor -> max_blocks_per_sm=0
/home/ediaz/llama/rocm/llama.cpp/ggml/src/ggml-cuda/template-instances/../fattn-common.cuh:1068: GGML_ASSERT(max_blocks_per_sm > 0) failed

Backtrace (relevant frames):

ggml_abort
launch_fattn<256, 4, 8>(...)                              libggml-hip.so
ggml_cuda_flash_attn_ext_tile_case<256, 256>(...)         libggml-hip.so
ggml_cuda_graph_evaluate_and_capture(...)                 libggml-hip.so
ggml_backend_cuda_graph_compute(...)                      libggml-hip.so

Exit code 134 (SIGABRT).
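
The cc=16781360 value in the trace is easier to read once the vendor offset is removed. The following is a minimal sketch of that decoding, assuming ggml-cuda tags AMD devices by adding 0x1000000 to the architecture number (this constant is an assumption based on ggml-cuda/common.cuh and should be checked against your checkout). The names kAmdCcOffset, DecodedCc, decode_cc and format_cc are hypothetical and used only for illustration.

```cpp
#include <cstdio>
#include <string>

// ASSUMPTION: offset ggml-cuda adds to AMD architecture numbers
// (GGML_CUDA_CC_OFFSET_AMD in ggml-cuda/common.cuh); verify in your tree.
constexpr int kAmdCcOffset = 0x1000000;

// Hypothetical helper type: the decoded form of a ggml "cc" value.
struct DecodedCc {
    bool is_amd;  // true if the value carries the assumed AMD offset
    int  arch;    // architecture number with the offset removed
};

// Strip the assumed AMD offset; values below it are returned unchanged.
constexpr DecodedCc decode_cc(int cc) {
    return cc >= kAmdCcOffset ? DecodedCc{true, cc - kAmdCcOffset}
                              : DecodedCc{false, cc};
}

// Render the decoded value; AMD arch numbers are conventionally shown in hex.
std::string format_cc(int cc) {
    const DecodedCc d = decode_cc(cc);
    char buf[64];
    if (d.is_amd) {
        std::snprintf(buf, sizeof(buf), "AMD gfx%x", d.arch);
    } else {
        std::snprintf(buf, sizeof(buf), "cc %d", d.arch);
    }
    return std::string(buf);
}

// Compile-time check against the value printed in the trace above.
static_assert(decode_cc(16781360).arch == 0x1030, "trace cc should decode to 0x1030");
```

For the trace above, decode_cc(16781360) gives is_amd = true and arch = 0x1030, which is consistent with device 0 being the gfx1030 card.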

Local workaround (what's running now)

ggml/src/ggml-cuda/fattn-common.cuh around the assertion site:

- GGML_ASSERT(max_blocks_per_sm > 0);
+ if (max_blocks_per_sm <= 0) {
+     GGML_LOG_WARN("cudaOccupancyMaxActiveBlocksPerMultiprocessor returned %d, falling back to 1\n", max_blocks_per_sm);
+     max_blocks_per_sm = 1;
+ }

Patch and build:

cmake -S . -B build-instrumented \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP=ON \
  -DGPU_TARGETS="gfx1030;gfx1031" \
  -DROCM_PATH=/usr \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_HIP_FLAGS="-DGGML_FATTN_TRACE"
cmake --build build-instrumented --target llama-bench -j4

Workload	Prefill (t/s)	Decode (t/s)
-fa off (no flash attention)	~203	~54
-fa on (with fallback patch)	~1314	~60
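
The ratios implied by this table can be worked out directly. This is a small sketch using the approximate figures above; the function name speedup is hypothetical, and the inputs are the rounded values from the table, so the results are only as precise as those.

```cpp
#include <cstdio>
#include <string>

// Ratio of patched throughput to baseline throughput (both in tokens/s).
constexpr double speedup(double baseline_tps, double patched_tps) {
    return patched_tps / baseline_tps;
}

// Format one comparison line, e.g. for printing next to the table.
std::string describe(const char* phase, double baseline_tps, double patched_tps) {
    char buf[96];
    std::snprintf(buf, sizeof(buf), "%s: %.0f -> %.0f t/s (%.2fx)",
                  phase, baseline_tps, patched_tps,
                  speedup(baseline_tps, patched_tps));
    return std::string(buf);
}
```

With the rounded inputs, speedup(203, 1314) is roughly 6.5x for prefill and speedup(54, 60) is roughly 1.1x for decode, so nearly all of the gain from enabling flash attention in this comparison is in prefill.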

flash_attn_tile

Resource	<256,256,4,8> (fails)	<256,256,1,8> (works)
Threads/block	256 (8 waves)	128 (4 waves)
VGPRs/thread	203	102
TotalSGPRs	42	44
VGPR/SGPR spills	0 / 0	0 / 0
Static LDS/block	37,888 B	21,504 B
Dynamic shared (runtime)	0	0
Compiler-reported occupancy	4 waves/SIMD	6 waves/SIMD
Runtime cudaOccupancyMaxActiveBlocksPerMultiprocessor	0	2

Same kernel family (fattn-tile), same hardware, same binary. The occupancy API returns a usable value (2) for one instantiation and 0 for the other, even though the compiler reports nonzero occupancy for both.
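
As a sanity check on the resource figures, here is a back-of-envelope occupancy estimate. It is only a sketch: every hardware limit in CuLimits is an assumption about an RDNA2 compute unit running wave32 in CU mode (2 SIMDs per CU, a 1024-register VGPR budget per SIMD, a VGPR allocation granule of 16, at most 16 waves per SIMD, 64 KiB of LDS per CU), not something taken from the runtime or from llama.cpp. The names CuLimits, Kernel, Estimate and estimate_occupancy are hypothetical.

```cpp
#include <algorithm>

// ASSUMED per-CU limits for RDNA2, wave32, CU mode. Adjust if your
// hardware documentation says otherwise.
struct CuLimits {
    int wave_size          = 32;
    int simds_per_cu       = 2;
    int vgprs_per_simd     = 1024;       // VGPR budget shared by waves on one SIMD
    int vgpr_granule       = 16;         // allocation rounding per wave
    int max_waves_per_simd = 16;
    int lds_bytes_per_cu   = 64 * 1024;
};

// Per-kernel resource usage, as listed in the table above.
struct Kernel {
    int threads_per_block;
    int vgprs_per_thread;
    int lds_bytes_per_block;
};

struct Estimate {
    int blocks_per_cu;
    int waves_per_simd;
};

Estimate estimate_occupancy(const Kernel& k, const CuLimits& hw = {}) {
    // Waves needed by one block, rounded up.
    const int waves_per_block = (k.threads_per_block + hw.wave_size - 1) / hw.wave_size;
    // VGPRs actually allocated per wave, rounded up to the granule.
    const int vgprs_alloc =
        (k.vgprs_per_thread + hw.vgpr_granule - 1) / hw.vgpr_granule * hw.vgpr_granule;
    // Register-limited waves per SIMD, then whole blocks per CU.
    const int waves_by_vgpr =
        std::min(hw.max_waves_per_simd, hw.vgprs_per_simd / vgprs_alloc);
    const int blocks_by_vgpr = waves_by_vgpr * hw.simds_per_cu / waves_per_block;
    // LDS-limited blocks per CU (no limit if the kernel uses no LDS).
    const int blocks_by_lds = k.lds_bytes_per_block > 0
        ? hw.lds_bytes_per_cu / k.lds_bytes_per_block
        : blocks_by_vgpr;
    const int blocks = std::min(blocks_by_vgpr, blocks_by_lds);
    return {blocks, blocks * waves_per_block / hw.simds_per_cu};
}
```

Under these assumptions, estimate_occupancy({256, 203, 37888}) gives 1 block per CU at 4 waves/SIMD, and estimate_occupancy({128, 102, 21504}) gives 3 blocks per CU at 6 waves/SIMD, which lines up with the compiler-reported occupancy in the table. The estimate of 3 for the working case does not match the runtime's answer of 2, so this is not a model of what the HIP runtime computes; it only suggests that a result of 0 is not forced by the register and LDS figures alone.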

Field	tg8 (decode, works)	pp512 (prefill, fails)
Dispatcher	`ggml_cuda_flash_attn_ext_tile_case<256, 256>`	(same)
`launch_fattn` template	`<DV=256, ncols1=1, ncols2=8>`	`<DV=256, ncols1=4, ncols2=8>`
`Q->ne[1]`	1	512
`nwarps`	4	8
threads/block	128	256
`nbatch_fa`	32	64
`nbytes_shared` (dynamic)	0	0
`cudaOccupancyMaxActiveBlocksPerMultiprocessor` return	2	0
Outcome	runs, ~54 t/s	assert fires, SIGABRT

The only kernel-launch differences between the working and failing cases are threads/block (128 → 256) and the register pressure implied by ncols1=4. There is no dynamic shared memory request in either case.
