Conversation

JRPan (Contributor) commented Nov 26, 2025

No description provided.

JRPan and others added 4 commits November 24, 2025 16:00
…rations

Updates to GPU microbenchmark build system and hardware configurations:
- Add clean_GPU_Microbenchmark target to remove compiled binaries
- Update Blackwell B200 and common GPU hardware definitions
- Fix various microbenchmark implementations across atomics, cache, and memory tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
LAhmos (Collaborator) commented Dec 2, 2025

Can you try trace and sim before we merge?
Also, do we need to update the tuner script after these changes?

JRPan (Contributor, Author) commented Dec 9, 2025

The tuner script needed some changes; they are included here: accel-sim/accel-sim-framework@d70e3e1

Ivecia commented Dec 21, 2025

@JRPan I have tested this PR with an NVIDIA-A100-PCIe-40GB and hit an assert at line 85 of l2_config.cu. Here are the details printed before the assert fires:

mem_channel = 12
l2_banks_num = 24
l2_size_per_bank = 1747626
L2_CACHE_LINE_SIZE = 128
total_cache_lines = 13653
pow2i = 8192
L2 Associativity = 0
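
For what it's worth, the printed values are reproducible from the A100's 40 MiB L2 alone. Below is a minimal sketch assuming the sizes come from plain integer division; this is a guess at what l2_config.cu does, not the actual code:

#include <cstdio>

// Hypothetical reconstruction of the values printed above; the real
// l2_config.cu logic and the exact assert condition are not shown here.
int main() {
  const long L2_SIZE = 41943040;       // 40 MiB on A100-PCIe-40GB
  const int l2_banks_num = 24;
  const int L2_CACHE_LINE_SIZE = 128;

  long l2_size_per_bank = L2_SIZE / l2_banks_num;                  // 1747626
  long total_cache_lines = l2_size_per_bank / L2_CACHE_LINE_SIZE;  // 13653

  long pow2i = 1;                      // largest power of two <= line count
  while (pow2i * 2 <= total_cache_lines) pow2i *= 2;               // 8192

  printf("%ld %ld %ld\n", l2_size_per_bank, total_cache_lines, pow2i);
  // total_cache_lines (13653) is not a power of two (pow2i != total_cache_lines),
  // so an associativity derived by integer division against pow2i can truncate
  // to 0 and trip an assert(assoc > 0)-style check.
}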

Another issue is a warning about kernel launch latency: "The reported latency above can be slightly higher than real. For accurate evaluation use nvprof events, example: make events ./kernel_lat."

Other parts work well under CUDA 12.8.0 (CUDA Driver 565.57.01, Exclusive Process mode, within a Docker container).

By the way, you may need to check common.mk. I hit a GPUassert: the provided PTX was compiled with an unsupported toolchain error in several files. I manually removed $(CUOPTS) and added -arch=sm_80, and then everything worked.

UPD: I noticed that the minimum required CUDA driver for CUDA 12.8 is 570.26 (shown here). I'm trying an older CUDA Toolkit (which may take a few days, because I'm now working on an RTX 4090). Sorry for the delay.

CC @PrabinKuSabat

Ivecia commented Dec 22, 2025

@JRPan After getting access permission, I have tested this PR with an NVIDIA-GeForce-RTX-4090-24GB (Toolkit 12.4.1 with Driver 550.144.03, Default compute mode, within a Docker container).

First, the calculation of MEM_CLK_FREQUENCY is weird. There are several places that multiply the raw clock frequency (in kHz) by 1e-3, including:

// GPU_Microbenchmark/hw_def/common/gpuConfig.h#L109-112
config.MAX_WARPS_PER_SM = config.MAX_THREADS_PER_SM / config.WARP_SIZE;
config.MEM_CLK_FREQUENCY = config.MEM_CLK_FREQUENCY * 1e-3f;  // [1st] should multiply here, it's correct.
config.BLOCKS_PER_SM = config.MAX_THREADS_PER_SM / config.THREADS_PER_BLOCK;
config.THREADS_PER_SM = config.BLOCKS_PER_SM * config.THREADS_PER_BLOCK;
config.TOTAL_THREADS = config.THREADS_PER_BLOCK * config.BLOCKS_NUM;

// GPU_Microbenchmark/hw_def/common/gpuConfig.h#L273-276
config.MEM_SIZE = deviceProp.totalGlobalMem;
config.MEM_CLK_FREQUENCY = deviceProp.memoryClockRate * 1e-3f; // [2nd] should NOT multiply here.
config.MEM_BITWIDTH = deviceProp.memoryBusWidth;
config.CLK_FREQUENCY = clockRateKHz * 1e-3f;

// GPU_Microbenchmark/ubench/system/system_config/system_config.cu#L21-34
std::cout << "\n//Accel_Sim config: \n";

float mem_freq_MHZ = (config.MEM_CLK_FREQUENCY * 1e-3f * 2) /  // [3rd] should NOT multiply here.
                     dram_model_freq_ratio[DRAM_MODEL];
std::cout << "-gpgpu_compute_capability_major " << deviceProp.major
          << std::endl;
std::cout << "-gpgpu_compute_capability_minor " << deviceProp.minor
          << std::endl;
std::cout << "-gpgpu_n_clusters " << config.SM_NUMBER
          << std::endl;
std::cout << "-gpgpu_n_cores_per_cluster 1" << std::endl;
std::cout << "-gpgpu_clock_domains " << config.CLK_FREQUENCY << ":"
          << config.CLK_FREQUENCY << ":" << config.CLK_FREQUENCY << ":" << mem_freq_MHZ
          << std::endl;
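
To make the intended unit flow concrete, here is a small standalone sketch. It assumes deviceProp.memoryClockRate reports kHz (as the CUDA docs state), and the 4090 value below is an approximation based on its ~10501 MHz memory clock:

#include <cstdio>

// Hypothetical illustration of the repeated 1e-3 bug vs. the intended single
// kHz -> MHz conversion (values approximate an RTX 4090).
int main() {
  float memoryClockRate_kHz = 10501000.0f;

  // Buggy path: 1e-3 applied at read (gpuConfig.h L273-276), again in the
  // derived-config step (gpuConfig.h L109-112), and again in system_config.cu.
  float buggy = memoryClockRate_kHz * 1e-3f * 1e-3f;  // ~10.5, matches the
                                                      // "MEM_CLK_FREQUENCY: 10" log
  float buggy_dram_MHz = buggy * 1e-3f * 2;           // ~0.021 MHz, clearly wrong

  // Intended path: convert kHz -> MHz exactly once, then apply the DDR factor
  // (the DRAM model frequency ratio is omitted here).
  float mem_MHz = memoryClockRate_kHz * 1e-3f;        // 10501 MHz
  float dram_MHz = mem_MHz * 2;                       // ~21002 MHz

  printf("buggy: %f MHz, fixed: %f MHz\n", buggy_dram_MHz, dram_MHz);
}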

Second, the resulting configuration still leads to the simulation error described in accel-sim-framework/issues/518. Directly reading the FBP and L2 bank counts from the CUDA driver looks suspicious:

# ===== Logs =====
DEBUG GR: Successfully queried index 0x15 = 6
DEBUG GR: Successfully queried index 0x25 = 12
SM_NUMBER: 128
WARP_SIZE: 32
MAX_THREADS_PER_SM: 1536
MAX_SHARED_MEM_SIZE: 102400
MAX_WARPS_PER_SM: 48
MAX_REG_PER_SM: 65536
MAX_THREAD_BLOCK_SIZE: 1024
MAX_SHARED_MEM_SIZE_PER_BLOCK: 49152
MAX_REG_PER_BLOCK: 65536
L1_SIZE: 49152
L2_SIZE: 75497472
MEM_SIZE: 25160908800
MEM_CLK_FREQUENCY: 10
MEM_BITWIDTH: 384
CLK_FREQUENCY: 2520
THREADS_PER_BLOCK: 1024
BLOCKS_PER_SM: 1
THREADS_PER_SM: 1024
BLOCKS_NUM: 128
TOTAL_THREADS: 131072
FBP_COUNT: 6
L2_BANKS: 12
Device Name = NVIDIA GeForce RTX 4090
GPU Max Clock rate = 3 MHz
SM Count = 128
CUDA version number = 8.9

# ===== Generated Configuration =====
# high level architecture configuration
-gpgpu_n_clusters 128
-gpgpu_n_cores_per_cluster 1
-gpgpu_n_mem 6
-gpgpu_n_sub_partition_per_mchannel 2

# L2 cache
-gpgpu_cache:dl2 S:2048:128:24,L:B:m:L:P,A:192:4,32:0,32
-gpgpu_cache:dl2_texture_only 0
-gpgpu_dram_partition_queues 64:64:64:64
-gpgpu_perf_sim_memcpy 1
-gpgpu_memory_partition_indexing 0
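
As a quick cross-check (assuming the -gpgpu_cache:dl2 string is S:<sets>:<line size>:<assoc> per sub-partition, which is my reading rather than something confirmed in this thread), the generated geometry does at least add up to the reported L2_SIZE, so the issue-518 failure more likely comes from the bank/FBP counts than from total capacity:

#include <cstdio>

// Sanity check of the generated L2 geometry; the dl2-string layout above
// is an assumption, not taken from the Accel-Sim docs in this thread.
int main() {
  const long sets = 2048, line_size = 128, assoc = 24;   // S:2048:128:24
  const long n_mem = 6, sub_per_channel = 2;
  long per_sub_partition = sets * line_size * assoc;     // 6291456 B = 6 MiB
  long total = per_sub_partition * n_mem * sub_per_channel;
  printf("modeled L2 = %ld bytes\n", total);             // 75497472, matches L2_SIZE
}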

Finally, just for reference: if someone wants to test the new tuner inside a Docker container, you may need to set --gpus all --privileged=true when creating the container (not 100% sure; it sometimes reports an rm_alloc error, but that does not stop the tuner from running and producing correct results).

Ivecia commented Dec 23, 2025

@JRPan I found that the initiation interval is wrongly tuned to 0. This PR removes the internal override config.BLOCKS_NUM = 1, so config.TOTAL_THREADS = config.THREADS_PER_BLOCK * config.BLOCKS_NUM now counts threads from all blocks, which inflates the result of flops = (float)(REPEAT_TIMES * config.TOTAL_THREADS * 8) / ((float)(stopClk[0] - startClk[0])). This problem exists in many units, such as dpu, fpu, etc.
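
A minimal sketch of the inflation, with the constants taken from the logs below:

#include <cstdio>

// With the BLOCKS_NUM = 1 override removed, TOTAL_THREADS covers all blocks,
// but the clock counts in the flops formula are measured per block/SM, so
// scaling by TOTAL_THREADS overcounts by a factor of BLOCKS_NUM.
int main() {
  const int THREADS_PER_BLOCK = 1024;
  const int BLOCKS_NUM = 128;
  const int TOTAL_THREADS = THREADS_PER_BLOCK * BLOCKS_NUM;       // 131072
  printf("inflation = %d\n", TOTAL_THREADS / THREADS_PER_BLOCK);  // 128
  // Matches the two runs below: 16242.7 / 126.787 ~= 128.
}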

This PR leads to -trace_opcode_latency_initiation_int 4,0:

THREADS_PER_BLOCK: 1024
BLOCKS_PER_SM: 1
THREADS_PER_SM: 1024
BLOCKS_NUM: 128
TOTAL_THREADS: 131072
int32 FLOP per SM = 16242.728516 (flop/clk/SM)
Total Clk number = 66106
int32 latency = 4.178467 (clk)
Total Clk number = 17115
    (debug-print) Integer32 FLOPS = 16242.7
    (debug-print) Integer32 Latency = 4.17847 cycles
    (debug-print) throughput per SM = 8192
    (debug-print) throughput per sched = 2048
    (debug-print) warp size = 32

After replacing TOTAL_THREADS with THREADS_PER_BLOCK, we get -trace_opcode_latency_initiation_int 4,2 (though possibly this is not the right way to fix it):

THREADS_PER_BLOCK: 1024
BLOCKS_PER_SM: 1
THREADS_PER_SM: 1024
BLOCKS_NUM: 128
TOTAL_THREADS: 131072
int32 FLOP per SM = 126.786995 (flop/clk/SM)
Total Clk number = 66163
int32 latency = 4.173584 (clk)
Total Clk number = 17095
    (debug-print) Integer32 FLOPS = 126.787
    (debug-print) Integer32 Latency = 4.17358 cycles
    (debug-print) throughput per SM = 64
    (debug-print) throughput per sched = 16
    (debug-print) warp size = 32

Further, after correcting the mistakes above, I noticed that the dpu result may still be wrong: -trace_opcode_latency_initiation_dp 54,64.

PrabinKuSabat commented

-trace_opcode_latency_initiation_dp 54,64: this results in a segmentation fault at shader.cc:2566 while simulating the backprop-rodinia-2.0-ft app, where the start_stage index into m_pipeline_reg becomes -10.

JRPan (Contributor, Author) commented Jan 5, 2026

> -trace_opcode_latency_initiation_dp 54,64: this results in a segmentation fault at shader.cc:2566 while simulating the backprop-rodinia-2.0-ft app, where the start_stage index into m_pipeline_reg becomes -10.

Yes, the first number (the latency) should be larger than the second (the initiation interval).
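
For reference, a hypothetical sketch of how the inverted pair can drive the stage index negative (simplified; the real shader.cc logic is more involved):

#include <cassert>

// -trace_opcode_latency_initiation_dp <latency>,<initiation interval>
// If the initiation interval exceeds the latency, a stage index computed as
// latency - initiation goes negative, consistent with the start_stage = -10
// crash reported above.
int main() {
  int latency = 54, initiation_interval = 64;
  int start_stage = latency - initiation_interval;  // -10
  assert(start_stage >= 0);  // fails: indexing m_pipeline_reg would underflow
}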
