Conversation

JRPan (Contributor) commented Nov 26, 2025

No description provided.

JRPan and others added 4 commits November 24, 2025 16:00
…rations

Updates to GPU microbenchmark build system and hardware configurations:
- Add clean_GPU_Microbenchmark target to remove compiled binaries
- Update Blackwell B200 and common GPU hardware definitions
- Fix various microbenchmark implementations across atomics, cache, and memory tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
LAhmos (Collaborator) commented Dec 2, 2025

Can you try trace and sim before we merge?
Also, do we need to update the tuner script after these changes?

JRPan (Contributor, Author) commented Dec 9, 2025

The tuner script needed some changes; they are included here: accel-sim/accel-sim-framework@d70e3e1

Ivecia commented Dec 21, 2025

@JRPan I have tested this PR with an NVIDIA-A100-PCIe-40GB and hit an assert at line 85 of l2_config.cu. Here are the details printed before the assert fires:

mem_channel = 12
l2_banks_num = 24
l2_size_per_bank = 1747626
L2_CACHE_LINE_SIZE = 128
total_cache_lines = 13653
pow2i = 8192
L2 Associativity = 0
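
For what it's worth, the printed values are reproducible from the A100's 40 MiB L2 alone. Below is a minimal sketch assuming the sizes come from plain integer division; this is a guess at what l2_config.cu does, not the actual code:

#include <cstdio>

// Hypothetical reconstruction of the values printed above; the real
// l2_config.cu logic and the exact assert condition are not shown here.
int main() {
  const long L2_SIZE = 41943040;       // 40 MiB on A100-PCIe-40GB
  const int l2_banks_num = 24;
  const int L2_CACHE_LINE_SIZE = 128;

  long l2_size_per_bank = L2_SIZE / l2_banks_num;                  // 1747626
  long total_cache_lines = l2_size_per_bank / L2_CACHE_LINE_SIZE;  // 13653

  long pow2i = 1;                      // largest power of two <= line count
  while (pow2i * 2 <= total_cache_lines) pow2i *= 2;               // 8192

  printf("%ld %ld %ld\n", l2_size_per_bank, total_cache_lines, pow2i);
  // total_cache_lines (13653) is not a power of two (pow2i != total_cache_lines),
  // so an associativity derived by integer division against pow2i can truncate
  // to 0 and trip an assert(assoc > 0)-style check.
}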

Another issue is a warning about kernel launch latency: "The reported latency above can be slightly higher than real. For accurate evaluation use nvprof events, example: make events ./kernel_lat."

Other parts work well under CUDA 12.8.0 (CUDA Driver 565.57.01, Exclusive Process mode, within a Docker container).

By the way, you may need to check common.mk. I hit a GPUassert: the provided PTX was compiled with an unsupported toolchain error in several files. I manually removed $(CUOPTS) and added -arch=sm_80, and then everything worked.

UPD: I noticed that the minimum required CUDA driver for CUDA 12.8 is 570.26 (shown here). I'm trying an older CUDA Toolkit (which may take a few days, because I'm now working on an RTX 4090). Sorry for the delay.

CC @PrabinKuSabat

Ivecia commented Dec 22, 2025

@JRPan After getting access permission, I have tested this PR with an NVIDIA-GeForce-RTX-4090-24GB (Toolkit 12.4.1 with Driver 550.144.03, Default compute mode, within a Docker container).

First, the calculation of MEM_CLK_FREQUENCY is weird. There are several places that multiply the raw clock frequency (in kHz) by 1e-3, including:

// GPU_Microbenchmark/hw_def/common/gpuConfig.h#L109-112
config.MAX_WARPS_PER_SM = config.MAX_THREADS_PER_SM / config.WARP_SIZE;
config.MEM_CLK_FREQUENCY = config.MEM_CLK_FREQUENCY * 1e-3f;  // [1st] should multiply here, it's correct.
config.BLOCKS_PER_SM = config.MAX_THREADS_PER_SM / config.THREADS_PER_BLOCK;
config.THREADS_PER_SM = config.BLOCKS_PER_SM * config.THREADS_PER_BLOCK;
config.TOTAL_THREADS = config.THREADS_PER_BLOCK * config.BLOCKS_NUM;

// GPU_Microbenchmark/hw_def/common/gpuConfig.h#L273-276
config.MEM_SIZE = deviceProp.totalGlobalMem;
config.MEM_CLK_FREQUENCY = deviceProp.memoryClockRate * 1e-3f; // [2nd] should NOT multiply here.
config.MEM_BITWIDTH = deviceProp.memoryBusWidth;
config.CLK_FREQUENCY = clockRateKHz * 1e-3f;

// GPU_Microbenchmark/ubench/system/system_config/system_config.cu#L21-34
std::cout << "\n//Accel_Sim config: \n";

float mem_freq_MHZ = (config.MEM_CLK_FREQUENCY * 1e-3f * 2) /  // [3rd] should NOT multiply here.
                     dram_model_freq_ratio[DRAM_MODEL];
std::cout << "-gpgpu_compute_capability_major " << deviceProp.major
          << std::endl;
std::cout << "-gpgpu_compute_capability_minor " << deviceProp.minor
          << std::endl;
std::cout << "-gpgpu_n_clusters " << config.SM_NUMBER
          << std::endl;
std::cout << "-gpgpu_n_cores_per_cluster 1" << std::endl;
std::cout << "-gpgpu_clock_domains " << config.CLK_FREQUENCY << ":"
          << config.CLK_FREQUENCY << ":" << config.CLK_FREQUENCY << ":" << mem_freq_MHZ
          << std::endl;
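
To make the intended unit flow concrete, here is a small standalone sketch. It assumes deviceProp.memoryClockRate reports kHz (as the CUDA docs state), and the 4090 value below is an approximation based on its ~10501 MHz memory clock:

#include <cstdio>

// Hypothetical illustration of the repeated 1e-3 bug vs. the intended single
// kHz -> MHz conversion (values approximate an RTX 4090).
int main() {
  float memoryClockRate_kHz = 10501000.0f;

  // Buggy path: 1e-3 applied at read (gpuConfig.h L273-276), again in the
  // derived-config step (gpuConfig.h L109-112), and again in system_config.cu.
  float buggy = memoryClockRate_kHz * 1e-3f * 1e-3f;  // ~10.5, matches the
                                                      // "MEM_CLK_FREQUENCY: 10" log
  float buggy_dram_MHz = buggy * 1e-3f * 2;           // ~0.021 MHz, clearly wrong

  // Intended path: convert kHz -> MHz exactly once, then apply the DDR factor
  // (the DRAM model frequency ratio is omitted here).
  float mem_MHz = memoryClockRate_kHz * 1e-3f;        // 10501 MHz
  float dram_MHz = mem_MHz * 2;                       // ~21002 MHz

  printf("buggy: %f MHz, fixed: %f MHz\n", buggy_dram_MHz, dram_MHz);
}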

Second, the resulting configuration still leads to the simulation error described in accel-sim-framework/issues/518. Directly reading the FBP and L2 bank counts from the CUDA driver looks suspicious:

# ===== Logs =====
DEBUG GR: Successfully queried index 0x15 = 6
DEBUG GR: Successfully queried index 0x25 = 12
SM_NUMBER: 128
WARP_SIZE: 32
MAX_THREADS_PER_SM: 1536
MAX_SHARED_MEM_SIZE: 102400
MAX_WARPS_PER_SM: 48
MAX_REG_PER_SM: 65536
MAX_THREAD_BLOCK_SIZE: 1024
MAX_SHARED_MEM_SIZE_PER_BLOCK: 49152
MAX_REG_PER_BLOCK: 65536
L1_SIZE: 49152
L2_SIZE: 75497472
MEM_SIZE: 25160908800
MEM_CLK_FREQUENCY: 10
MEM_BITWIDTH: 384
CLK_FREQUENCY: 2520
THREADS_PER_BLOCK: 1024
BLOCKS_PER_SM: 1
THREADS_PER_SM: 1024
BLOCKS_NUM: 128
TOTAL_THREADS: 131072
FBP_COUNT: 6
L2_BANKS: 12
Device Name = NVIDIA GeForce RTX 4090
GPU Max Clock rate = 3 MHz
SM Count = 128
CUDA version number = 8.9

# ===== Generated Configuration =====
# high level architecture configuration
-gpgpu_n_clusters 128
-gpgpu_n_cores_per_cluster 1
-gpgpu_n_mem 6
-gpgpu_n_sub_partition_per_mchannel 2

# L2 cache
-gpgpu_cache:dl2 S:2048:128:24,L:B:m:L:P,A:192:4,32:0,32
-gpgpu_cache:dl2_texture_only 0
-gpgpu_dram_partition_queues 64:64:64:64
-gpgpu_perf_sim_memcpy 1
-gpgpu_memory_partition_indexing 0
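
As a quick cross-check (assuming the -gpgpu_cache:dl2 string is S:<sets>:<line size>:<assoc> per sub-partition, which is my reading rather than something confirmed in this thread), the generated geometry does at least add up to the reported L2_SIZE, so the issue-518 failure more likely comes from the bank/FBP counts than from total capacity:

#include <cstdio>

// Sanity check of the generated L2 geometry; the dl2-string layout above
// is an assumption, not taken from the Accel-Sim docs in this thread.
int main() {
  const long sets = 2048, line_size = 128, assoc = 24;   // S:2048:128:24
  const long n_mem = 6, sub_per_channel = 2;
  long per_sub_partition = sets * line_size * assoc;     // 6291456 B = 6 MiB
  long total = per_sub_partition * n_mem * sub_per_channel;
  printf("modeled L2 = %ld bytes\n", total);             // 75497472, matches L2_SIZE
}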

Finally, just for reference: if someone wants to test the new tuner inside a Docker container, you may need to set --gpus all --privileged=true when creating the container (not 100% sure; it sometimes reports an rm_alloc error, but that does not stop the tuner from running and producing correct results).

Ivecia commented Dec 23, 2025

@JRPan I found that the initiation interval is wrongly tuned to 0. This PR removes the internal override config.BLOCKS_NUM = 1, so config.TOTAL_THREADS = config.THREADS_PER_BLOCK * config.BLOCKS_NUM now counts threads from all blocks, which inflates the result of flops = (float)(REPEAT_TIMES * config.TOTAL_THREADS * 8) / ((float)(stopClk[0] - startClk[0])). This problem exists in many units, such as dpu, fpu, etc.
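
A minimal sketch of the inflation, with the constants taken from the logs below:

#include <cstdio>

// With the BLOCKS_NUM = 1 override removed, TOTAL_THREADS covers all blocks,
// but the clock counts in the flops formula are measured per block/SM, so
// scaling by TOTAL_THREADS overcounts by a factor of BLOCKS_NUM.
int main() {
  const int THREADS_PER_BLOCK = 1024;
  const int BLOCKS_NUM = 128;
  const int TOTAL_THREADS = THREADS_PER_BLOCK * BLOCKS_NUM;       // 131072
  printf("inflation = %d\n", TOTAL_THREADS / THREADS_PER_BLOCK);  // 128
  // Matches the two runs below: 16242.7 / 126.787 ~= 128.
}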

This PR leads to -trace_opcode_latency_initiation_int 4,0:

THREADS_PER_BLOCK: 1024
BLOCKS_PER_SM: 1
THREADS_PER_SM: 1024
BLOCKS_NUM: 128
TOTAL_THREADS: 131072
int32 FLOP per SM = 16242.728516 (flop/clk/SM)
Total Clk number = 66106
int32 latency = 4.178467 (clk)
Total Clk number = 17115
    (debug-print) Integer32 FLOPS = 16242.7
    (debug-print) Integer32 Latency = 4.17847 cycles
    (debug-print) throughput per SM = 8192
    (debug-print) throughput per sched = 2048
    (debug-print) warp size = 32

After replacing TOTAL_THREADS with THREADS_PER_BLOCK, we get -trace_opcode_latency_initiation_int 4,2 (though possibly this is not the right way to fix it):

THREADS_PER_BLOCK: 1024
BLOCKS_PER_SM: 1
THREADS_PER_SM: 1024
BLOCKS_NUM: 128
TOTAL_THREADS: 131072
int32 FLOP per SM = 126.786995 (flop/clk/SM)
Total Clk number = 66163
int32 latency = 4.173584 (clk)
Total Clk number = 17095
    (debug-print) Integer32 FLOPS = 126.787
    (debug-print) Integer32 Latency = 4.17358 cycles
    (debug-print) throughput per SM = 64
    (debug-print) throughput per sched = 16
    (debug-print) warp size = 32

Further, after correcting the mistakes above, I noticed that the dpu result may still be wrong: -trace_opcode_latency_initiation_dp 54,64.

PrabinKuSabat commented

-trace_opcode_latency_initiation_dp 54,64: this results in a segmentation fault at shader.cc:2566 while simulating the backprop-rodinia-2.0-ft app, where the start_stage index into m_pipeline_reg becomes -10.

JRPan (Contributor, Author) commented Jan 5, 2026

> -trace_opcode_latency_initiation_dp 54,64: this results in a segmentation fault at shader.cc:2566 while simulating the backprop-rodinia-2.0-ft app, where the start_stage index into m_pipeline_reg becomes -10.

Yes, the first number (the latency) should be larger than the second (the initiation interval).
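
For reference, a hypothetical sketch of how the inverted pair can drive the stage index negative (simplified; the real shader.cc logic is more involved):

#include <cassert>

// -trace_opcode_latency_initiation_dp <latency>,<initiation interval>
// If the initiation interval exceeds the latency, a stage index computed as
// latency - initiation goes negative, consistent with the start_stage = -10
// crash reported above.
int main() {
  int latency = 54, initiation_interval = 64;
  int start_stage = latency - initiation_interval;  // -10
  assert(start_stage >= 0);  // fails: indexing m_pipeline_reg would underflow
}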
