Removing TUNER compile flag #75
…rations

Updates to the GPU microbenchmark build system and hardware configurations:
- Add a `clean_GPU_Microbenchmark` target to remove compiled binaries
- Update Blackwell B200 and common GPU hardware definitions
- Fix various microbenchmark implementations across the atomics, cache, and memory tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Can you try trace and sim before we merge?
…d memory frequency
…ify Makefile to include 'cutlass' and 'cuda_samples' in ci target
The tuner script requests some changes; included here: accel-sim/accel-sim-framework@d70e3e1
@JRPan I have tested this PR with an NVIDIA-A100-PCIe-40GB and hit an assert error at line 85 of … Another issue is a warning about kernel launch latency: … Other parts work well under CUDA 12.8.0 (with CUDA Driver 565.57.01, E. Process, within a docker container). By the way, you may need to check …

UPD: I noticed that the minimal required CUDA Driver for CUDA 12.8 is 570.26 (shown here). I'm trying an older CUDA Toolkit (which may take a few days, because I'm working on RTX 4090). Sorry for that.
@JRPan After getting access permission, I have tested this PR with an NVIDIA-GeForce-RTX-4090-24GB (under Toolkit 12.4.1 with Driver 550.144.03, Default Mode, within a docker container).

First, the memory clock frequency is multiplied by `1e-3f` in three places, but the kHz-to-MHz conversion should happen exactly once:

```cpp
// GPU_Microbenchmark/hw_def/common/gpuConfig.h#L109-112
config.MAX_WARPS_PER_SM = config.MAX_THREADS_PER_SM / config.WARP_SIZE;
config.MEM_CLK_FREQUENCY = config.MEM_CLK_FREQUENCY * 1e-3f; // [1st] should multiply here; this one is correct.
config.BLOCKS_PER_SM = config.MAX_THREADS_PER_SM / config.THREADS_PER_BLOCK;
config.THREADS_PER_SM = config.BLOCKS_PER_SM * config.THREADS_PER_BLOCK;
config.TOTAL_THREADS = config.THREADS_PER_BLOCK * config.BLOCKS_NUM;

// GPU_Microbenchmark/hw_def/common/gpuConfig.h#L273-276
config.MEM_SIZE = deviceProp.totalGlobalMem;
config.MEM_CLK_FREQUENCY = deviceProp.memoryClockRate * 1e-3f; // [2nd] should NOT multiply here.
config.MEM_BITWIDTH = deviceProp.memoryBusWidth;
config.CLK_FREQUENCY = clockRateKHz * 1e-3f;

// GPU_Microbenchmark/ubench/system/system_config/system_config.cu#L21-34
std::cout << "\n//Accel_Sim config: \n";
float mem_freq_MHZ = (config.MEM_CLK_FREQUENCY * 1e-3f * 2) / // [3rd] should NOT multiply here.
                     dram_model_freq_ratio[DRAM_MODEL];
std::cout << "-gpgpu_compute_capability_major " << deviceProp.major
          << std::endl;
std::cout << "-gpgpu_compute_capability_minor " << deviceProp.minor
          << std::endl;
std::cout << "-gpgpu_n_clusters " << config.SM_NUMBER
          << std::endl;
std::cout << "-gpgpu_n_cores_per_cluster 1" << std::endl;
std::cout << "-gpgpu_clock_domains " << config.CLK_FREQUENCY << ":"
          << config.CLK_FREQUENCY << ":" << config.CLK_FREQUENCY << ":" << mem_freq_MHZ
          << std::endl;
```

Second, the resulting configuration still leads to a simulation error, as described in accel-sim-framework/issues/518. Directly reading the FBP and L2 bank counts from the CUDA Driver seems weird: … Finally, just for reference, if someone wants to test the new tuner within a docker container, you may set …
@JRPan I found that the initiation cycles are wrongly set to 0. This PR removes the internal modification for … This PR leads to … After replacing … Further, after correcting the above mistakes, I noticed that the result of dpu may be wrong: …
Yes, the first number should be larger than the second.