Finish project 03_nf4_dequant #41
xfarawayx wants to merge 2 commits into InfiniTensor:2025-winter-project from
Pull request overview
Adds a complete NF4 (double-quant) dequantization workflow: data generation via bitsandbytes, a CUDA kernel implementation + build/run automation, correctness verification and benchmarking, plus ports for several non-CUDA vendor stacks and an accompanying report.
Changes:
- Add Python scripts to generate NF4 test vectors, verify CUDA output vs bitsandbytes, and benchmark bitsandbytes dequant performance.
- Add CUDA implementation (kernel + host runner) with CMake build and helper profiling script.
- Add non-CUDA vendor adaptations (MACA/MUSA/Iluvatar) with build/run/verify scripts and documentation; add project README + technical report.
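For readers unfamiliar with NF4 double quantization, the arithmetic that such a kernel performs can be sketched in plain Python. This is a minimal reference under stated assumptions, not the PR's kernel: the parameter names (`packed`, `absmax_q`, `absmax2`, `code2`, `offset`), the high-nibble-first ordering, the block size of 64, and the group size of 256 are assumptions patterned on bitsandbytes conventions and are not confirmed by this page.

```python
# NF4 codebook: 16 levels in [-1, 1], as used by bitsandbytes.
NF4_TABLE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def dequant_nf4(packed, absmax_q, absmax2, code2, offset, blocksize=64, group=256):
    """Reference NF4 double-quant dequantization (all names are hypothetical).

    packed   : bytes, two 4-bit NF4 indices per byte
    absmax_q : 8-bit quantized per-block scales (one per `blocksize` elements)
    absmax2  : float scales for the quantized scales (one per `group` blocks)
    code2    : lookup table mapping an 8-bit index to a float
    offset   : float added back after second-level dequantization
    """
    out = []
    for byte in packed:
        # Assumed layout: high nibble holds the first element, low nibble the second.
        for nibble in (byte >> 4, byte & 0x0F):
            block = len(out) // blocksize
            # Second-level step: recover the per-block scale from its 8-bit code.
            scale = code2[absmax_q[block]] * absmax2[block // group] + offset
            # First-level step: look up the NF4 level and apply the scale.
            out.append(NF4_TABLE[nibble] * scale)
    return out
```

A GPU kernel would typically assign one packed byte (or a vector of them) per thread and keep the 16-entry table in fast memory; the per-element arithmetic stays the same as in this loop.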
Reviewed changes
Copilot reviewed 25 out of 27 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| 03_nf4_dequant/scripts/generate_data.py | Generates NF4-packed weights + reference outputs and writes custom binary format |
| 03_nf4_dequant/scripts/verify.py | Loads outputs and reports elementwise error metrics with exit code |
| 03_nf4_dequant/scripts/bench_bnb.py | Benchmarks bitsandbytes dequantize_4bit latency/bandwidth |
| 03_nf4_dequant/kernel/nf4_dequant_kernel.cuh | CUDA NF4 dequant kernel implementation (optimized) |
| 03_nf4_dequant/kernel/main.cu | CUDA host runner: loads binary, launches kernel, times, writes output |
| 03_nf4_dequant/kernel/CMakeLists.txt | CMake build with CUDA arch auto-detection and temp dir handling |
| 03_nf4_dequant/kernel/run_test_ncu.sh | Nsight Compute profiling helper |
| 03_nf4_dequant/run.sh | End-to-end driver script (generate/build/test/bench/all) |
| 03_nf4_dequant/kernel_noncuda/mutex/* | MACA port (kernel + runner + Makefile + docs + runner script) |
| 03_nf4_dequant/kernel_noncuda/moore/* | MUSA port (kernel + runner + Makefile + docs + runner script) |
| 03_nf4_dequant/kernel_noncuda/iluvatar/* | Iluvatar CUDA-compatible port (kernel + runner + Makefile + docs + runner script) |
| 03_nf4_dequant/docs/report.md | Detailed design + optimization report |
| 03_nf4_dequant/README.md | Top-level usage/docs for running the project |
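The verification step described for verify.py (elementwise error metrics plus a PASS/FAIL exit code) could look roughly like the sketch below. The function names, the choice of metrics, and the tolerance of 1e-2 are illustrative assumptions; the thresholds and metrics the script actually uses are not shown on this page.

```python
def error_metrics(out, ref):
    """Return elementwise error statistics between two equal-length sequences."""
    if len(out) != len(ref):
        raise ValueError("output and reference lengths differ")
    diffs = [abs(a - b) for a, b in zip(out, ref)]
    return {
        "max_abs": max(diffs),               # worst-case elementwise error
        "mean_abs": sum(diffs) / len(diffs),  # mean absolute error (MAE)
    }


def exit_code(out, ref, tol=1e-2):
    """Map the comparison to a process exit code: 0 = PASS, 1 = FAIL.

    `tol` is a hypothetical tolerance on the maximum absolute error.
    """
    return 0 if error_metrics(out, ref)["max_abs"] <= tol else 1


# Typical use at the end of a script (hypothetical):
#   import sys
#   sys.exit(exit_code(kernel_output, reference_output))
```

Returning the status through the exit code lets a driver script such as run.sh stop on failure without parsing the printed metrics.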
fcf493c to 0da1f75
Pull request overview
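This PR completes the 03_nf4_dequant project: it uses bitsandbytes to generate and match the NF4 double-quantization data format, implements and benchmarks a CUDA NF4 dequantization kernel, and provides ports for the MUSA/MACA/Iluvatar platforms together with one-command scripts for verifying correctness and comparing performance.
Changes:
- Add data generation, result verification, and bitsandbytes benchmark scripts, plus a unified end-to-end driver script, run.sh
- Add the CUDA kernel (including the build system and an NCU profiling script) and ports to domestic GPU platforms (MUSA/MACA/Iluvatar)
- Add the project README and an implementation report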
Reviewed changes
Copilot reviewed 25 out of 28 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| 03_nf4_dequant/xfarawayx/scripts/verify.py | Compares the CUDA output against the bitsandbytes reference output, reports error metrics, and signals PASS/FAIL via the exit code |
| 03_nf4_dequant/xfarawayx/scripts/generate_data.py | Generates random weights, quantizes them to NF4 (double quantization) with bitsandbytes, and exports the custom binary weight file and the reference dequantized output |
| 03_nf4_dequant/xfarawayx/scripts/bench_bnb.py | Benchmarks bitsandbytes dequantize_4bit performance and computes bandwidth |
| 03_nf4_dequant/xfarawayx/run.sh | Unified entry script that orchestrates the generate/build/test/bench/all flows |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/run_mutex.sh | One-command build/run/verify script for the MACA platform |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/nf4_dequant_kernel.maca | NF4 dequantization kernel for the MACA platform (hand-written half/bf16 bit conversion) |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/main.maca | MACA main program: reads the weight file, launches the kernel, times it, and writes the output |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/README.md | MACA port notes and usage |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/Makefile | Makefile that builds the MACA binary nf4_dequant_maca |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/moore/run_moore.sh | One-command build/run/verify script for the MUSA platform |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/moore/nf4_dequant_kernel.mu | NF4 dequantization kernel for the MUSA platform (hand-written half/bf16 bit conversion) |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/moore/main.mu | MUSA main program: reads the weight file, launches the kernel, times it, and writes the output |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/moore/README.md | MUSA port notes and usage |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/moore/Makefile | Makefile that builds the MUSA binary nf4_dequant_musa |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/run_iluvatar.sh | One-command build/run/verify script for the Iluvatar platform |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/nf4_dequant_kernel.cuh | Kernel header for the Iluvatar platform (CUDA-compatible style) |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/main.cu | Iluvatar main program (cuda*-compatible runtime) |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/README.md | Iluvatar port notes and usage |
| 03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/Makefile | Makefile that builds the Iluvatar binary nf4_dequant_iluvatar |
| 03_nf4_dequant/xfarawayx/kernel/run_test_ncu.sh | Entry point for Nsight Compute profiling |
| 03_nf4_dequant/xfarawayx/kernel/nf4_dequant_kernel.cuh | CUDA NF4 double-quantization dequantization kernel (lookup table in shared memory, vectorized loads/stores, bit-shift optimization) |
| 03_nf4_dequant/xfarawayx/kernel/main.cu | CUDA main program: reads the weights, launches the kernel, times it, and writes the output |
| 03_nf4_dequant/xfarawayx/kernel/CMakeLists.txt | CUDA build system (supports automatic GPU arch detection and TMPDIR configuration) |
| 03_nf4_dequant/xfarawayx/docs/report.md | Implementation report covering principles and performance experiment results |
| 03_nf4_dequant/xfarawayx/README.md | Project overview, directory structure, and quick start |
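Since the benchmark script is described as deriving bandwidth from latency, one plausible way to compute an effective bandwidth figure is sketched below. Which buffers bench_bnb.py actually counts is not stated here, so the byte accounting (packed weights, 8-bit block scales, float32 second-level scales, 16-bit output) and every name in the snippet are assumptions for illustration only.

```python
import math


def bytes_moved(num_elements, blocksize=64, group=256, out_bytes=2):
    """Estimate bytes read and written by one NF4 dequantization pass.

    Assumes: 4-bit packed weights, one uint8 scale per `blocksize` elements,
    one float32 second-level scale per `group` blocks, and `out_bytes` per
    output element (2 for fp16/bf16).
    """
    packed = num_elements // 2                   # two weights per byte
    absmax_q = math.ceil(num_elements / blocksize)  # uint8 block scales
    absmax2 = 4 * math.ceil(absmax_q / group)    # float32 second-level scales
    written = num_elements * out_bytes           # dequantized output
    return packed + absmax_q + absmax2 + written


def effective_bandwidth_gbs(num_elements, latency_ms, **kwargs):
    """Effective bandwidth in GB/s (10^9 bytes per second)."""
    return bytes_moved(num_elements, **kwargs) / (latency_ms * 1e-3) / 1e9
```

Comparing this figure against the device's peak memory bandwidth gives a rough sense of how close a memory-bound kernel is to the hardware limit.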
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
No description provided.