Finish project 03_nf4_dequant #41

Open
xfarawayx wants to merge 2 commits into InfiniTensor:2025-winter-project from xfarawayx:2025-winter-project

@xfarawayx

No description provided.

Copilot AI review requested due to automatic review settings March 15, 2026 03:49
Copilot AI left a comment

Pull request overview

Adds a complete NF4 (double-quant) dequantization workflow: data generation via bitsandbytes, a CUDA kernel implementation + build/run automation, correctness verification and benchmarking, plus ports for several non-CUDA vendor stacks and an accompanying report.

Changes:

  • Add Python scripts to generate NF4 test vectors, verify CUDA output vs bitsandbytes, and benchmark bitsandbytes dequant performance.
  • Add CUDA implementation (kernel + host runner) with CMake build and helper profiling script.
  • Add non-CUDA vendor adaptations (MACA/MUSA/Iluvatar) with build/run/verify scripts and documentation; add project README + technical report.
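
As a rough illustration of what NF4 double-quant dequantization computes, here is a minimal pure-Python sketch. It is not the PR's kernel or its binary format. The 16-entry table matches the NF4 codebook used by bitsandbytes, but the function name, the argument names (`absmax_q`, `code2`, `absmax2`, `offset`), the block sizes, and the high-nibble-first packing order are assumptions made for this example.

```python
# NF4 codebook (16 levels in [-1, 1]), as used by bitsandbytes.
NF4_CODE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def dequant_nf4_double(packed, absmax_q, code2, absmax2, offset,
                       blocksize=64, group=256):
    """Hypothetical reference dequantizer for NF4 with double quantization.

    packed   : bytes, two 4-bit NF4 indices per byte (high nibble first; assumed)
    absmax_q : per-block scale indices (uint8) into the second-level table
    code2    : second-level lookup table for the quantized scales
    absmax2  : one float scale per `group` consecutive blocks
    offset   : constant added back to every reconstructed block scale
    """
    out = []
    for i, byte in enumerate(packed):
        for k, nibble in enumerate((byte >> 4, byte & 0x0F)):
            idx = 2 * i + k                  # flat element index
            blk = idx // blocksize           # first-level block of this element
            # Second-level dequantization: recover the block's absmax.
            scale = code2[absmax_q[blk]] * absmax2[blk // group] + offset
            # First-level dequantization: NF4 level times block absmax.
            out.append(NF4_CODE[nibble] * scale)
    return out
```

For example, `dequant_nf4_double(bytes([0xF0, 0x7F]), [2], [0.0, 0.5, 1.0], [2.0], 0.5)` reconstructs a block scale of 2.5 and yields `[2.5, -2.5, 0.0, 2.5]`.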

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `03_nf4_dequant/scripts/generate_data.py` | Generates NF4-packed weights + reference outputs and writes custom binary format |
| `03_nf4_dequant/scripts/verify.py` | Loads outputs and reports elementwise error metrics with exit code |
| `03_nf4_dequant/scripts/bench_bnb.py` | Benchmarks bitsandbytes dequantize_4bit latency/bandwidth |
| `03_nf4_dequant/kernel/nf4_dequant_kernel.cuh` | CUDA NF4 dequant kernel implementation (optimized) |
| `03_nf4_dequant/kernel/main.cu` | CUDA host runner: loads binary, launches kernel, times, writes output |
| `03_nf4_dequant/kernel/CMakeLists.txt` | CMake build with CUDA arch auto-detection and temp dir handling |
| `03_nf4_dequant/kernel/run_test_ncu.sh` | Nsight Compute profiling helper |
| `03_nf4_dequant/run.sh` | End-to-end driver script (generate/build/test/bench/all) |
| `03_nf4_dequant/kernel_noncuda/mutex/*` | MACA port (kernel + runner + Makefile + docs + runner script) |
| `03_nf4_dequant/kernel_noncuda/moore/*` | MUSA port (kernel + runner + Makefile + docs + runner script) |
| `03_nf4_dequant/kernel_noncuda/iluvatar/*` | Iluvatar CUDA-compatible port (kernel + runner + Makefile + docs + runner script) |
| `03_nf4_dequant/docs/report.md` | Detailed design + optimization report |
| `03_nf4_dequant/README.md` | Top-level usage/docs for running the project |
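
To make "elementwise error metrics with exit code" concrete, the following is a minimal sketch of that kind of check. It is not the contents of `verify.py`. The function names, the choice of metrics, and the tolerance `max_abs_tol` are hypothetical.

```python
def error_metrics(output, reference):
    """Return max and mean absolute elementwise error between two sequences."""
    if len(output) != len(reference) or not output:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_err = [abs(o - r) for o, r in zip(output, reference)]
    return {"max_abs": max(abs_err), "mean_abs": sum(abs_err) / len(abs_err)}


def exit_code(output, reference, max_abs_tol=1e-2):
    """Map the comparison to a process exit code: 0 for PASS, 1 for FAIL.

    The tolerance is an assumed value chosen for illustration only.
    """
    metrics = error_metrics(output, reference)
    return 0 if metrics["max_abs"] <= max_abs_tol else 1
```

A driver script could then end with `sys.exit(exit_code(kernel_output, reference_output))`, so that a shell wrapper can branch on the result.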

Copilot AI left a comment

Pull request overview

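This PR completes the 03_nf4_dequant project: it uses bitsandbytes to generate data in, and stay aligned with, the NF4 double-quantization format; implements and benchmarks a CUDA NF4 dequantization kernel; and provides ports for the MUSA, MACA, and Iluvatar platforms together with one-command scripts, making it easy to verify correctness and compare performance.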

Changes:

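  • Adds data-generation, result-verification, and bitsandbytes benchmark scripts, plus a unified end-to-end driver script, run.sh
  • Adds the CUDA kernel (with build system and NCU profiling script) and ports for Chinese domestic GPU platforms (MUSA/MACA/Iluvatar)
  • Adds the project README and implementation report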
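
Since the benchmark reports bandwidth alongside latency, a minimal sketch of one way to derive an effective bandwidth figure may help. How `bench_bnb.py` actually counts bytes is not shown in this PR page. The accounting below (packed weights, quantized scales, and second-level scales read; 16-bit output written; lookup tables ignored) and all parameter names are assumptions.

```python
def effective_bandwidth_gbs(n_elements, latency_ms,
                            blocksize=64, group=256, out_bytes=2):
    """Estimate effective memory bandwidth in GB/s for one dequant call.

    Assumed traffic model (illustrative only):
      read  n/2 bytes of packed 4-bit weights
      read  one uint8 quantized absmax per block
      read  one float32 second-level scale per `group` blocks
      write `out_bytes` per output element (2 for fp16/bf16)
    """
    n_blocks = -(-n_elements // blocksize)      # ceiling division
    n_groups = -(-n_blocks // group)
    total_bytes = (n_elements // 2              # packed NF4 weights
                   + n_blocks                   # uint8 absmax indices
                   + 4 * n_groups               # float32 second-level scales
                   + out_bytes * n_elements)    # dequantized output
    return total_bytes / (latency_ms * 1e-3) / 1e9
```

Under this model, 2^20 elements dequantized in 1 ms move 2,638,080 bytes, which works out to about 2.64 GB/s.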

Reviewed changes

Copilot reviewed 25 out of 28 changed files in this pull request and generated 11 comments.

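| File | Description |
| --- | --- |
| `03_nf4_dequant/xfarawayx/scripts/verify.py` | Compares the CUDA output against the bitsandbytes reference output, reports error metrics, and signals PASS/FAIL via the exit code |
| `03_nf4_dequant/xfarawayx/scripts/generate_data.py` | Generates random weights, quantizes them to NF4 (double quantization) with bitsandbytes, and exports the custom binary weight file and the reference dequantized output |
| `03_nf4_dequant/xfarawayx/scripts/bench_bnb.py` | Benchmarks bitsandbytes dequantize_4bit performance and computes bandwidth |
| `03_nf4_dequant/xfarawayx/run.sh` | Unified entry script that orchestrates the generate/build/test/bench/all workflows |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/run_mutex.sh` | One-command build/run/verify script for the MACA platform |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/nf4_dequant_kernel.maca` | NF4 dequantization kernel for the MACA platform (hand-written half/bf16 bit conversion) |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/main.maca` | Main program for the MACA platform: reads the weight file, launches the kernel, times it, and writes the output |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/README.md` | MACA port notes and usage |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/mutex/Makefile` | Makefile that builds the MACA binary nf4_dequant_maca |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/moore/run_moore.sh` | One-command build/run/verify script for the MUSA platform |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/moore/nf4_dequant_kernel.mu` | NF4 dequantization kernel for the MUSA platform (hand-written half/bf16 bit conversion) |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/moore/main.mu` | Main program for the MUSA platform: reads the weight file, launches the kernel, times it, and writes the output |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/moore/README.md` | MUSA port notes and usage |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/moore/Makefile` | Makefile that builds the MUSA binary nf4_dequant_musa |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/run_iluvatar.sh` | One-command build/run/verify script for the Iluvatar platform |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/nf4_dequant_kernel.cuh` | Kernel header for the Iluvatar platform (CUDA-compatible style) |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/main.cu` | Main program for the Iluvatar platform (`cuda*`-compatible runtime) |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/README.md` | Iluvatar port notes and usage |
| `03_nf4_dequant/xfarawayx/kernel_noncuda/iluvatar/Makefile` | Makefile that builds the Iluvatar binary nf4_dequant_iluvatar |
| `03_nf4_dequant/xfarawayx/kernel/run_test_ncu.sh` | Entry script for Nsight Compute profiling |
| `03_nf4_dequant/xfarawayx/kernel/nf4_dequant_kernel.cuh` | CUDA dequantization kernel for double-quantized NF4 (shared-memory lookup table, vectorized loads/stores, bit-shift optimizations) |
| `03_nf4_dequant/xfarawayx/kernel/main.cu` | CUDA main program: reads the weights, launches the kernel, times it, and writes the output |
| `03_nf4_dequant/xfarawayx/kernel/CMakeLists.txt` | CUDA build system (supports GPU arch auto-detection and TMPDIR configuration) |
| `03_nf4_dequant/xfarawayx/docs/report.md` | Project implementation report, covering the underlying principles and performance results |
| `03_nf4_dequant/xfarawayx/README.md` | Project overview, directory layout, and quick start |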
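
The MACA and MUSA kernels are described as using hand-written half/bf16 bit conversion. Their code is not shown here, so the following is only a sketch of what a manual float32-to-bf16 conversion typically looks like, written in Python for readability. The function names are hypothetical, and round-to-nearest-even is an assumed rounding mode; the PR's kernels may truncate or round differently.

```python
import struct


def f32_to_bf16_bits(x):
    """Convert a float to bf16 bits (as an int), rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep the sign/exponent and force a quiet-NaN mantissa bit.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so exact ties round to even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(h):
    """Expand bf16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def f32_to_f16_bits(x):
    """Convert a float to IEEE half bits using struct's native 'e' format."""
    return struct.unpack("<H", struct.pack("<e", x))[0]
```

bf16 keeps the float32 exponent and only shortens the mantissa, which is why the conversion reduces to a rounded 16-bit shift; half precision changes the exponent width as well, so a hand-written version needs explicit exponent rebasing and subnormal handling, which this sketch delegates to `struct`.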

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>