NPUKernelBench V2.0: An AscendC Operator Generation and Evaluation Framework with Domain-Knowledge-Injected Models

[License](LICENSE)

1. Introduction

NPUKernelBench V2.0 is a framework designed for Huawei Ascend NPUs, with a particular focus on large-model-driven operator generation and systematic evaluation. It provides a comprehensive benchmarking pipeline for AscendC operators automatically generated by large language models (LLMs).

In addition to the framework itself, we also release a specialized LLM trained on high-quality Chain-of-Thought (CoT) data deeply integrated with AscendC domain knowledge and programming paradigms. Through this training process, the model learns to emulate the design reasoning of operator development engineers under hardware constraints, substantially improving its capability for operator synthesis. Given natural language requirements or functional descriptions, the model can directly generate more than 50 usable AscendC kernels, representing a 400% increase over the feasibility-oriented V1.0 release, while also achieving marked improvements in code quality and practical usability.

2. Core Features

  • Standardized hierarchical task suite: The benchmark provides a collection of operator tasks covering multiple difficulty levels and application scenarios; details are described in the benchmark design section.
  • LLM-driven code generation: The framework includes built-in modules for interaction with LLMs, enabling the interpretation of task requirements and the automated generation of complete AscendC operator implementations for NPUs.
  • Automated evaluation pipeline: A full set of automation scripts is provided for large-scale management of operator compilation, numerical correctness validation, and performance benchmarking.
  • Accurate verification and scoring mechanism: Each task is accompanied by a PyTorch reference implementation and detailed test cases to ensure reliable and reproducible evaluation.
  • Comprehensive documentation: The project includes extensive documentation covering installation, task specifications, evaluation rules, and LLM usage; see the docs directory for details.

3. Benchmark Design Principles

To systematically evaluate the implementation quality of different operators, the benchmark is carefully designed from two complementary perspectives: task construction and evaluation methodology.

3.1 Task Organization and Categorization

All benchmark tasks are stored under the /tasks directory and follow the hierarchical structure level/Category/OperatorName.

Difficulty Levels:

  • level1: Basic operators. These are typically single-input, single-output operators without complex attributes, such as Sqrt and Equal.
  • level2: Common composite operators. These involve multiple inputs, more sophisticated computation logic, or fused execution patterns, such as AddLayerNorm and GeluGrad.
  • level3: Advanced complex operators. These often require dynamic shapes, complex parallel logic, or specialized data orchestration strategies, such as TopKV3 and BasicMatmul.

Internal Structure: Each operator task contains three subdirectories: question (problem description and code templates), answer (golden reference implementation), and validation (validation utilities and test data).
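The layout described above can be sketched in a few lines of Python. This is only an illustration: `task_dir` and `task_layout` are hypothetical helpers, not functions shipped with the framework, and the example path reuses the Sqrt task named elsewhere in this README.

```python
from pathlib import PurePosixPath

# Subdirectories that each operator task is described as containing.
TASK_SUBDIRS = ("question", "answer", "validation")


def task_dir(level, category, operator, root="tasks"):
    """Build the level/Category/OperatorName path for one task (hypothetical helper)."""
    return PurePosixPath(root) / level / category / operator


def task_layout(level, category, operator):
    """Map each expected subdirectory name to its path under the task directory."""
    base = task_dir(level, category, operator)
    return {name: base / name for name in TASK_SUBDIRS}


if __name__ == "__main__":
    for name, path in task_layout("level1", "Math", "Sqrt").items():
        print(f"{name:<11}{path}")
```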

👉 For further details, please refer to the task design specification.

3.2 Evaluation and Scoring Criteria

The evaluation pipeline covers the three core stages of operator development:

  1. Compilation correctness: Whether the generated code can be successfully compiled by the CANN toolchain.
  2. Numerical correctness: Whether the output of the NPU operator stays within an acceptable error tolerance of the PyTorch reference implementation for the same inputs. As a preliminary baseline, the core numerical tolerance is 2‰ (0.002).
  3. Performance: Subject to correctness constraints, the runtime efficiency of the operator on the NPU is measured. The principal performance metrics are latency and normalized computational throughput in FLOPS.
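As a rough illustration of the second and third criteria, the sketch below checks a 2‰ tolerance and derives throughput from latency. It rests on assumptions: the README does not state the exact error metric, so an element-wise relative error is used here, and `within_tolerance`, `throughput_flops`, and `abs_floor` are hypothetical names. The normalization applied to the reported FLOPS is likewise not specified and is omitted.

```python
import math


def within_tolerance(actual, expected, rel_tol=0.002, abs_floor=1e-12):
    """Return True if every element's relative error is at most rel_tol.

    rel_tol=0.002 mirrors the 2 per mille criterion. Treating it as an
    element-wise relative error is an assumption, and abs_floor is a
    hypothetical guard for reference values that are exactly zero.
    """
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        if math.isnan(a) or math.isnan(e):
            return False
        if abs(a - e) > rel_tol * max(abs(e), abs_floor):
            return False
    return True


def throughput_flops(flop_count, latency_seconds):
    """Throughput in FLOPS: floating-point operations divided by latency."""
    return flop_count / latency_seconds


if __name__ == "__main__":
    reference = [1.0, 2.0, 4.0]
    candidate = [1.001, 2.001, 4.001]
    print("within tolerance:", within_tolerance(candidate, reference))
    print("throughput:", throughput_flops(2e9, 0.5), "FLOPS")
```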

👉 For additional details, please refer to the evaluation and scoring specification.

4. Quick Start

4.1 Environment Setup

  1. Clone the repository

    git clone https://github.com/weich97/NPUKernelBench.git
    cd NPUKernelBench
  2. Set up the framework environment

    source set_framework_env.sh
  3. Install Python dependencies

    pip install -r requirements.txt
  4. Install vLLM-Ascend

    Please refer to the official vLLM-Ascend installation guide: https://docs.vllm.ai/projects/ascend/en/latest/installation.html. vllm 0.7.3-dev or later is required.

  5. Launch the vLLM service

    Modify the model checkpoint path in base_config.yaml to the actual path of your model weights, and then start the vLLM process:

    nohup bash start_vllm_server.sh > vllm_server.log 2>&1 &

    👉 For startup instructions, please refer to the vLLM service launch guide.

4.2 Example Run

Using the Sqrt and SwiGlu operators as examples, the following command runs the complete evaluation pipeline; the -chat flag selects LLM-generated code as the implementation under test:

python run_multi_test.py -chat -task_name Sqrt SwiGlu

By inspecting the terminal output, users can observe the full pipeline, including model-based generation, compilation, and numerical validation for both operators.

👉 For a more detailed walkthrough, please refer to the getting started guide.

5. Typical Workflow

A typical workflow for using NPUKernelBench is illustrated below:

[Figure: NPUKernelBench.png, workflow overview]

5.1 Select a Task

Select one or more operator tasks from the tasks directory, for example, tasks/level1/Math/Sqrt.

5.2 Obtain an Operator Implementation

Option 1: Automatic generation with a large language model (LLM)

Run the code generation script to allow the LLM to automatically produce the kernel implementation:

python run_multi_test.py -chat -task_name Sqrt -stages code_gen

The generated code is saved under runs/msopgen/lvl1/Math/Sqrt/fixed_case_0/samplex, where x is the index of the sampled result produced by the model. By default, the system invokes the model 100 times to generate kernel candidates, using static-shape testing mode and a kernel-only template.
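To inspect what was generated, the candidate directories can be enumerated with a short script. This is a minimal sketch, not part of the framework: `list_sample_dirs` is a hypothetical helper, and matching on the `sample*` prefix is an assumption based on the directory naming described above.

```python
from pathlib import Path


def list_sample_dirs(case_dir):
    """Return the sample* subdirectories of one case directory, sorted by name."""
    root = Path(case_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob("sample*") if p.is_dir())


if __name__ == "__main__":
    # Path taken from the text above; adjust it to the task that was generated.
    case_dir = "runs/msopgen/lvl1/Math/Sqrt/fixed_case_0"
    samples = list_sample_dirs(case_dir)
    print(f"found {len(samples)} candidate(s) under {case_dir}")
    for sample in samples:
        print(" ", sample.name)
```

The sort is lexicographic, so a directory named sample10 would be listed before sample2.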

👉 For detailed usage of the LLM functionality, please refer to the LLM kernel generation guide.

Option 2: Manual development

Users may read api_desc.md and the code templates in question/ under the task directory, and then manually complete the operator logic in op_kernel/*.cpp.

5.3 Execute Evaluation

Run the main script run_multi_test.py, specify the target task and code path, and launch the fully automated evaluation pipeline.

  • Evaluate LLM-generated code (including compilation and numerical validation):

    python run_multi_test.py -chat -task_name Sqrt -stages compile precision
  • Evaluate manually written code (including compilation and numerical validation):

    python run_multi_test.py -task_name Sqrt -stages compile precision
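When several tasks or stage combinations need to be scripted, the command lines above can be assembled programmatically. The sketch below uses only the flags that appear in this README (-chat, -task_name, -stages); `build_command` and `run` are hypothetical wrappers, and the script is assumed to be launched from the repository root.

```python
import subprocess


def build_command(task_names, use_llm=True, stages=None):
    """Assemble the argv list for run_multi_test.py from the flags shown above."""
    if not task_names:
        raise ValueError("at least one task name is required")
    cmd = ["python", "run_multi_test.py"]
    if use_llm:
        cmd.append("-chat")  # evaluate LLM-generated code
    cmd += ["-task_name", *task_names]
    if stages:
        cmd += ["-stages", *stages]
    return cmd


def run(task_names, use_llm=True, stages=None):
    """Run the command in the current directory and return its exit code."""
    return subprocess.run(build_command(task_names, use_llm, stages)).returncode


if __name__ == "__main__":
    # Print the two evaluation commands from this section without running them.
    print(" ".join(build_command(["Sqrt"], True, ["compile", "precision"])))
    print(" ".join(build_command(["Sqrt"], False, ["compile", "precision"])))
```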

6. Repository Structure Overview

NPUKernelBench/
├── docs/                      # Project documentation and user guides
├── framework/                 # Core logic of the automated testing and evaluation framework
├── kernel_generator/          # LLM-based kernel generator
├── libs/                      # Dependency libraries (e.g., CANN, JSON)
├── tasks/                     # Core benchmark task suite
├── base_config.yaml           # Global configuration file
├── run_multi_test.py          # Main entry script
├── start_vllm_server.sh       # Script for launching vLLM
└── set_framework_env.sh       # Environment setup script

7. Representative Case Studies

7.1 Example: Effect of Knowledge Injection

The following case study, which concerns the usage of the Muls instruction in AscendC, demonstrates the qualitative improvement brought by CoT-based domain knowledge injection in both professional knowledge comprehension and code generation capability.

[Figure: knowledge_injection_effect.png, responses before and after training]

🤔 Original response (before training):
The response exhibits substantial uncertainty, as indicated by expressions such as “possibly” and “assuming.” The model attempts to reason by analogy from general programming knowledge, yet lacks an accurate understanding of AscendC-specific APIs, the precise usage of the Muls instruction, and its architectural context within the Ascend processor. The provided code example (e.g., involving hip/hip_runtime.h) is incorrect and explicitly reveals the model’s lack of domain-specific knowledge, as evidenced by statements such as “insufficient documentation” and “cannot be implemented.”

🎓 Improved response (after training):
The response demonstrates clear expert-level reasoning.

  • 🧠 Structured reasoning (<think>): The model first analyzes the Muls instruction in terms of its operational background (vector multiplication), key parameters (e.g., src0, dst), and essential data layout constraints (e.g., whether src0 must be a scalar or allocated in a specific region).
  • ✅ Accurate implementation: It then provides correct and concrete implementation steps, including initialization, local tensor memory management through InQueue and OutQueue, data movement (CopyIn), and the core computation invocation (ScalarValue).
  • ⌨️ Code correctness: The final generated code example (__aicore__ void Compute()) correctly employs the AscendC API and demonstrates an accurate understanding of both data queuing and vector instructions.
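For readers unfamiliar with the queue-based flow mentioned above, the following plain-Python emulation may help. It is not AscendC code and not the model's output: it assumes Muls multiplies every element of a vector by a scalar, uses deques as stand-ins for the input and output queues, and adds a copy-out stage that the description above does not mention. `MulsPipeline` and `tile_length` are hypothetical names.

```python
from collections import deque


def muls(src, scalar):
    """Plain-Python stand-in for a vector-scalar multiply: dst[i] = src[i] * scalar."""
    return [x * scalar for x in src]


class MulsPipeline:
    """Toy three-stage pipeline; deques stand in for the input and output queues."""

    def __init__(self, scalar, tile_length):
        self.scalar = scalar
        self.tile_length = tile_length
        self.in_queue = deque()
        self.out_queue = deque()

    def copy_in(self, global_src, offset):
        # Stage 1: move one tile from "global memory" into the input queue.
        self.in_queue.append(global_src[offset:offset + self.tile_length])

    def compute(self):
        # Stage 2: dequeue a tile, multiply it by the scalar, enqueue the result.
        self.out_queue.append(muls(self.in_queue.popleft(), self.scalar))

    def copy_out(self, global_dst, offset):
        # Stage 3 (assumed): write the finished tile back to "global memory".
        tile = self.out_queue.popleft()
        global_dst[offset:offset + len(tile)] = tile

    def process(self, global_src):
        """Run all tiles through the three stages and return the output list."""
        global_dst = [0.0] * len(global_src)
        for offset in range(0, len(global_src), self.tile_length):
            self.copy_in(global_src, offset)
            self.compute()
            self.copy_out(global_dst, offset)
        return global_dst


if __name__ == "__main__":
    print(MulsPipeline(scalar=2.0, tile_length=4).process([1.0, 2.0, 3.0, 4.0, 5.0]))
```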

7.2 Example: Comparison of Generated Operators

The following example, which concerns the implementation of a Swish operator in AscendC, illustrates the qualitative improvement brought by CoT-based domain knowledge injection in handling complex operator implementation and tiling strategy design.

[Figure: compare_swish_impl.png, Swish implementations before and after training]

🤔 Original response (before training):
The response exhibits substantial uncertainty in the critical tiling strategy, with phrases such as “possibly,” “obviously not feasible,” and “assuming.” Although the model attempts a simple arithmetic division (1024/48 = 21.33), it fails to understand Ascend core scheduling mechanisms and cannot handle non-divisible remainders appropriately. Its implementation of the Swish operator itself is also confused, with an inaccurate composition of operations such as Negate, Reciprocal, and Multiply, and an imprecise understanding of the required APIs. The response ultimately concludes that compilation would fail.

🎓 Improved response (after training):
The response demonstrates clear expert-level reasoning.

  • 🧠 Structured reasoning (<think>): The model first accurately analyzes the mathematical formulation of the Swish operator, namely $y = x \times \mathrm{sigmoid}(x)$, together with the input specification. More importantly, it designs a robust tiling strategy by correctly identifying the total workload (1024) and the number of cores (48), and by formulating an uneven remainder-aware workload distribution scheme (the first 47 cores process 21 elements each, while the last core processes 37 elements).
  • ✅ Accurate implementation: The model then correctly realizes this tiling strategy in the Init() function by computing blockLength and remainder. In the Compute() function, it provides a correct and concrete operator implementation pipeline, combining AscendC instructions such as Muls, Exp, Adds, and Div to construct the sigmoid function step by step and ultimately complete the Swish computation.
  • ⌨️ Code correctness: The final generated code examples (__aicore__ void Init() and __aicore__ void Compute()) correctly employ the AscendC API, demonstrating not only an accurate understanding of the operator’s mathematical logic, but also a deep mastery of multi-core parallelization and tiling strategies.
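The arithmetic behind this case study can be checked with a short Python sketch. It is an illustration under assumptions, not the generated kernel: `tile_lengths` reproduces the remainder-on-last-core split described above, and `swish` mirrors the Muls, Exp, Adds, Div sequence with ordinary list operations. Both function names are hypothetical.

```python
import math


def tile_lengths(total_length, num_cores):
    """Give every core floor(total/cores) elements; the last core also takes the remainder."""
    block_length = total_length // num_cores
    remainder = total_length % num_cores
    lengths = [block_length] * num_cores
    lengths[-1] += remainder
    return lengths


def swish(values):
    """Compute x * sigmoid(x) as x / (1 + exp(-x)), one elementwise step at a time."""
    neg = [v * -1.0 for v in values]               # Muls-like: multiply by the scalar -1
    exp_neg = [math.exp(v) for v in neg]           # Exp-like: exp(-x)
    denom = [v + 1.0 for v in exp_neg]             # Adds-like: add the scalar 1
    return [x / d for x, d in zip(values, denom)]  # Div-like: x / (1 + exp(-x))


if __name__ == "__main__":
    lengths = tile_lengths(1024, 48)
    print("first core:", lengths[0], "last core:", lengths[-1], "total:", sum(lengths))
    print("swish([-1, 0, 1]):", swish([-1.0, 0.0, 1.0]))
```

With 1024 elements and 48 cores, the block length is 21 and the remainder is 16, so the last core handles 21 + 16 = 37 elements. Note that `math.exp` raises OverflowError for very negative inputs, so this sketch is meant for small test values only.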

8. License

This project is released under the Apache 2.0 License.
