The CUDA backend is the ExecuTorch solution for running models on NVIDIA GPUs. It leverages the AOTInductor compiler to generate optimized CUDA kernels that execute without libtorch, and uses Triton for high-performance GPU kernel generation.
- Optimized GPU Execution: Uses AOTInductor to generate highly optimized CUDA kernels for model operators
- Triton Kernel Support: Leverages Triton for GEMM (General Matrix Multiply), convolution, and SDPA (Scaled Dot-Product Attention) kernels
- Quantization Support: INT4 weight quantization with a tile-packed format for improved performance and a reduced memory footprint (sketched after this list)
- Cross-Platform: Supports both Linux and Windows
- Multiple Model Support: Works with various models, including LLMs, vision-language models, and audio models
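To illustrate the INT4 option, here is a minimal sketch using torchao's weight-only INT4 config. The toy model, device handling, and group size are illustrative assumptions, and the kwarg that selects the tile-packed layout varies across torchao versions, so check your version's Int4WeightOnlyConfig:

import torch
from torch import nn
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Toy model; any nn.Module with Linear layers is quantized the same way.
# torchao's INT4 CUDA kernels expect bfloat16 weights.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))
model = model.eval().to(device="cuda", dtype=torch.bfloat16)

# In-place weight-only INT4 quantization, grouped along the input channels.
quantize_(model, Int4WeightOnlyConfig(group_size=128))

The quantized model is then exported exactly as shown in the example below.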
Below are the requirements for running a CUDA-delegated ExecuTorch model:
- Hardware: A CUDA-capable NVIDIA GPU
- CUDA Toolkit: CUDA 11.x or later (CUDA 12.x recommended)
- Operating System: Linux or Windows
- Drivers: PyTorch-compatible NVIDIA GPU drivers installed
To develop and export models using the CUDA backend:
- Python: 3.8+
- PyTorch: a PyTorch build with CUDA support
- ExecuTorch: Install ExecuTorch with CUDA backend support
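A quick sanity check that these prerequisites are in place (a sketch; it only verifies that PyTorch sees a CUDA device and that the ExecuTorch CUDA modules import):

import torch

assert torch.cuda.is_available(), "PyTorch does not see a CUDA device"
print(torch.version.cuda, torch.cuda.get_device_name(0))

# These imports fail if the installed ExecuTorch lacks CUDA backend support.
from executorch.backends.cuda.cuda_backend import CudaBackend
from executorch.backends.cuda.cuda_partitioner import CudaPartitioner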
The CUDA backend uses the CudaBackend and CudaPartitioner classes to export models. Here is a complete example:
import torch
from executorch.backends.cuda.cuda_backend import CudaBackend
from executorch.backends.cuda.cuda_partitioner import CudaPartitioner
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
from executorch.extension.export_util.utils import save_pte_program
# Configure edge compilation
edge_compile_config = EdgeCompileConfig(
    _check_ir_validity=False,
    _skip_dim_order=True,
)
# Define your model
model = YourModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
# Export the model using torch.export
exported_program = torch.export.export(model, example_inputs)
# Create the CUDA partitioner, keyed by a method name for the compiled program
model_name = "my_model"  # placeholder; also used to name the saved output files
partitioner = CudaPartitioner(
    [CudaBackend.generate_method_name_compile_spec(model_name)]
)
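# The decomposition below is an illustrative sketch, not an ExecuTorch API:
# it rewrites aten.conv1d as an equivalent aten.conv2d by inserting a
# height-1 dimension, since the backend generates Triton kernels for conv2d.
def conv1d_to_conv2d(x, weight, bias=None, stride=(1,), padding=(0,), dilation=(1,), groups=1):
    x = x.unsqueeze(2)            # (N, C, L)   -> (N, C, 1, L)
    weight = weight.unsqueeze(2)  # (O, C/g, K) -> (O, C/g, 1, K)
    out = torch.conv2d(
        x,
        weight,
        bias,
        stride=(1, stride[0]),
        padding=(0, padding[0]),
        dilation=(1, dilation[0]),
        groups=groups,
    )
    return out.squeeze(2)         # (N, O, 1, L') -> (N, O, L')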
# Add decompositions for Triton to generate kernels
exported_program = exported_program.run_decompositions({
    torch.ops.aten.conv1d.default: conv1d_to_conv2d,
})
# Lower to ExecuTorch with CUDA backend
et_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[partitioner],
    compile_config=edge_compile_config,
)
# Convert to executable program and save
exec_program = et_program.to_executorch()
save_pte_program(exec_program, model_name, "./output_dir")

This generates .pte and .ptd files that can be executed on CUDA devices.
For a complete working example, see the CUDA export script.
To run the model on device, use the standard ExecuTorch runtime APIs. See Running on Device for more information.
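If you are driving the model from Python instead, a minimal sketch using ExecuTorch's Python runtime follows. Assumptions: your ExecuTorch build registers the CUDA backend in the Python bindings, the file and method names are placeholders, and programs whose weights were saved to a separate .ptd file may need that file available alongside the .pte or the C++ runner APIs instead:

import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("./output_dir/my_model.pte")  # placeholder path
method = program.load_method("forward")
outputs = method.execute((torch.randn(1, 3, 224, 224),))
print(outputs)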
When building from source, pass -DEXECUTORCH_BUILD_CUDA=ON when configuring the CMake build to compile the CUDA backend.
# CMakeLists.txt
add_subdirectory("executorch")
...
target_link_libraries(
    my_target
    PRIVATE
        executorch
        extension_module_static
        extension_tensor
        aoti_cuda_backend
)
No additional steps are necessary to use the backend beyond linking the target. CUDA-delegated .pte and .ptd files will automatically run on the registered backend.
For complete end-to-end examples of exporting and running models with the CUDA backend, see:
- Whisper — Audio transcription model with CUDA support
- Voxtral — Audio multimodal model with CUDA support
- Gemma3 — Vision-language model with CUDA support
These examples demonstrate the full workflow including model export, quantization options, building runners, and runtime execution.
ExecuTorch provides Makefile targets for building these example runners:
make whisper-cuda # Build Whisper runner with CUDA
make voxtral-cuda # Build Voxtral runner with CUDA
make gemma3-cuda # Build Gemma3 runner with CUDA