[DEPRECATED] Moved to ROCm/rocm-systems repo
Online CUDA Occupancy Calculator
Agents and an RL environment for optimizing GPU kernels on AMD ROCm with LLMs. Benchmarks LLM serving workloads end-to-end, profiles bottleneck kernels, optimizes them via Claude Code or Codex, and scores the results on compilation, correctness, and speedup.
(Spring 2017) Assignment 2: GPU Executor
Runs a single CUDA/OpenCL kernel, taking its source from a file and its arguments from the command line
GPU Drano: static analysis for GPU programs.
The official implementation of the paper "AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation"
Repo containing artifacts for the NeurIPS 2025 tutorial "How to Build Agents to Generate Kernels for Faster LLMs (and Other Models!)"
AgentKernelArena provides an end-to-end siloed-benchmarking environment where different LLM-powered agents—such as Cursor Agent, Claude Code, Codex, SWE-agent, and GEAK—can be evaluated side-by-side on the same GPU kernel tasks, using objective and reproducible metrics.
Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.
Open-source skill library for AI coding agents to write, optimize, and debug high-performance compute kernels across CUDA, Triton, and quantized workloads.
From-scratch reimplementation of DeepSeek's Native Sparse Attention (arXiv:2502.11089) in Triton + CUDA Hopper WGMMA. 7.4x faster than FlashAttention-3 at 64k context. Five-model training fleet, perplexity sweep, LongBench v2, MoBA comparison.
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, FP8 GEMM — CPU-testable references, autotuning, and benchmarking
Real-time NVIDIA GPU command capture, decoding, and visualization
A self-hosted low-level functional-style programming language 🌀
High-performance GPU-accelerated C# scripting for Rhino Grasshopper, powered by ILGPU
Noeris — autonomous kernel fusion discovery + Triton autotuning for LLM kernels and Gemma layer deeper fusion (A100/H100 wins).
22 progressive Triton GPU kernels, from elementwise ops to Flash Attention v2, featuring correctness tests and PyTorch throughput/TFLOPS benchmarks.
CUDA fast paths for MoE dispatch and weighted combine with TensorRT-LLM-oriented trace replay benchmarks