I bridge the gap between Python's flexibility and silicon's raw power. While others fine-tune models, I engineer the infrastructure they run on, squeezing every last FLOP out of GPUs.
My core expertise lies in Hardware-Software Co-Design. I specialize in identifying severe memory bottlenecks in LLM inference pipelines and writing custom fused kernels to bypass them. I am currently transitioning from manual kernel engineering to automated AI Compiler (MLIR) development.
Current Focus: Deep-diving into MLIR and reverse-engineering the OpenAI Triton backend to write custom compiler passes for automated IR-level optimizations.
Developed for the global MLSys'26 FlashInfer-Bench Contest (NVIDIA Track).
- The Challenge: Standard PyTorch MoE implementations suffer from GPU starvation caused by excessive kernel launches and from memory-bound `aten::index_add_` reductions.
- The Fix: Engineered custom fused FP8 (E4M3) GEMM kernels in OpenAI Triton and wrote custom atomic scatter-add operations to bypass CPU dispatch latency.
- The Result: Hit 832 TFLOPS peak throughput and achieved a 4.68x end-to-end speedup over PyTorch BF16 on highly fragmented architectural configs (Grok/DeepSeek styles).
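The scatter-add reduction that makes `aten::index_add_` memory-bound can be sketched in NumPy. This is a minimal illustration of the access pattern only (the function name and shapes are illustrative, not the contest code): `np.add.at` is an unbuffered scatter-add, so duplicate destination indices accumulate correctly, analogous to atomic adds on the GPU.

```python
import numpy as np

def moe_scatter_add(expert_out, token_idx, num_tokens):
    """Reduce per-expert output rows back into token order.

    expert_out: (num_routed, d_model) rows produced by the experts
    token_idx:  (num_routed,) destination token for each row
    np.add.at is an unbuffered (atomic-like) scatter-add, so rows
    routed to the same token (top-k > 1) sum instead of overwriting.
    """
    out = np.zeros((num_tokens, expert_out.shape[1]), dtype=expert_out.dtype)
    np.add.at(out, token_idx, expert_out)
    return out

# Two experts each produced a row for token 0 (top-2 routing); they sum.
rows = np.array([[1.0, 2.0], [10.0, 20.0], [3.0, 4.0]])
idx = np.array([0, 0, 1])
print(moe_scatter_add(rows, idx, 2))  # [[11. 22.] [ 3.  4.]]
```

Because every routed row triggers a read-modify-write to scattered locations, this reduction is bandwidth-bound, which is what a fused kernel with on-chip accumulation avoids.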
Optimizing enterprise-scale LLM serving infrastructure (15K+ Stars).
- The Fix: Engineered an inference-optimized Fused RMSNorm kernel in Triton.
- The Result: Achieved a 24.6% throughput speedup (126 → 157 tok/s) and eliminated 2.1 GB/s of redundant HBM traffic by removing RSTD gradient-state writes. Validated via NVIDIA Nsight Compute profiling.
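A minimal NumPy sketch of the math involved (not the Triton kernel itself): a training-oriented RMSNorm kernel also stores `rstd` to HBM for the backward pass, while an inference-only kernel can keep it in registers, which is the redundant state write referred to above.

```python
import numpy as np

def rmsnorm_inference(x, weight, eps=1e-6):
    """RMSNorm forward, inference-only.

    A training kernel would additionally write rstd
    (1 / sqrt(mean(x^2) + eps)) back to HBM for the backward pass;
    at inference that store is pure overhead, so a fused kernel
    keeps rstd in registers and emits only the normalized output.
    """
    rstd = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x * rstd * weight

x = np.array([[3.0, 4.0]])
w = np.ones(2)
y = rmsnorm_inference(x, w)  # output has ~unit mean square per row
```

The design point: fusing the reduction and the scale into one kernel means `x` is read once and `y` written once, with no intermediate tensors touching HBM.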
Accelerating GPU data science workflows (10K+ Stars).
- The Fix: Reverse-engineered the C++/Cython bindings (`.pxd`/`.pyx`) and authored a custom "Lazy Extraction" CUDA C++ kernel to prevent token materialization during string splits.
- The Result: Achieved a ~160x execution speedup (137 ms → 0.85 ms) on 10M+ row DataFrames with zero memory overhead. Merged into the main repository.
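The core idea behind lazy extraction can be shown in a few lines of plain Python (a conceptual sketch, not the CUDA kernel): to fetch the n-th field of a split, you never need to materialize the full token list; you scan delimiter positions and slice exactly one substring.

```python
def nth_token(s, sep, n):
    """Return the n-th field of s.split(sep) without building the list.

    Walks delimiter positions and slices a single substring, the same
    trick as materializing only the requested token per row on the GPU
    instead of allocating every intermediate token.
    """
    start = 0
    for _ in range(n):
        hit = s.find(sep, start)
        if hit == -1:
            return None  # fewer than n + 1 fields
        start = hit + len(sep)
    end = s.find(sep, start)
    return s[start:] if end == -1 else s[start:end]

print(nth_token("a,b,c,d", ",", 2))  # c
```

On millions of rows, skipping the intermediate token buffers is what turns the operation from allocation-bound into a single pass over the character data.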
Built for the AMD AI Developer Lemonade Challenge.
- Details: An open-source profiling suite to benchmark Local AI workflows (NPU/iGPU). Connects to OpenAI-compatible APIs to automatically measure TTFT (Time To First Token) and TPS (Tokens Per Second) across varying FP16/INT4 quantizations.
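The two metrics reduce to simple timestamp arithmetic. A hedged sketch (function name and trace are illustrative; real timestamps would come from consuming a streaming OpenAI-compatible response):

```python
def stream_metrics(t_request, token_times):
    """Compute TTFT and TPS from a request timestamp and the arrival
    times of each streamed token (all in seconds).

    TTFT: delay from sending the request to the first token (prefill).
    TPS:  steady-state decode rate over the remaining tokens.
    """
    ttft = token_times[0] - t_request
    decode_span = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_span if decode_span > 0 else 0.0
    return ttft, tps

# Synthetic trace: first token after 0.5 s, then one token every 20 ms.
times = [0.5 + 0.02 * i for i in range(101)]
ttft, tps = stream_metrics(0.0, times)
print(round(ttft, 2), round(tps, 1))  # 0.5 50.0
```

Excluding the first token from the TPS denominator keeps prefill latency from contaminating the decode-rate measurement, which is why the two numbers are reported separately.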
| Languages & Compilers | GPU Compute & Profiling | AI Infrastructure |
|---|---|---|
I don't just use libraries; I understand how they map to physical hardware.
- GPU Architecture: Memory Coalescing, Shared Memory Banking (avoiding bank conflicts), Warp Divergence, FP8 Tensor Core utilization, Occupancy tuning.
- Systems Engineering: Intermediate Representations (TTIR/TTGIR), Kernel Fusion logic, Atomic Operations & Contention management.
- Operating Systems: Virtual Memory & Paging, Concurrency (Mutex/Deadlocks), Process vs Thread memory models.