Hi there, I'm Umang 👋

GPU Systems & AI Compiler Engineer | High-Performance Computing (HPC)

I bridge the gap between Python's flexibility and silicon's raw power. While others fine-tune models, I engineer the infrastructure they run on, squeezing every last FLOP out of GPUs.


โš™๏ธ What I Do

My core expertise lies in Hardware-Software Co-Design. I specialize in identifying severe memory bottlenecks in LLM inference pipelines and writing custom fused kernels to bypass them. I am currently transitioning from manual kernel engineering to automated AI Compiler (MLIR) development.

Current Focus: Deep-diving into MLIR and reverse-engineering the OpenAI Triton backend to write custom compiler passes for automated IR-level optimizations.


🔥 Signature Projects & Open Source Impact

1. Accelerated FP8 MoE Kernels for NVIDIA B200 (Blackwell)

Developed for the global MLSys'26 FlashInfer-Bench Contest (NVIDIA Track).

  • The Challenge: Standard PyTorch MoE implementations suffer from severe kernel-launch overhead and memory-bound aten::index_add_ reductions.
  • The Fix: Engineered custom Fused FP8 (E4M3) GEMM kernels in OpenAI Triton and wrote custom Atomic Scatter-Add operations to bypass CPU dispatch latency.
  • The Result: Hit 832 TFLOPS peak throughput and achieved a 4.68x end-to-end speedup over PyTorch BF16 on highly fragmented architectural configs (Grok/DeepSeek styles).
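The scatter-add that replaces aten::index_add_ can be illustrated in plain Python. This is a conceptual sketch only (the names scatter_add, dest, index, and src are illustrative, not from the project); on the GPU, each accumulation becomes an atomic add fused into the GEMM epilogue rather than a separate memory-bound kernel:

```python
# Pure-Python sketch of the scatter-add reduction an MoE layer performs
# when combining expert outputs back into token order. All names here are
# illustrative, not taken from the actual kernels.

def scatter_add(dest, index, src):
    """Accumulate src[i] into dest[index[i]] -- what aten::index_add_ does.

    On a GPU this becomes one atomic add per element; fusing it into the
    GEMM epilogue avoids launching a separate memory-bound kernel.
    """
    for i, row in zip(index, src):
        dest[i] = [a + b for a, b in zip(dest[i], row)]
    return dest

# Two experts produced output rows for tokens 0 and 1; token 0 was routed
# to both experts, so its two rows are summed in place.
out = scatter_add(
    dest=[[0.0, 0.0], [0.0, 0.0]],
    index=[0, 1, 0],
    src=[[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]],
)
# out == [[1.5, 2.5], [3.0, 4.0]]
```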

2. Liger-Kernel (LinkedIn Open Source) | Core Contributor

Optimizing enterprise-scale LLM serving infrastructure (15K+ Stars).

  • The Fix: Engineered an inference-optimized Fused RMSNorm kernel in Triton.
  • The Result: Achieved a 24.6% throughput speedup (126 → 157 tok/s) and eliminated 2.1 GB/s of redundant HBM traffic by removing RSTD gradient-state writes. Validated via NVIDIA Nsight Compute profiling.
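For reference, this is the math an RMSNorm kernel computes, sketched in plain Python (the fused Triton version does it in a single pass over HBM; this sketch is mine, not the project's code). At inference time only y = x / rms(x) * g is needed, so the 1/rms (RSTD) value used for backward passes never has to be written out:

```python
import math

# Reference RMSNorm: the inference path needs only the normalized output,
# so the reciprocal-RMS (RSTD) byproduct required for training gradients
# can be skipped entirely, saving one tensor write of HBM bandwidth.

def rmsnorm(x, g, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * gi for v, gi in zip(x, g)]

y = rmsnorm([3.0, 4.0], [1.0, 1.0])
# rms = sqrt((9 + 16) / 2 + eps) ~= 3.5355, so y ~= [0.8485, 1.1314]
```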

3. NVIDIA RAPIDS (cuDF) | Core Contributor

Accelerating GPU data science workflows (10K+ Stars).

  • The Fix: Reverse-engineered C++/Cython bindings (.pxd, .pyx) and authored a custom "Lazy Extraction" CUDA C++ kernel to prevent token materialization during string splits.
  • The Result: Achieved a ~160x execution speedup (137 ms → 0.85 ms) on 10M+ row DataFrames with zero memory overhead. Merged into the main repository.
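The idea behind lazy extraction can be shown in a few lines of Python. This is a conceptual sketch under my own naming (eager_get / lazy_get are illustrative; the real fix is a CUDA C++ kernel inside cuDF): split().get(n) only needs the n-th token, so instead of materializing every token of every string, the kernel scans directly for the n-th delimiter-bounded slice:

```python
# Eager vs. lazy n-th token extraction. The eager path allocates a list of
# ALL tokens just to read one; the lazy path slices the n-th token out of
# the original buffer without materializing anything.

def eager_get(s, sep, n):
    return s.split(sep)[n]          # materializes every token first

def lazy_get(s, sep, n):
    start = 0
    for _ in range(n):              # skip past the first n separators
        start = s.index(sep, start) + len(sep)
    end = s.find(sep, start)
    return s[start:] if end == -1 else s[start:end]

assert lazy_get("a,b,c", ",", 1) == eager_get("a,b,c", ",", 1) == "b"
```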

4. LemonadeBench (AMD)

Built for the AMD AI Developer Lemonade Challenge.

  • Details: An open-source profiling suite to benchmark Local AI workflows (NPU/iGPU). Connects to OpenAI-compatible APIs to automatically measure TTFT (Time To First Token) and TPS (Tokens Per Second) across varying FP16/INT4 quantizations.
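The two metrics fall straight out of token arrival timestamps. A minimal sketch, assuming timestamps collected from a streaming API response (the function name and the sample numbers are illustrative, not real benchmark output):

```python
# TTFT (Time To First Token) and TPS (Tokens Per Second) derived from the
# arrival times of a streamed response. Timestamps are in seconds relative
# to an arbitrary clock; the sample values below are synthetic.

def ttft_and_tps(request_start, token_timestamps):
    ttft = token_timestamps[0] - request_start            # prefill latency
    decode_span = token_timestamps[-1] - token_timestamps[0]
    tps = (len(token_timestamps) - 1) / decode_span       # steady-state decode rate
    return ttft, tps

# First token at 0.25 s, then one token every 50 ms.
ttft, tps = ttft_and_tps(0.0, [0.25, 0.30, 0.35, 0.40])
# ttft == 0.25 s, tps ~= 20 tok/s
```

Measuring TPS over the decode span only (excluding TTFT) keeps the rate comparable across prompt lengths.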

๐Ÿ› ๏ธ The Arsenal (Tech Stack)

Languages & Compilers: C++, Python, MLIR
GPU Compute & Profiling: CUDA, Triton, Nsight
AI Infrastructure: PyTorch, vLLM, HuggingFace

🧠 Core Competencies (Under the Hood)

I don't just use libraries; I understand how they map to physical hardware.

  • GPU Architecture: Memory Coalescing, Shared Memory Banking (avoiding conflicts), Warp Divergence, FP8 Tensor Core utilization, Occupancy tuning.
  • Systems Engineering: Intermediate Representations (TTIR/TTGIR), Kernel Fusion logic, Atomic Operations & Contention management.
  • Operating Systems: Virtual Memory & Paging, Concurrency (Mutex/Deadlocks), Process vs Thread memory models.
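As one concrete example of the banking point above, here is a toy model of NVIDIA shared-memory banking (a sketch under the usual assumptions: 32 banks, 4-byte bank width, bank = (byte address / 4) mod 32; function names are mine):

```python
# Toy shared-memory bank-conflict calculator. A warp of 32 lanes conflicts
# when two lanes touch the same bank in one access; padding a 32-float row
# to 33 floats is the classic fix for strided column access.

def max_conflict(addresses):
    banks = [(a // 4) % 32 for a in addresses]     # bank hit by each lane
    return max(banks.count(b) for b in set(banks)) # 1 == conflict-free

# Column access of a 32x32 float array: stride 32 floats (128 bytes), so
# every lane lands in bank 0 -- a 32-way conflict, serialized 32x.
col = [lane * 32 * 4 for lane in range(32)]

# Same access with each row padded to 33 floats: lanes spread over all
# 32 banks, so the access is conflict-free.
padded = [lane * 33 * 4 for lane in range(32)]
# max_conflict(col) == 32, max_conflict(padded) == 1
```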


Pinned

  1. Triton-Inference-Kernels (Public, Jupyter Notebook)

    Custom OpenAI Triton kernels for high-performance model inference. Accelerates models on NVIDIA GPUs by leveraging Triton's productivity and CUDA-level performance.

  2. gpu-systems-playgrund (Public, CUDA)

    GPU systems playground with CUDA kernel experiments and performance profiling.

  3. Veritas-AI-Tracking-Misinformation-with-Autonomous-Agents (Public, Python)

    Veritas AI: An autonomous agent crew that scrapes prediction markets to create a RAG-powered chatbot for tracking misinformation and public belief in real time.

  4. Hy-LoRA-A-Hybrid-SVD-LoRA-Strategy-for-Efficient-LLM-Adaptation (Public, Python)

    Achieve >60% LLM compression with near-baseline perplexity using a novel "Compress-then-Adapt" strategy.

  5. cudf (Public, C++; forked from rapidsai/cudf)

    cuDF - GPU DataFrame Library

  6. cudf-lazy-string-poc (Public, Python)

    Proof of Concept: Achieving ~160x speedup on cuDF string extraction (split().get()) using fused lazy CUDA kernels on a Tesla T4.