
Add mHC (Manifold-Constrained Hyper-Connections) fused kernels to Liger-Kernel #1066

@yukiu00

Description


🚀 The feature, motivation and pitch

Background (Paper)

mHC: Manifold-Constrained Hyper-Connections (arXiv:2512.24880v2)
https://arxiv.org/abs/2512.24880

Paper alignment (what this proposal implements)

  • Constrain the residual mapping H_res via Sinkhorn-Knopp onto the doubly-stochastic set (Birkhoff polytope), restoring identity-mapping stability while keeping multi-stream residual benefits.
  • Follow the paper’s fused-kernel decomposition and recompute strategy (Sec. 4.3.1, Eq. (14)–(19)), including mixed precision (e.g., activations x in BF16/FP16; projections/coefficients in FP32/TF32).
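To make the first bullet concrete, here is a minimal, framework-agnostic sketch of the Sinkhorn-Knopp iteration that projects a matrix toward the doubly-stochastic set (Birkhoff polytope). This is only the reference math for the constraint, not the paper's fused Triton kernel or its exact parameterization; the function name and `n_iters` parameter are illustrative.

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=50):
    """Reference (unfused) Sinkhorn-Knopp projection: push exp(logits)
    toward the doubly-stochastic set by alternating row/column
    normalization. In mHC this constrains the residual mapping H_res."""
    M = np.exp(logits)  # exp ensures strict positivity
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # cols sum to 1
    return M
```

A doubly-stochastic H_res preserves the total "mass" carried across residual streams, which is what restores identity-mapping stability while still allowing cross-stream mixing.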

Why this matters for Liger-Kernel

  • Strong fit with Liger-Kernel’s goals: fusion, bandwidth reduction, and activation memory efficiency.
  • mHC is a paper-defined, kernelization-friendly target with practical value for large-scale LM training.
  • Extends coverage beyond single-stream residual style blocks.

Proposal

#1065

  • Add Triton fused kernels for mHC (coeffs / Sinkhorn / apply) with forward + backward.
  • Add LigerMHC module + liger_mhc_* functional APIs following existing Liger naming.
  • Add allow_fp32 as opt-in (default remains BF16/FP16 mixed precision; intended for specific/debug use cases).
  • Add correctness tests (ops + transformer-level + convergence) and benchmarks.
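For the "apply" step in the list above, a conceptual (unfused) reference of the multi-stream residual mixing that the fused kernel would compute might look like the following. This is a sketch under assumptions: the function name, the `[HC, T, C]` stream layout, and the plain einsum formulation are illustrative and not the paper's exact fused equations.

```python
import numpy as np

def mhc_residual_mix(streams, H_res):
    """Conceptual reference for the mHC 'apply' step: mix HC residual
    streams (shape [HC, T, C]) with a doubly-stochastic H_res
    (shape [HC, HC]). The proposed Triton kernel would fuse this with
    the surrounding coeff/Sinkhorn steps instead of materializing it."""
    return np.einsum("ij,jtc->itc", H_res, streams)
```

When H_res is the identity, this reduces to ordinary single-stream residual behavior, which is the identity-mapping property the Sinkhorn constraint is meant to preserve.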

Repro / Environment

  • GPU: RTX 3090 (CUDA)
  • torch: 2.10.0+cu128, CUDA: 12.8
  • Benchmark measurement: warmup + median over 20 runs
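The measurement protocol above (warmup, then median over 20 runs) can be sketched as a small harness. This is an illustrative stdlib-only version, not the actual benchmark script; for CUDA kernels you would additionally call `torch.cuda.synchronize()` around the timed region so launches don't return before the kernel finishes.

```python
import time
import statistics

def bench(fn, warmup=3, runs=20):
    """Warm up `fn`, then return the median wall-clock time (seconds)
    over `runs` calls. For GPU work, synchronize before reading the
    clock; perf_counter alone only times the kernel launch."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

The median is preferred over the mean here because it is robust to one-off stalls (clock ramping, other processes touching the GPU).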

Benchmark Results

Environment

| Item | Value |
| --- | --- |
| GPU | NVIDIA GeForce RTX 3090 |
| CUDA | 12.8 |
| PyTorch | 2.10.0+cu128 |
| Triton | 3.6.0 |
| Python | 3.12.4 |
| OS | Linux 6.8.0-90-generic (x86_64) |

Micro-Benchmarks (B=4, HC=4, C=4096, tmax=20, BF16, T=128~2048)

| Kernel | Speedup | Memory |
| --- | --- | --- |
| coeffs | 3.5x faster | 77% less |
| pre | 2.2x faster | 33% less |
| post_res | 1.9x faster | 31% less |

(Per-kernel Forward / Backward / Full / Memory plots were attached as images.)

End-to-End LM Benchmark (B=2, T=256, HC=4, layers=2, heads=8, vocab=4096, BF16, hidden_size=256~1024)

| Kernel | Speedup | Memory |
| --- | --- | --- |
| mhc_llama_like_lm | 1.5x faster | 18% less |

(Forward / Backward / Full / Memory plots were attached as images.)

