# Optimization Overview

This tutorial series walks through key optimization techniques in ML compilers using MLIR, ordered by pedagogical progression. Each stage builds on concepts from the previous one.

## Environment Setup

### Environment Preparation with conda (Optional)

- The OS must be newer than Ubuntu 22.04, so that gcc-13 is available from the default archives.
- Install gcc-13 and g++-13:

```bash
apt update -y && \
apt install -yq gcc-13 g++-13
# On older releases, the toolchain PPA provides newer compilers instead:
# apt install -yq software-properties-common
# add-apt-repository -y ppa:ubuntu-toolchain-r/test
# apt update -y
# apt install -yq gcc-11 g++-11
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 20
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 20
```
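
To confirm the alternatives are active, check which versions are now the defaults:

```bash
gcc --version   # should report gcc 13.x
g++ --version   # should report g++ 13.x
```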

- Install cmake and ninja. Any installation method works; conda is the most convenient for me:

```bash
conda create -n mlir -y
conda activate mlir
# conda install cmake ninja clang-format clang lld ncurses mlir llvm -c conda-forge
conda install cmake ninja clang-format clang clang-tools mlir zlib spdlog fmt lit llvm=19.* -c conda-forge -y
# Or create the environment and install everything in one step:
# conda create -n mlir cmake ninja clang-format clang mlir zlib spdlog fmt lit llvm -c conda-forge -y
```
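
A quick sanity check that the toolchain landed in the environment (this assumes the conda-forge packages put `mlir-opt` and friends on the PATH):

```bash
which cmake ninja clang mlir-opt
mlir-opt --version
```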

- Build the examples with conda:

```bash
cd example
bash build_with_conda.sh all
```

### Environment Preparation with Dev Containers

In VS Code, run the `Dev Containers: Open Folder in Container...` command from the command palette.

- Build the examples inside the dev container:

```bash
cd example
bash scripts/sync_deps.sh
bash scripts/build_deps.sh
bash build.sh all
```

## Configure clangd

```bash
cd example
# After configuring the project with CMake, generate the compilation database for clangd:
compdb -p build list > compile_commands.json
```
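
`compdb` is a separate helper (not part of CMake) that extends the CMake-generated compilation database with entries for header files; if it is not already installed, it is available from PyPI:

```bash
pip install compdb
```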

## Plan

### Phase 1: MatMul (Foundation)

**Goal:** Establish core optimization vocabulary and mechanics.

| Topic                | Description |
| -------------------- | ----------- |
| Structured Op        | Define and lower a matmul via `linalg.generic` / named ops; understand the iteration domain, indexing maps, and payload (see the sketch after this table). |
| Tiling               | Apply `scf.forall` / `scf.for` tile-and-fuse to decompose the M×N×K loop nest; explore tile-size trade-offs (a transform-dialect sketch follows below). |
| Locality             | Demonstrate cache-friendly access via loop permutation (MKN vs MNK), packing, and micro-kernel promotion to registers. |
| Simple Cost Model    | Introduce a basic analytical model (FLOPs, memory traffic, arithmetic intensity) to guide tile-size selection. |
| Pipeline Abstraction | Compose the above into a reusable pass pipeline: tile → promote → vectorize → lower, showing how MLIR pass infrastructure orchestrates transformations. |

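As a concrete anchor for the structured-op row, here is a minimal sketch of the matmul written as a `linalg.generic` with explicit indexing maps. The named op `linalg.matmul` expresses the same computation; the generic form just makes the iteration domain and payload visible (shapes are illustrative, not part of the tutorial code):

```mlir
#map_a = affine_map<(m, n, k) -> (m, k)>
#map_b = affine_map<(m, n, k) -> (k, n)>
#map_c = affine_map<(m, n, k) -> (m, n)>

func.func @matmul(%A: tensor<128x256xf32>, %B: tensor<256x64xf32>,
                  %C: tensor<128x64xf32>) -> tensor<128x64xf32> {
  // Iteration domain (m, n, k): m and n are parallel, k is the reduction.
  %0 = linalg.generic
      {indexing_maps = [#map_a, #map_b, #map_c],
       iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%A, %B : tensor<128x256xf32>, tensor<256x64xf32>)
      outs(%C : tensor<128x64xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    // Payload: one multiply-accumulate per point of the iteration domain.
    %mul = arith.mulf %a, %b : f32
    %add = arith.addf %mul, %c : f32
    linalg.yield %add : f32
  } -> tensor<128x64xf32>
  return %0 : tensor<128x64xf32>
}
```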
**Deliverable:** An end-to-end optimized matmul that is competitive with a naive BLAS call, with clear before/after IR at every stage.
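
To make the tiling and pipeline steps concrete, here is a sketch of a transform-dialect script that performs the tiling, assuming the LLVM 19 transform dialect (op names and syntax have shifted across LLVM releases, so treat this as indicative rather than exact):

```mlir
// Run with: mlir-opt --transform-interpreter (flag name per recent LLVM; may vary).
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    // Find the matmul payload op (the linalg.generic from the sketch above).
    %mm = transform.structured.match ops{["linalg.generic"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // Tile M, N, K by 32x32x8: produces the tiled op plus one scf.for per tile size.
    %tiled, %m_loop, %n_loop, %k_loop =
        transform.structured.tile_using_for %mm tile_sizes [32, 32, 8]
        : (!transform.any_op)
        -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
    transform.yield
  }
}
```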

---

### Phase 2: Conv2D + Activation Fusion (Spatial & Fusion)

**Goal:** Extend tiling to spatial dimensions and introduce operator fusion.

| Topic          | Description |
| -------------- | ----------- |
| Fusion         | Fuse an elementwise activation (ReLU, GELU) into the convolution producer-consumer pair; understand producer-consumer analysis and the legality of fusion. |
| Spatial Tiling | Tile the output height and width dimensions; manage the resulting input-tile expansion due to the kernel window (halo). |
| Layout         | Explore NHWC vs NCHW (and packed variants like NCHWc); understand how data layout affects vectorization and memory access patterns. |
| Halo / Reuse   | Handle overlapping input regions across tiles; compute the halo size from kernel size, stride, and dilation; demonstrate data reuse (see the formula and sketch after this table). |

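For the halo computation: an output tile of extent `T` along a spatial dimension needs an input extent of `(T - 1) * stride + (K - 1) * dilation + 1` for a kernel of size `K`, and the overlap between the input regions of adjacent tiles is the halo. As an anchor for the fusion topic, here is a minimal sketch of the unfused conv + ReLU pair using a named convolution op (shapes are illustrative):

```mlir
#id4 = affine_map<(n, h, w, c) -> (n, h, w, c)>

func.func @conv_relu(%in: tensor<1x58x58x64xf32>, %filter: tensor<3x3x64x128xf32>,
                     %init: tensor<1x56x56x128xf32>) -> tensor<1x56x56x128xf32> {
  // 3x3 convolution, stride 1, dilation 1: 58x58 input -> 56x56 output.
  %conv = linalg.conv_2d_nhwc_hwcf
      {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
      ins(%in, %filter : tensor<1x58x58x64xf32>, tensor<3x3x64x128xf32>)
      outs(%init : tensor<1x56x56x128xf32>) -> tensor<1x56x56x128xf32>
  // Elementwise ReLU: the classic candidate for fusion into the conv's loop nest.
  %zero = arith.constant 0.0 : f32
  %relu = linalg.generic
      {indexing_maps = [#id4, #id4],
       iterator_types = ["parallel", "parallel", "parallel", "parallel"]}
      ins(%conv : tensor<1x56x56x128xf32>)
      outs(%init : tensor<1x56x56x128xf32>) {
  ^bb0(%v: f32, %out: f32):
    %r = arith.maximumf %v, %zero : f32
    linalg.yield %r : f32
  } -> tensor<1x56x56x128xf32>
  return %relu : tensor<1x56x56x128xf32>
}
```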
**Deliverable:** A fused conv2d + activation kernel with explicit spatial tiling, demonstrating measurable speedup from fusion and layout selection.

---

### Phase 3: LayerNorm / Softmax (Reduction Scheduling)

**Goal:** Tackle reduction-heavy operations where numerical stability and scheduling are tightly coupled.

| Topic                      | Description |
| -------------------------- | ----------- |
| Reduction Scheduling       | Implement multi-pass (mean → variance → normalize) vs single-pass (Welford) reduction strategies; tile reductions across threads. |
| Scratch Buffer             | Allocate and manage intermediate buffers (`memref.alloca` / workspace) for partial results; understand buffer lifetime and placement. |
| Numerics–Schedule Coupling | Show how the softmax "max-subtract" trick and log-sum-exp rewriting are not just numerical choices but directly constrain the legal schedules (see the sketch after this table). |

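To ground the numerics-schedule coupling: the max-subtract trick keeps every exponential bounded, but it also means the row-max reduction must complete before any `exp` tile can run, which is exactly the scheduling constraint the table refers to. A minimal sketch of that stabilizing first pass (shapes are illustrative; `%init` is assumed pre-filled with `-inf`, e.g. via `linalg.fill`):

```mlir
#map_in  = affine_map<(r, c) -> (r, c)>
#map_out = affine_map<(r, c) -> (r)>

// Row-wise max over a 64x512 input: the first pass of a numerically safe softmax.
func.func @row_max(%x: tensor<64x512xf32>, %init: tensor<64xf32>) -> tensor<64xf32> {
  %m = linalg.generic
      {indexing_maps = [#map_in, #map_out],
       iterator_types = ["parallel", "reduction"]}
      ins(%x : tensor<64x512xf32>)
      outs(%init : tensor<64xf32>) {
  ^bb0(%xi: f32, %acc: f32):
    %mx = arith.maximumf %xi, %acc : f32
    linalg.yield %mx : f32
  } -> tensor<64xf32>
  return %m : tensor<64xf32>
}
```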
**Deliverable:** A numerically stable, tiled LayerNorm/Softmax implementation with clear discussion of how algorithmic rewrites enable (or block) specific schedules.

---

### Phase 4: Subgraph Fusion & Memory Planning (Graph Level)

**Goal:** Move from single-op to multi-op / graph-level optimization.

| Topic                    | Description |
| ------------------------ | ----------- |
| Graph Scheduling         | Decide fusion groups and execution order across a small subgraph (e.g., matmul → bias → layernorm); model the trade-off between parallelism and memory pressure (see the sketch after this table). |
| Peak Memory Optimization | Apply operator reordering, in-place updates, and buffer sharing (liveness analysis) to minimize peak memory; visualize the memory waterline before/after. |

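For reference, a minimal sketch of the kind of subgraph this phase schedules (layernorm elided; shapes illustrative). The broadcast bias add is written as a separate `linalg.generic` so that it is an obvious epilogue-fusion candidate and its intermediate tensor is a target for memory planning:

```mlir
#id2   = affine_map<(i, j) -> (i, j)>
#bcast = affine_map<(i, j) -> (j)>

func.func @subgraph(%A: tensor<64x128xf32>, %B: tensor<128x256xf32>,
                    %bias: tensor<256xf32>) -> tensor<64x256xf32> {
  %zero = arith.constant 0.0 : f32
  %empty = tensor.empty() : tensor<64x256xf32>
  %init = linalg.fill ins(%zero : f32) outs(%empty : tensor<64x256xf32>) -> tensor<64x256xf32>
  %mm = linalg.matmul ins(%A, %B : tensor<64x128xf32>, tensor<128x256xf32>)
                      outs(%init : tensor<64x256xf32>) -> tensor<64x256xf32>
  // Broadcast bias add: fusing this into the matmul's epilogue removes one
  // 64x256 intermediate from the live set, lowering the memory waterline.
  %out = linalg.generic
      {indexing_maps = [#id2, #bcast, #id2],
       iterator_types = ["parallel", "parallel"]}
      ins(%mm, %bias : tensor<64x256xf32>, tensor<256xf32>)
      outs(%empty : tensor<64x256xf32>) {
  ^bb0(%m: f32, %b: f32, %o: f32):
    %s = arith.addf %m, %b : f32
    linalg.yield %s : f32
  } -> tensor<64x256xf32>
  return %out : tensor<64x256xf32>
}
```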
**Deliverable:** A small end-to-end subgraph whose peak memory and kernel count are jointly optimized, with tooling to visualize the memory timeline.

---

### Suggested Timeline

| Week | Phase                         | Key Milestone                                |
| ---- | ----------------------------- | -------------------------------------------- |
| 1–3  | Phase 1 – MatMul              | Tiled + vectorized matmul with pass pipeline |
| 4–5  | Phase 2 – Conv2D + Activation | Fused conv2d-relu with spatial tiling        |
| 6–7  | Phase 3 – LayerNorm / Softmax | Numerically stable tiled reduction           |
| 8–9  | Phase 4 – Subgraph Fusion     | Graph-level fusion with memory planning      |
| 10   | Wrap-up                       | Benchmarking, profiling, and write-up        |