Draft pull request: changes from all 62 commits
5cddcb5
feat: Add model analysis and conversion framework with Transformers i…
antmikinka Mar 14, 2026
0aa1505
fix: Use Transformers integration for HF Hub models in gap analysis
antmikinka Mar 14, 2026
61fb52a
Fix CLI scan command to print summary directly from info object
antmikinka Mar 14, 2026
d890840
Remove silent AST scanner fallback from gap analysis
antmikinka Mar 14, 2026
6236d65
Fix gap analysis to properly detect sliding window as unsupported
antmikinka Mar 14, 2026
1bf709d
Add operator specification generator (#76)
antmikinka Mar 14, 2026
f3c30fe
Fix Transformers 5.x compatibility for multi-modal models (#77)
antmikinka Mar 14, 2026
b06fce7
Add operator creation guide and update README (#78)
antmikinka Mar 14, 2026
bc4cda2
Archive duplicate files from model_convert (#79)
antmikinka Mar 14, 2026
8a0fa4b
Consolidate model_analysis imports and improve documentation (#80)
antmikinka Mar 14, 2026
ef842ca
Add comprehensive data sources guide for operator creation (#81)
antmikinka Mar 15, 2026
ce9002e
Add master document generator for operator implementation (#82)
antmikinka Mar 15, 2026
c5818bd
Export generate_master_document in __init__.py (#82)
antmikinka Mar 15, 2026
ace8c76
Add Reduction operator for AIE2 and AIE2P (#83)
antmikinka Mar 15, 2026
154acc2
Add Conv2D operator for AIE2 and AIE2P (#84)
antmikinka Mar 15, 2026
aa1cbcd
Add MaxPool operator for AIE2 and AIE2P (#85)
antmikinka Mar 15, 2026
dc2039f
Add AveragePool operator for AIE2 and AIE2P (#86)
antmikinka Mar 15, 2026
11da5b6
Add Conv3D operator for AIE2 and AIE2P (#87)
antmikinka Mar 15, 2026
9023b4b
Fix syntax error in conv3d_bf16_large_kernel weight_idx calculation
antmikinka Mar 15, 2026
6c4f30d
Update CONV3D_STRATEGY.md to reflect completed implementation
antmikinka Mar 15, 2026
afcb559
Add conv3d_bf16_large_kernel for AIE2 architecture
antmikinka Mar 15, 2026
6364a54
Update CONV3D_STRATEGY.md for complete AIE2 large_kernel support
antmikinka Mar 15, 2026
ee61d48
Add conv3d_bf16_scalar for AIE2P architecture
antmikinka Mar 15, 2026
f3378e2
Update CONV3D_STRATEGY.md to reflect complete kernel parity
antmikinka Mar 15, 2026
46baf11
Add ONNX Runtime GenAI Windows backend for NPU runtime (Task #52)
antmikinka Mar 15, 2026
a69a610
Complete ONNX Runtime GenAI API implementation (Task #53)
antmikinka Mar 15, 2026
26a7bc9
Add Task #52 & #53 completion report
antmikinka Mar 15, 2026
556655b
Add IronServer C++ backend implementation and integration guide
antmikinka Mar 15, 2026
3027cf0
Add session summary for continuation session
antmikinka Mar 15, 2026
127304a
docs: Add comprehensive IronServer integration documentation
antmikinka Mar 15, 2026
9d24489
docs: Add Llama3.2 operator analysis and support plan
antmikinka Mar 16, 2026
4d642b9
feat: Phase 2 Baseline Complete - Benchmark Framework + Operator Impl…
antmikinka Mar 16, 2026
40a029c
feat: Phase 3 Week 1 complete - Foundation components for Llama3.2 in…
antmikinka Mar 16, 2026
6745eab
feat: Phase 3 Week 2 complete - Llama3.2 model config and weight loader
antmikinka Mar 16, 2026
904c8e6
docs: Update PROJECT_STATUS_TRACKER for Week 2 completion
antmikinka Mar 16, 2026
991dca7
feat: Phase 3 Week 3 generation infrastructure - STRUCTURE COMPLETE
antmikinka Mar 16, 2026
4cfc824
feat: Phase 3 Week 3 REMEDIATION COMPLETE - _forward_layer() implemented
antmikinka Mar 18, 2026
fe9a5d8
feat: Add block_size config for paged KV cache integration
antmikinka Mar 18, 2026
06f3bee
feat: Implement P0 benchmark regression fixes across 10 operator files
antmikinka Mar 18, 2026
eaeaab4
feat: P3 benchmark infrastructure complete - tile/column scaling stud…
antmikinka Mar 19, 2026
969594f
docs: Update .gitignore to exclude documentation and AI folders
antmikinka Mar 19, 2026
0b35142
fix: Gracefully skip NPU hardware tests when AIE toolchain unavailable
antmikinka Mar 19, 2026
36b9929
docs: Add cross-analysis verification report for comprehensive benchm…
antmikinka Mar 19, 2026
7fc8191
fix(p0-critical): Resolve severe performance regressions in 6 operators
antmikinka Mar 19, 2026
84b2333
fix(p1-high): Address bandwidth and stability regressions in 5 operators
antmikinka Mar 19, 2026
380714e
fix(p2-medium): Resolve stddev regressions in GEMM and GEMV operators
antmikinka Mar 19, 2026
6bdf735
fix(p1-high): Resolve AXPY 4-column 2-channel bandwidth regression
antmikinka Mar 19, 2026
5a0bd8d
docs: Update benchmark analysis tracking documentation
antmikinka Mar 19, 2026
c6d330f
docs: Add SWIGLU_DECODE fix plan documentation
antmikinka Mar 21, 2026
589a793
docs: Add SWIGLU_DECODE-FIX-PLAN.md to task tracking table
antmikinka Mar 21, 2026
82f3f14
fix(p2-medium): Add FIFO depth=3 for TANH 2-column stability
antmikinka Mar 21, 2026
b814d9e
docs: Update task tracking with TANH 2-column fix (Task #119)
antmikinka Mar 21, 2026
ef079f6
docs: Add TRANSPOSE fix status and update task tracking (Task #120)
antmikinka Mar 21, 2026
24fa898
fix(p1-high): Enhanced FIFO depth for WEIGHTED_RMS_NORM stability
antmikinka Mar 21, 2026
8cb875d
docs: Update task tracking with WEIGHTED_RMS_NORM fix (Task #121)
antmikinka Mar 21, 2026
64e745f
fix: Batch commit for 17 operator benchmark fixes
antmikinka Mar 21, 2026
ffd699d
chore: Apply Black formatting to Python files
antmikinka Mar 21, 2026
dae6f6c
fix: Critical import regression and numpy.softmax errors in generatio…
antmikinka Mar 21, 2026
fd7783c
fix(p0-critical): AXPY operator FIFO depth with tile_size_factor
antmikinka Mar 21, 2026
5ee11e3
fix(p1-high): DEQUANT operator FIFO depth with tile_size_factor
antmikinka Mar 21, 2026
63f0d6f
fix(p1-high): DEQUANT operator add large tile (>=2048) factor
antmikinka Mar 21, 2026
878d0e0
chore: Untrack agent docs and dev docs folders
antmikinka Mar 30, 2026
1 change: 1 addition & 0 deletions .clang-format
Original file line number Diff line number Diff line change
@@ -40,3 +40,4 @@ AllowAllParametersOfDeclarationOnNextLine: false
BinPackParameters: false
BinPackArguments: false
ConstructorInitializerAllOnOneLineOrOnePerLine: true
UseCRLF: true
5 changes: 5 additions & 0 deletions .gitignore
@@ -20,3 +20,8 @@ id_ed25519.pub
*.model
.cline_storage
*.egg-info

# Documentation and AI folders
docs/
chroma-data/
.claude/
349 changes: 349 additions & 0 deletions CONV3D_STRATEGY.md
@@ -0,0 +1,349 @@
<!--
SPDX-FileCopyrightText: Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Conv3D Strategy: Convolution as Compute Primitive for Text and Video Models

## Executive Summary

This document captures key insights about repurposing convolution operators (Conv2D, Conv3D) as **compute primitives** for both video and text models through strategic shape manipulation. Conv3D was identified as the critical next implementation for efficient LLM operations on AMD Ryzen AI NPUs, and has since been implemented for both AIE2 and AIE2P (see Sections 8 and 10).

---

## 1. Current Operator Status

| Operator | Status | AIE2 | AIE2P | Location |
|----------|--------|------|-------|----------|
| Conv2D | ✅ Complete | ✓ | ✓ | `iron/operators/conv2d/` |
| MaxPool2D | ✅ Complete | ✓ | ✓ | `iron/operators/maxpool/` |
| AveragePool2D | ✅ Complete | ✓ | ✓ | `iron/operators/avgpool/` |
| Reduction | ✅ Complete | ✓ | ✓ | `iron/operators/reduction/` |
| **Conv3D** | ✅ **Complete** | ✓ | ✓ | `iron/operators/conv3d/` |

### Original Request Completion Status

User's original list: **"CONVOLUTION, MAX POOL, AVERAGE POOL AND Reduction"**

- ✅ Convolution (Conv2D + Conv3D)
- ✅ Max Pool (2D)
- ✅ Average Pool (2D)
- ✅ Reduction (sum, mean, max, min)

---

## 2. Key Insight: Convolution as Compute Primitive

### 2.1 The Fundamental Realization

> **Convolution operators are not just for semantic convolution; they are COMPUTE PRIMITIVES that can be repurposed through shape manipulation.**

This insight transforms how we view Conv3D:
- **Before**: Conv3D = video model operator only
- **After**: Conv3D = 5D compute primitive for video + text models

### 2.2 Apple's Conv2D Trick (Proven Pattern)

Apple's Neural Engine uses this proven technique for Linear layers:

```
Original: (B, S, D) # Batch, Sequence, Hidden
Reshape: (B, D, 1, S) # Treat as image: (B, C, H, W)
Conv2D: kernel=(1,1) # Pointwise convolution = Matrix multiply
Output: (B, D_out, 1, S) # Result
Reshape: (B, S, D_out) # Back to sequence format
```

**Our Conv2D already supports this** via the `pointwise_conv2d_bf16_vector` kernel when `kernel_size=(1,1)`.
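The trick can be verified numerically. Below is a minimal NumPy sketch (independent of the IRON API; shapes and names are illustrative) showing that a 1×1 pointwise convolution over the reshaped tensor reproduces the Linear layer exactly:

```python
import numpy as np

B, S, D, D_out = 2, 5, 8, 6          # toy sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((B, S, D))
W = rng.standard_normal((D_out, D))  # Linear weight (out_features, in_features)

# Linear layer: y = x @ W.T, shape (B, S, D_out)
y_linear = x @ W.T

# Conv2D route: reshape to image layout (B, C=D, H=1, W=S)
x_img = x.transpose(0, 2, 1)[:, :, None, :]

# A 1x1 pointwise conv mixes only channels; written as an einsum it is
# a matrix multiply applied independently at every (h, w) position.
y_conv = np.einsum('bchw,oc->bohw', x_img, W)   # (B, D_out, 1, S)

# Reshape back to sequence format (B, S, D_out)
y_back = y_conv[:, :, 0, :].transpose(0, 2, 1)

assert np.allclose(y_linear, y_back)
```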

### 2.3 Extending to Conv3D for Text Models

The 5D structure of Conv3D naturally maps to blocked LLM tensor layouts:

#### MHA 5D Blocked Format
```
(B, G, H, S, D_h) where:
B = Batch
G = Groups (for Grouped Query Attention)
H = Heads per group
S = Sequence length (tiled)
D_h = Head dimension (e.g., 128)
```

#### Conv3D 5D Structure
```
(N, C, T, H, W) where:
N = Batch
C = Channels
T = Temporal/Depth
H = Height
W = Width
```

#### Proposed Mapping
| Conv3D | MHA | Use Case |
|--------|-----|----------|
| N | B | Batch processing |
| C | G | GQA groups |
| T | H | Heads per group |
| H | S_tiles | Sequence tiles |
| W | D_h_tiles | Head dimension tiles |
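Because the mapping is just a reinterpretation of axes, it amounts to a single reshape. A NumPy sketch of the bookkeeping (the GQA sizes here are hypothetical, not tied to a specific model):

```python
import numpy as np

B, num_heads, S, D_h = 1, 32, 128, 64
G = 8                        # GQA groups (hypothetical config)
H_per = num_heads // G       # heads per group

q = np.zeros((B, num_heads, S, D_h), dtype=np.float32)

# (B, num_heads, S, D_h) -> (B, G, H, S, D_h): split heads into GQA groups
q_5d = q.reshape(B, G, H_per, S, D_h)

# Reinterpret as Conv3D input (N, C, T, H, W) per the mapping table
N, C, T, H, W = q_5d.shape
assert (N, C, T) == (B, G, H_per)   # batch, groups, heads per group
assert (H, W) == (S, D_h)           # the sequence x head-dim plane
```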

---

## 3. Conv3D Implementation Strategy

### 3.1 Dual-Purpose Design

Conv3D must support two usage patterns:

#### Pattern A: Semantic Video Convolution
```python
# Standard video input: (N, C, T, H, W)
conv3d = AIEConv3d(
in_channels=64,
out_channels=128,
kernel_size=(3, 3, 3),
stride=(1, 2, 2),
padding=(1, 1, 1)
)
# Video classification, action recognition, etc.
```

#### Pattern B: Text Model Compute Primitive
```python
# MHA blocked format: (B, G, H, S_tiles, D_h_tiles)
conv3d = AIEConv3d(
in_channels=G, # Groups
out_channels=G, # Same groups
kernel_size=(1, 3, 3), # Process local S x D_h windows
stride=(1, 1, 1),
padding=(0, 1, 1)
)
# Reshape MHA tensors to 5D, apply Conv3D as attention primitive
```

### 3.2 Kernel Configurations

| Kernel Size | Use Case | Description |
|-------------|----------|-------------|
| (1, 1, 1) | Channel projection | Linear layer equivalent for 5D |
| (1, 3, 3) | Local attention | Windowed attention over S × D_h |
| (3, 3, 3) | Full 3D convolution | Video models, spatiotemporal |
| (1, 1, k) | Cross-head mixing | Mix information across heads |
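The (1, 1, 1) row generalizes the Conv2D trick to 5D: a pointwise Conv3D mixes only the channel axis, which is a Linear layer applied at every (t, h, w) position. A NumPy sketch of that equivalence (illustrative shapes only, not the operator's actual API):

```python
import numpy as np

N, C, C_out, T, H, W = 1, 8, 16, 2, 4, 4
rng = np.random.default_rng(1)
x = rng.standard_normal((N, C, T, H, W))
weight = rng.standard_normal((C_out, C))    # (out_channels, in_channels)

# (1,1,1) Conv3D = channel mixing at each (t, h, w) position
y_conv = np.einsum('ncthw,oc->nothw', x, weight)

# Same result via an explicit Linear: move channels last, matmul, move back
y_lin = np.moveaxis(np.moveaxis(x, 1, -1) @ weight.T, -1, 1)

assert np.allclose(y_conv, y_lin)
```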

### 3.3 Vectorization Strategy

Based on our existing patterns:

| Architecture | vec_factor | Kernel File |
|--------------|------------|-------------|
| AIE2 (NPU) | 8 | `aie_kernels/aie2/conv3d.cc` |
| AIE2P (NPU2) | 16 | `aie_kernels/aie2p/conv3d.cc` |

---

## 4. Shape Manipulation Patterns for Text Models

### 4.1 Tiling for NPU Efficiency

Standard PyTorch: `(B, S, D)`

NPU-optimized 5D: `(B, S_outer, S_inner, D_outer, D_inner)`

Where:
- `S_inner` = tile size (e.g., 32 for NPU vector width)
- `D_inner` = tile size (e.g., 32 or 64)

Example for Llama 3 (S=128, D=4096, tile=32):
```
Original: (1, 128, 4096)
5D Tiled: (1, 4, 32, 128, 32) # (B, S_outer, S_inner, D_outer, D_inner)
Permuted: (1, 4, 128, 32, 32) # For NPU memory layout
```
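Both steps of the Llama 3 example are pure view operations and round-trip losslessly. A NumPy sketch (the tile size is taken from the example above; the permutation shown is the example's, not a hardware requirement):

```python
import numpy as np

B, S, D, tile = 1, 128, 4096, 32
x = np.arange(B * S * D, dtype=np.float32).reshape(B, S, D)

# (1, 128, 4096) -> (1, 4, 32, 128, 32): (B, S_outer, S_inner, D_outer, D_inner)
x_5d = x.reshape(B, S // tile, tile, D // tile, tile)
assert x_5d.shape == (1, 4, 32, 128, 32)

# Permute for the NPU memory layout: (B, S_outer, D_outer, S_inner, D_inner)
x_npu = x_5d.transpose(0, 1, 3, 2, 4)
assert x_npu.shape == (1, 4, 128, 32, 32)

# The same permutation inverts itself (it only swaps axes 2 and 3),
# so the layout round-trips back to (B, S, D) without loss
x_back = x_npu.transpose(0, 1, 3, 2, 4).reshape(B, S, D)
assert np.array_equal(x, x_back)
```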

### 4.2 The Conv3D Trick Workflow

```
Step 1: Start with MHA tensors
Q, K, V: (B, num_heads, S, D_h)

Step 2: Reshape for GQA format
(B, G, H, S, D_h) where G = groups, H = heads_per_group

Step 3: Tile for NPU
(B, G, H, S_tiles, D_h_tiles) where tile_size matches NPU vector width

Step 4: Apply Conv3D with kernel (1, 3, 3)
Processes local 3x3 windows over (S × D_h) space
Efficient attention computation

Step 5: Collapse back to standard format
(B, num_heads * S, D_h) → project to output
```
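The shape bookkeeping for the workflow above can be sketched end to end in NumPy. The GQA and tile sizes are hypothetical, and Step 4 is stubbed with an identity since the actual Conv3D runs on the NPU:

```python
import numpy as np

B, num_heads, S, D_h = 1, 32, 64, 128
G, tile = 8, 32                 # hypothetical GQA groups / NPU tile size
H_per = num_heads // G

# Step 1: an MHA tensor (Q, K, or V)
q = np.zeros((B, num_heads, S, D_h), dtype=np.float32)

# Step 2: reshape to GQA format (B, G, H, S, D_h)
q = q.reshape(B, G, H_per, S, D_h)

# Step 3: confirm the (S, D_h) plane is tile-aligned for the NPU
assert S % tile == 0 and D_h % tile == 0

# Step 4: Conv3D with kernel (1, 3, 3), padding (0, 1, 1) would process
# local 3x3 windows over the (S x D_h) plane; identity stand-in here
out = q

# Step 5: collapse back toward the standard format
out = out.reshape(B, num_heads * S, D_h)
assert out.shape == (B, num_heads * S, D_h)
```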

---

## 5. Implementation Plan

### 5.1 Files to Create

```
iron/operators/conv3d/
├── __init__.py # Module exports
├── op.py # Main operator class (AIEConv3d)
├── design.py # MLIR generation (my_conv3d)
├── reference.py # CPU reference (torch.nn.Conv3d)
└── test.py # Pytest test suite

aie_kernels/aie2/conv3d.cc # AIE2 kernel (vec_factor=8)
aie_kernels/aie2p/conv3d.cc # AIE2P kernel (vec_factor=16)
```

### 5.2 Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Support 5D input (N, C, T, H, W) | Matches both video and blocked text formats |
| Separate kernels for depthwise/pointwise | Optimization paths like Conv2D |
| Configurable num_aie_columns (1-8) | Scale from NPU to NPU2 |
| Tile size parameter | Enable NPU memory optimization |
| Groups support | Enable GQA-style operations |

### 5.3 Kernel API Design

```cpp
// AIE2: vec_factor = 8
void conv3d_bf16_vector(
bfloat16* input, bfloat16* weight, bfloat16* output,
int N, int C, int T, int H, int W, // Input dimensions
int out_T, int out_H, int out_W, // Output dimensions
int kT, int kH, int kW, // Kernel sizes
int sT, int sH, int sW, // Strides
int pT, int pH, int pW, // Padding
int groups
);

// AIE2P: vec_factor = 16 (enhanced throughput)
void conv3d_bf16_vector_enhanced(...); // Same signature, optimized implementation
```
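The output dimensions passed to these kernels (`out_T`, `out_H`, `out_W`) follow the standard convolution size formula. A small Python helper (illustrative only, not part of the operator's actual API) that computes them per axis:

```python
def conv3d_out_shape(T, H, W, kernel, stride, padding):
    """Output size per dim: (in + 2*pad - kernel) // stride + 1."""
    kT, kH, kW = kernel
    sT, sH, sW = stride
    pT, pH, pW = padding
    out_T = (T + 2 * pT - kT) // sT + 1
    out_H = (H + 2 * pH - kH) // sH + 1
    out_W = (W + 2 * pW - kW) // sW + 1
    return out_T, out_H, out_W

# Pattern A example from Section 3.1: kernel (3,3,3), stride (1,2,2),
# padding (1,1,1) halves the spatial dims of a 16x112x112 clip
assert conv3d_out_shape(16, 112, 112, (3, 3, 3), (1, 2, 2), (1, 1, 1)) == (16, 56, 56)
```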

---

## 6. After Conv3D: Related Operators

With Conv3D complete, consider these extensions:

| Operator | Purpose | Priority |
|----------|---------|----------|
| Conv3DTranspose | Video generation, decoding | Medium |
| MaxPool3D / AveragePool3D | Video downsampling | Low |
| Attention-specific kernels | Dedicated MHA optimization | High |
| Shape manipulation utilities | Reshape/permute helpers | High |

---

## 7. Implementation Steps (original plan, since completed)

1. **Implement Conv3D operator** (`iron/operators/conv3d/`)
- Follow established pattern from Conv2D
- Support both semantic and compute-primitive use cases

2. **Create AIE2/AIE2P kernels** (`aie_kernels/*/conv3d.cc`)
- vec_factor=8 for AIE2
- vec_factor=16 for AIE2P

3. **Update exports and documentation**
- Add to `iron/operators/__init__.py`
- Update README.md operator dashboard

4. **Test with both use cases**
- Video convolution (semantic)
- Shape-manipulated text operations (compute primitive)

---

## 8. Verification Checklist

- [x] Conv3D op.py follows Conv2D pattern
- [x] design.py generates correct MLIR for 5D tensors
- [x] Kernels use correct vec_factor per architecture (8 for AIE2, 16 for AIE2P)
- [x] Test suite covers both video and text use cases
- [x] README.md updated with Conv3D entry
- [x] `__init__.py` exports AIEConv3d
- [x] Kernel files created for both AIE2 and AIE2P
- [x] Syntax errors fixed and verified

### Verification Summary (Completed)

All Conv3D implementation files have been verified:

| File | Status | Notes |
|------|--------|-------|
| `iron/operators/conv3d/op.py` | ✅ | Correct buffer calculations, kernel selection logic |
| `iron/operators/conv3d/design.py` | ✅ | 21 parameters match C++ signatures |
| `iron/operators/conv3d/reference.py` | ✅ | Uses torch.nn.functional.conv3d |
| `iron/operators/conv3d/test.py` | ✅ | Parametrized tests for all configurations |
| `iron/operators/conv3d/__init__.py` | ✅ | Exports AIEConv3d |
| `aie_kernels/aie2/conv3d.cc` | ✅ | vec_factor=8, 5 kernel variants (incl. scalar, large_kernel) |
| `aie_kernels/aie2p/conv3d.cc` | ✅ | vec_factor=16, 5 kernel variants (incl. scalar, large_kernel) |

---

## 9. References

### Internal Documentation
- [`iron/operators/conv2d/`](./iron/operators/conv2d/) - Conv2D implementation reference
- [`iron/operators/conv3d/`](./iron/operators/conv3d/) - Conv3D implementation (complete)
- [`iron/operators/reduction/`](./iron/operators/reduction/) - Reduction implementation
- [README.md](./README.md) - Operator dashboard

### External References
- Apple CoreML Conv2D trick for Linear layers
- Qualcomm Hexagon 5D/6D tiled layouts
- Huawei Ascend 5D fractal format
- Grouped Query Attention (GQA) in Llama 3, Mistral

---

## 10. Implementation Complete - Summary

The Conv3D operator has been fully implemented and verified for both AIE2 (NPU) and AIE2P (NPU2) architectures.

### Key Achievements

1. **Dual-Purpose Design**: Conv3D supports both:
- Semantic video convolution (standard 5D tensors)
- Compute primitive for text models (via shape manipulation)

2. **Kernel Variants** (both AIE2 and AIE2P - complete parity):
- `conv3d_bf16_vector` - Standard vectorized convolution
- `conv3d_bf16_scalar` - Scalar reference implementation (both architectures)
- `depthwise_conv3d_bf16_vector` - Channel-wise convolution
- `pointwise_conv3d_bf16_vector` - 1x1x1 convolution (Linear layer equivalent)
- `conv3d_bf16_large_kernel` - Optimized for large kernels

3. **Architecture Support**:
- AIE2 (NPU): 4x4 array, vec_factor=8
- AIE2P (NPU2): 4x8 array, vec_factor=16

4. **Configuration Flexibility**:
- Configurable kernel_size, stride, padding (temporal, height, width)
- Grouped convolution support (including depthwise)
- Optional bias
- Scalable column allocation (1-8 columns)

### Next Steps

With Conv3D complete, the IRON project now has a comprehensive set of operators for both video and text model inference on AMD Ryzen AI NPUs. The Conv3D operator enables:

- Video understanding models (video classification, action recognition)
- Compute primitives for LLM operations via shape manipulation
- Foundation for custom attention mechanisms
- Building block for 3D vision transformers

---

<p align="center">
Copyright&copy; 2025 Advanced Micro Devices, Inc
</p>