Conversation
Hi Evan, I tested this PR on three HuggingFace models (Qwen2.5-0.5B, Qwen3-0.6B, Llama-3.2-1B) with the following results:
| Model | PyTorch | iattention (main) | iattention (PR+mask) | iattention (is_causal) | sdpa no cache | sdpa static_v1 | plugin |
|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | 4743 | 5421 (0.88x) | 5633 (0.84x) | 2303 (2.06x) | 3271 (1.5x) | 1238 (3.8x) | 421 (11.3x) |
| Qwen3-0.6B | 6891 | 6792 (1.01x) | — | 3097 (2.22x) | 4031 (1.7x) | 1708 (4.0x) | 569 (12.1x) |
| Llama-3.2-1B | 7064 | 8283 (0.85x) | — | 4426 (1.60x) | 5466 (1.3x) | 1379 (5.1x) | 465 (15.2x) |
Lower numbers are better; parentheses show speedup vs PyTorch.
The is_causal=True approach makes iattention 1.6–2.2x faster than PyTorch and even faster than sdpa no cache (which goes through a separate graph-level lowering pass to achieve the same mask-to-causal conversion).
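Why discarding a causal attn_bias is safe here can be shown directly: adding a lower-triangular additive mask (zeros on/below the diagonal, -inf above) before the softmax produces exactly the same output as letting the kernel enforce causality via is_causal. A minimal numpy sketch (illustrative only, not the converter code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sdpa(q, k, v, mask=None, is_causal=False):
    # Scaled dot-product attention (2-D, single head) -- a numpy sketch,
    # not the actual IAttention converter code.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if is_causal:
        # Built-in causality: forbid attending to future key positions.
        mask = np.triu(np.full(scores.shape, -np.inf), k=1)
    if mask is not None:
        scores = scores + mask  # additive attn_bias, as HuggingFace passes it
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
causal_bias = np.triu(np.full((5, 5), -np.inf), k=1)
# Explicit causal attn_bias and is_causal=True give identical outputs:
assert np.allclose(sdpa(q, k, v, mask=causal_bias), sdpa(q, k, v, is_causal=True))
```

This equivalence only holds for the causal pattern; for an arbitrary attn_bias the mask carries information that is_causal cannot reproduce, which is the trade-off discussed below.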
Trade-off discussion
However, unconditionally discarding the mask means the converter can no longer support arbitrary (non-causal) masks. The existing tests in test_attention_aten.py pass random float tensors as attn_bias, which would produce incorrect results if the mask were discarded.
If the iattention backend is intended primarily for HuggingFace LLM inference (where attn_bias is always a causal mask), then discarding the mask is the right optimization. But if the converter needs to remain general-purpose for arbitrary masks, we need a different approach — perhaps detecting the causal mask pattern at the graph level, or providing a flag (e.g., decompose_attention) to let users control this behavior.
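To make the graph-level detection idea concrete: a causal attn_bias is exactly zero on and below the diagonal and -inf (or a very large negative value) strictly above it. A hypothetical checker along these lines (the name and threshold are illustrative, not part of this PR):

```python
import numpy as np

def looks_like_causal_mask(mask: np.ndarray, neg_threshold: float = -1e9) -> bool:
    # Hypothetical helper, not from the PR: recognize an additive causal mask,
    # i.e. zeros on/below the diagonal, very negative strictly above it.
    if mask.ndim != 2:
        return False
    below = np.tril(np.ones(mask.shape, dtype=bool))  # on/below the diagonal
    return bool(np.all(mask[below] == 0) and np.all(mask[~below] <= neg_threshold))

causal = np.triu(np.full((4, 4), -np.inf), k=1)
assert looks_like_causal_mask(causal)                # true causal pattern
assert not looks_like_causal_mask(np.zeros((4, 4)))  # all-zero bias is not causal
```

In the real graph the check would run against the constant-folded mask tensor at lowering time, so arbitrary runtime masks would simply fail the check and pass through untouched.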
What are your thoughts on how to proceed?
After further investigation, I think adding this in a lowering pass is the way to go:

```python
if mask_tensor is not None:
    mask_tensor = None
    use_causal = True
```

This keeps the converter general-purpose (arbitrary masks still work when no lowering pass is registered, e.g., in unit tests), while getting the optimal IAttention performance for LLM inference. It also feels cleaner than unconditionally discarding the mask inside the converter itself. Note: the dynamic head dimension fix from Issue 1 is still needed in the converter regardless. What do you think?
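The division of labor sketched above — pass strips the mask, converter falls back to is_causal — could look roughly like this (names are illustrative, not the real Torch-TensorRT API):

```python
def lower_sdpa_mask(mask_tensor, detect_causal=None):
    # Hypothetical lowering-pass step: if a registered detector recognizes
    # a causal pattern, drop the mask and let the kernel handle causality.
    # With no detector registered (e.g., in unit tests), the mask passes
    # through unchanged and the converter keeps its general mask path.
    use_causal = False
    if mask_tensor is not None and detect_causal is not None and detect_causal(mask_tensor):
        mask_tensor = None
        use_causal = True
    return mask_tensor, use_causal

# Causal mask detected -> stripped, is_causal set:
assert lower_sdpa_mask("causal_mask", lambda m: True) == (None, True)
# No detector registered -> arbitrary mask preserved:
assert lower_sdpa_mask("random_mask") == ("random_mask", False)
```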
Description
Support `attn_bias` for efficient SDPA

Fixes #4129
Type of change
Checklist: