Description:
The PyTorchSim MLIR backend does not support the repeat_interleave
operation, which is required for Grouped Query Attention (GQA) as used in
modern LLMs such as Mixtral 8x7B and Llama 2. In GQA, the number of KV heads
(n_local_heads) is smaller than the number of Q heads (n_head), and
repeat_interleave expands the K/V tensors to match Q's head count
(e.g., k.repeat_interleave(n_head // n_local_heads, dim=1)), as sketched below.
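
A minimal sketch of the failing pattern (tensor shapes are illustrative; only the repeat_interleave call itself is taken from the models above):

```python
import torch

bsz, n_head, n_local_heads, seqlen, head_dim = 1, 32, 8, 16, 128

# Q has n_head heads; K/V carry only n_local_heads heads under GQA.
q = torch.randn(bsz, n_head, seqlen, head_dim)
k = torch.randn(bsz, n_local_heads, seqlen, head_dim)
v = torch.randn(bsz, n_local_heads, seqlen, head_dim)

# Expand K/V along the head dimension (4x here) to match Q.
k = k.repeat_interleave(n_head // n_local_heads, dim=1)
v = v.repeat_interleave(n_head // n_local_heads, dim=1)
assert k.shape == q.shape  # (1, 32, 16, 128)
```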
The error occurs in PyTorchSimFrontend/mlir/mlir_common.py at lines 682-683,
in the extract_dividers function, which raises NotImplementedError("Not
supporting this view operation...!") whenever an index expression contains
multiple free symbols. repeat_interleave generates a FloorDiv indexing
pattern (e.g., index // 4 for 4x repetition) that the current implementation
cannot handle, as illustrated below.
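
To see where the FloorDiv comes from: repeat_interleave along a dimension maps output index i to input index i // repeats, so the lowering must handle a floor division inside the index expression. A short, self-contained check (repeats = 4 is illustrative):

```python
import torch

repeats = 4
x = torch.randn(2, 2, 3)  # (bsz, n_local_heads=2, head_dim)

out = x.repeat_interleave(repeats, dim=1)  # (2, 8, 3)

# Equivalent gather: output head i reads input head i // repeats.
# This floor division is the FloorDiv term that extract_dividers rejects.
idx = torch.arange(out.shape[1]) // repeats
assert torch.equal(out, x[:, idx, :])
```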
Current workaround: set n_local_heads = -1 (which defaults to n_head),
effectively disabling GQA and falling back to standard Multi-Head Attention
(MHA). This forfeits GQA's memory-efficiency benefit (a 4-8x KV cache
reduction); see the sketch below.
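
A hypothetical config sketch showing how the workaround behaves; the ModelArgs dataclass and its __post_init__ are assumptions in the style of gpt-fast-like model configs, not PyTorchSim's actual code:

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:  # hypothetical config class, not PyTorchSim's actual one
    n_head: int = 32
    n_local_heads: int = -1  # -1 means "default to n_head" (MHA, GQA disabled)

    def __post_init__(self):
        if self.n_local_heads == -1:
            self.n_local_heads = self.n_head

args = ModelArgs(n_head=32)               # workaround in effect
assert args.n_local_heads == args.n_head  # KV cache stays full size (no 4-8x saving)
```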
Affected models: Mixtral 8x7B, Llama 2 70B, and other models using GQA.