Qualcomm AI Engine Direct - Refactor llama runner for dynamic IO dtypes (pytorch#19146)
### Summary
To enable GPU backend support in the Llama runner, refactoring is
required because the dtypes of kv_cache, attention_mask, and logits are
currently hardcoded, preventing floating‑point models from running.
This PR removes the hardcoded dtypes for these tensors.
#### Key changes
- Remove the template parameter <typename T> from KVManager, LhdTokenGenerator, MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at construction time instead of relying on compile-time bitwidth detection
- Switch to std::byte* pointer arithmetic with getDtypeSize() for all buffer offsets; add a fill_mask() helper for multi-dtype attention mask filling
- Update the spec_prop pass for the custom llama op to handle the case where the number of shards is greater than 1
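
For illustration only, the byte-offset and mask-filling changes listed above could look roughly like the sketch below. It is not the code from this PR: the `DType` enum, the `elementPtr()` helper, and the signatures of `getDtypeSize()` and `fill_mask()` are assumptions, and the fill values (all ones for "attend" in quantized masks, 0 / -65504 for float masks) are illustrative choices rather than the runner's confirmed constants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the runtime's scalar type enum.
enum class DType { UInt8, UInt16, Float16, Float32 };

// Element size in bytes; an assumed shape for a getDtypeSize()-style helper.
inline std::size_t getDtypeSize(DType dtype) {
  switch (dtype) {
    case DType::UInt8:   return 1;
    case DType::UInt16:  return 2;
    case DType::Float16: return 2;
    case DType::Float32: return 4;
  }
  throw std::invalid_argument("unsupported dtype");
}

// Advance a raw buffer by `index` elements of a dtype known only at runtime.
// This replaces typed pointer arithmetic such as `T* p = base + index`.
inline std::byte* elementPtr(std::byte* base, std::size_t index, DType dtype) {
  return base + index * getDtypeSize(dtype);
}

// Write `count` copies of `value` starting at `dst`, using memcpy so no
// alignment or aliasing assumptions are made about the byte buffer.
template <typename V>
void fillTyped(std::byte* dst, std::size_t count, V value) {
  for (std::size_t i = 0; i < count; ++i) {
    std::memcpy(dst + i * sizeof(V), &value, sizeof(V));
  }
}

// Fill mask elements [start, start + count) for any supported dtype.
// Assumed convention: quantized masks use max/0, float masks use 0/-65504.
inline void fill_mask(std::byte* mask, std::size_t start, std::size_t count,
                      DType dtype, bool attend) {
  std::byte* dst = elementPtr(mask, start, dtype);
  switch (dtype) {
    case DType::UInt8:
      fillTyped<std::uint8_t>(dst, count, attend ? 255 : 0);
      break;
    case DType::UInt16:
      fillTyped<std::uint16_t>(dst, count, attend ? 65535 : 0);
      break;
    case DType::Float16:
      // IEEE 754 half bit patterns: 0x0000 is 0.0, 0xFBFF is -65504.
      fillTyped<std::uint16_t>(dst, count, attend ? 0x0000 : 0xFBFF);
      break;
    case DType::Float32:
      fillTyped<float>(dst, count, attend ? 0.0f : -65504.0f);
      break;
  }
}
```

Dispatching on a runtime dtype inside one helper keeps the callers free of template parameters, which is the property the refactor is after: the same runner binary can serve quantized and floating-point models.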
### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android --device acfa9311 --executorch_root . --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16
```
<img width="1977" height="468" alt="image"
src="https://github.com/user-attachments/assets/8bf3bffa-9b9f-4655-9cbc-b20127c2468a"
/>
cc @cccclai @cbilgin @abhinaykukkadapu