feat: integrate Llama.cpp and enhance engine stability for cross-platform usage #616
krishjp wants to merge 19 commits into PrunaAI:main
Conversation
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| Security | 1 critical |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 152 |
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks.
Hi @llcnt and @gsprochette! Here is an updated draft PR to replace #584.
@cursor review |
krishjp force-pushed from 134cf0f to 09789d0
llcnt
left a comment
Thank you for the improved version of the PR!
We are definitely very close to the final step :)
…device checks for llama-cpp models due to a lack of model.parameters() support
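The lack of model.parameters() on llama-cpp models motivates a tolerant device check. Below is a hedged sketch of what such a check could look like; pruna's actual get_device (named in a later commit) likely differs in detail.

```python
# Hedged sketch of a parameters()-tolerant device check; illustrative only.
def get_device(model) -> str:
    """Best-effort device lookup for models without torch parameters."""
    if not hasattr(model, "parameters"):
        # llama-cpp models expose no .parameters(); fall back to an
        # explicit device attribute, defaulting to CPU.
        return str(getattr(model, "device", "cpu"))
    try:
        return str(next(model.parameters()).device)
    except StopIteration:
        return "cpu"
```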
…on 3.13
- addressed functools.partial object compatibility with py 3.13
- integrated enum.member() in SAVE_FUNCTIONS and LOAD_FUNCTIONS
- updated the LlamaCpp algorithm implementation to utilize the standardized naming convention
- cleaned up redundant commented-out logic in the save_pruna_model function

Verified through restoration of LlamaCpp integration tests and diagnostic scripts confirming Enum member registration.
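For illustration, here is a minimal sketch of the enum.member() pattern this commit describes; the member and callable names are hypothetical, not pruna's actual registry.

```python
# Sketch of callable Enum members surviving Python 3.13; names are illustrative.
import enum
import sys
from functools import partial

def save_gguf(model, path):
    """Hypothetical save routine standing in for a real engine function."""
    ...

class SAVE_FUNCTIONS(enum.Enum):
    if sys.version_info >= (3, 11):
        # On Python 3.13, functools.partial implements the descriptor
        # protocol, so a bare partial in an Enum body is treated as a
        # method; enum.member() forces it to be registered as a member.
        llama_cpp = enum.member(partial(save_gguf))
    else:
        # Backward-compatible fallback: on older Pythons, partials are
        # not descriptors and become members automatically.
        llama_cpp = partial(save_gguf)

# The wrapped partial is available as the member's value:
# SAVE_FUNCTIONS.llama_cpp.value(model, "model.gguf")
```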
…form usage
- standardized LlamaCpp implementation and naming conventions within the engine
- implemented cache directory cleanup to prevent shutdown errors on Windows
- added a save() alias to the base model wrapper for improved API consistency
- updated project configuration with Llama.cpp and dependency group
- benchmarked using SmolLM2-135M-Instruct with q4_k_m quantization
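As a small illustration of the save() alias mentioned above, a hedged sketch follows; the wrapper class name and the save_pretrained delegation are assumptions about pruna's base model wrapper.

```python
# Minimal sketch of a save() alias on a model wrapper; names are assumptions.
class ModelWrapper:
    def save_pretrained(self, path: str) -> None:
        ...  # existing save logic

    def save(self, path: str) -> None:
        """Alias for save_pretrained(), for API consistency."""
        self.save_pretrained(path)
```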
- added Int class for integer-based configuration
- updated get_device and model_checks for llama_cpp
- implemented secure conversion script caching
- enabled TestLlamaCpp and removed manual test overrides
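The "secure conversion script caching" item could work roughly as sketched below: download the pinned convert_hf_to_gguf.py once, then verify its SHA256 before every use. The URL matches the pinned-script diff later in this thread; the function name and the placeholder hash are illustrative.

```python
# Hedged sketch of verify-then-cache for the pinned conversion script.
import hashlib
import urllib.request
from pathlib import Path

LLAMA_CPP_CONVERSION_SCRIPT_URL = (
    "https://raw.githubusercontent.com/ggml-org/llama.cpp/b3600/convert_hf_to_gguf.py"
)
EXPECTED_SHA256 = "<pinned sha256 of convert_hf_to_gguf.py>"  # placeholder

def fetch_conversion_script(cache_dir: Path) -> Path:
    """Download the pinned script once, then verify it on every use."""
    script = cache_dir / "convert_hf_to_gguf.py"
    if not script.exists():
        urllib.request.urlretrieve(LLAMA_CPP_CONVERSION_SCRIPT_URL, script)
    digest = hashlib.sha256(script.read_bytes()).hexdigest()
    if digest != EXPECTED_SHA256:
        script.unlink()  # refuse to run a tampered or truncated script
        raise RuntimeError("SHA256 mismatch for convert_hf_to_gguf.py")
    return script
```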
* feat: initial implementation for rapidata
* ci: add rapidata dependency and some cleanup
* Guard optional rapidata metric import and tighten validation (applied via @cursor push command)
* refactor: address PR comments
* feat: add polling and address further PR comments
* refactor: add mixin for setting context
* ci: add evaluation as an umbrella dep
* refactor: address PR comments
* ci: separate rapidata matrix
* fix: minor issues
* ci: make tests import safe

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
…llama.cpp compatibility
* build: bump max python to 3.13
* build: isolate realesrgan in an extra because no 3.13 basicsr wheels are available
Hi @llcnt, I made some updates based on your comments. Take a look when you get a chance and let me know if you spot anything that still needs work. Cheers!
@cursor review |
llcnt
left a comment
Only a few small changes to make, and we are good to go :)
- check the cursor comments
- change the tag of convert file (use a newer one and test it)
- make sure the user has added the tokenizer to the smash_config before smashing the model
Thanks !!
```python
from pruna.logging.logger import pruna_logger


# SHA256 hash for the pinned version (b3600) of convert_hf_to_gguf.py
LLAMA_CPP_CONVERSION_SCRIPT_URL = "https://raw.githubusercontent.com/ggml-org/llama.cpp/b3600/convert_hf_to_gguf.py"
```
can we pin a newer tag? The tag b3600 is very old and does not work on newer models (e.g. Qwen3)
| """Save HF model and convert it to GGUF format.""" | ||
| with tempfile.TemporaryDirectory(dir=str(temp_dir)) as hf_model_dir: | ||
| model.save_pretrained(hf_model_dir) | ||
| if hasattr(smash_config, "tokenizer") and smash_config.tokenizer: |
could we add an else condition here to explain to the user that they must have run smash_config.add_tokenizer("model_id") before, otherwise it will fail?
Or we could even run smash_config.add_tokenizer("model_id") automatically if the tokenizer is not already present in the smash_config?
Got it! I added a ValueError for that scenario and an auto-add_tokenizer flow to catch this.
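For readers following along, a hedged sketch of such a ValueError / auto-add_tokenizer flow; smash_config.add_tokenizer comes from the review thread, while resolving the model id from model.config.name_or_path is an assumption.

```python
# Illustrative tokenizer guard for the save flow; not pruna's exact code.
def ensure_tokenizer(model, smash_config) -> None:
    if getattr(smash_config, "tokenizer", None):
        return  # tokenizer already registered
    model_id = getattr(getattr(model, "config", None), "name_or_path", None)
    if model_id:
        # Auto-add the tokenizer when it can be resolved from the model.
        smash_config.add_tokenizer(model_id)
    else:
        raise ValueError(
            "No tokenizer found in smash_config; run "
            "smash_config.add_tokenizer('<model_id>') before smashing."
        )
```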
…acks
- Pinned convert_hf_to_gguf.py to tag b8958
- Added automated model tokenizer resolution logic
- Introduced .get() operators to SmashConfig wrappers
- Refactored pyproject.toml
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 0711b04.
llcnt
left a comment
minor comments added: resolve them and we are good to gooooo! :)
```python
from pruna.engine.utils import verify_sha256
from pruna.logging.logger import pruna_logger


# SHA256 hash for the pinned version (b3600) of convert_hf_to_gguf.py
```
very minor comment: update with the new tag b8958 ;)
```python
        actual_key = self._prefix + key
        return self._base_config[actual_key]

    def get(self, key: str, default: Any = None) -> Any:
```
get() function defined twice: here and at line 559. I guess some duplicated code
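For context, a self-contained sketch of the prefixed get() shown in the diff, assuming a dict-like base config; in the fix, one of the two duplicate definitions would simply be dropped.

```python
# Sketch of a prefixed config view with a dict-style get(); names assumed.
from typing import Any

class PrefixedConfigView:
    def __init__(self, base_config: dict, prefix: str) -> None:
        self._base_config = base_config
        self._prefix = prefix

    def __getitem__(self, key: str) -> Any:
        actual_key = self._prefix + key
        return self._base_config[actual_key]

    def get(self, key: str, default: Any = None) -> Any:
        """Return the prefixed key's value, or default if it is absent."""
        try:
            return self[key]
        except KeyError:
            return default
```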
| "vllm>=0.16.0", | ||
| "ray", | ||
| ] | ||
| llamacpp = [ |
Hi @krishjp, thank you for your contribution :) I just left a comment on another PR that, like yours, rightfully adds an extra to the pyproject.toml. There is some config to add to the workflows so the tests can run, so I'll link the comment here and you can follow it by replacing all "kvpress" occurrences with "llamacpp" :)
https://github.com/PrunaAI/pruna/pull/623/changes#r3160069151
As stated in this comment, please let me know about any difficulty concerning these steps :) (pinging @begumcig, who is also working on this)
Hello @krishjp, just wanted to let you know that we made the process of adding an extra simpler in #653, so there are now fewer steps. You can find the instructions for what you need to do in #654: the line you have here corresponds to the first sentence "Add a new extra [...]", so you should still:
- "Define the dependency group [...]"
- "[...] set required_install on the algorithm class [...]"
- "Finally, register a matching requires_<extra> [...]"

Thanks in advance, and please let us know if these steps are unclear so we can improve the documentation :)
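A heavily hedged sketch of the required_install step quoted above; the attribute name comes from the linked instructions (#654), but the base class and exact wiring in pruna may differ.

```python
# Illustrative only: how an algorithm class might declare its extra.
class LlamaCppQuantizer:
    algorithm_name = "llama_cpp"
    # Pip extra that must be installed (pip install pruna[llamacpp]);
    # a matching requires_llamacpp marker would gate the tests.
    required_install = "llamacpp"
```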

Description
This PR integrates the Llama.cpp quantizer engine into Pruna, enabling GGUF-based quantization. In addition to the new feature, this PR addresses critical compatibility issues for Python 3.13 and improves cross-platform robustness on Windows.
Key Changes:
- Integrated `llama-cpp-python` as a new quantizer backend, supporting various GGUF quantization methods (e.g., `q4_k_m`).
- Fixed a `KeyError` in `SAVE_FUNCTIONS` and `LOAD_FUNCTIONS` by explicitly using `enum.member()` for callable members (with a backward-compatible fallback for older Python versions).
- Implemented cache directory cleanup to prevent an `AttributeError` during interpreter shutdown on Windows.
- Added a `llamacpp` optional dependency group and updated the `full` extra in `pyproject.toml`.

Related Issue
Fixes #377
Related PRs
#583 - takes a more general look at the enum modification
Type of Change
How Has This Been Tested?
- Verified `Enum` member registration for engine save/load functions.
- Benchmarked `SmolLM2-135M-Instruct` using llama.cpp `q4_k_m` quantization.

Checklist
Additional Notes
The `TypeError` occasionally observed during `llama-cpp-python` shutdown is a known upstream issue in their `__del__` implementation during interpreter termination and does not affect the performance or correctness of the Smash/Save operations.