
feat: integrate Llama.cpp and enhance engine stability for cross-platform usage#616

Open
krishjp wants to merge 19 commits into PrunaAI:main from krishjp:feat/llama-cpp

Conversation

@krishjp

@krishjp krishjp commented Apr 6, 2026

Description

This PR integrates the Llama.cpp quantizer engine into Pruna, enabling GGUF-based quantization. In addition to the new feature, this PR addresses critical compatibility issues for Python 3.13 and improves cross-platform robustness on Windows.

Key Changes:

  • Engine Support: Integrated llama-cpp-python as a new quantizer backend, supporting various GGUF quantization methods (e.g., q4_k_m).
  • Python 3.13 Compatibility: Fixed a KeyError in SAVE_FUNCTIONS and LOAD_FUNCTIONS by explicitly wrapping callable members with enum.member() (with a backward-compatible fallback for older Python versions); see the sketch below this list.
  • Stability: Implemented safer cache-directory cleanup in SmashConfig to prevent an AttributeError during interpreter shutdown on Windows.
  • Consistency: Added a save() alias to PrunaModel to match save_pretrained() and ensured consistent attribute delegation for non-torch backends.
  • Dependencies: Added the llamacpp optional dependency group and updated the full extra in pyproject.toml.
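
For illustration, the enum.member() change looks roughly like this. This is a minimal sketch, not Pruna's actual registry: the member name and the save helper are placeholders.

```python
import enum
import sys
from functools import partial

def save_torch_model(model, path, device="cpu"):  # placeholder save helper
    ...

# On Python 3.13, a bare functools.partial assigned in the Enum body is no
# longer registered as a member, so name lookup raised KeyError. Wrapping it
# with enum.member() (available since Python 3.11) forces member registration;
# older interpreters keep the previous behaviour via the identity fallback.
_member = enum.member if sys.version_info >= (3, 11) else (lambda value: value)

class SAVE_FUNCTIONS(enum.Enum):
    pytorch = _member(partial(save_torch_model, device="cpu"))

# Name lookup works again on 3.13, and the wrapped callable sits in .value:
SAVE_FUNCTIONS["pytorch"].value(model=None, path="model.bin")
```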

Related Issue

Fixes #377

Related PRs

#583 - takes a more general look at the enum modification

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Integration Tests: Verified that the TestLlamaCpp suite (llama_cpp.py) passes on Windows with Python 3.12 and 3.13.
  • Diagnostic Scripts: Confirmed correct Enum member registration for the engine save/load functions.
  • Local Benchmarking Script: Smashed SmolLM2-135M-Instruct with llama.cpp q4_k_m quantization (see the sketch below this list).
    • Compression: 4.88x reduction in model size.
    • Speedup: 4.14x faster inference (tokens/sec) on CPU.
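
For reference, the benchmarking flow follows the standard Pruna workflow. A minimal sketch, assuming the quantizer is registered under the name "llama_cpp"; the GGUF-level hyperparameter key used here is hypothetical:

```python
from pruna import SmashConfig, smash
from transformers import AutoModelForCausalLM

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)

smash_config = SmashConfig()
smash_config.add_tokenizer(model_id)             # needed for the GGUF conversion step
smash_config["quantizer"] = "llama_cpp"          # name assumed from this PR
smash_config["llama_cpp_quant_type"] = "q4_k_m"  # hypothetical hyperparameter key

smashed_model = smash(model=model, smash_config=smash_config)
smashed_model.save_pretrained("smollm2-135m-q4_k_m")  # or the new save() alias
```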

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

The TypeError occasionally observed during llama-cpp-python shutdown is a known upstream issue in their __del__ implementation during interpreter termination and does not affect the performance or correctness of the Smash/Save operations.
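
For reference, the shutdown-safe cleanup mentioned under Stability follows roughly this pattern. This is a minimal sketch; the method and attribute names are assumptions, not the exact SmashConfig code:

```python
import shutil

def _cleanup_cache_dir(self) -> None:
    """Best-effort removal of the temporary cache directory."""
    cache_dir = getattr(self, "cache_dir", None)
    # During interpreter shutdown on Windows, module globals may already be
    # torn down, so guard every lookup instead of letting AttributeError escape.
    if cache_dir is None or shutil is None:
        return
    try:
        shutil.rmtree(cache_dir, ignore_errors=True)
    except (AttributeError, OSError):
        pass
```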

@codacy-production

codacy-production Bot commented Apr 6, 2026

Not up to standards ⛔

🔴 Issues: 1 new issue — 1 critical (Security). Alert: exceeds the threshold of 0 issues of at least minor severity.

🟢 Metrics: Complexity 152

View in Codacy


@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

@krishjp krishjp changed the title from "Feat/llama cpp" to "feat: integrate Llama.cpp and enhance engine stability for cross-platform usage" Apr 6, 2026
@krishjp
Author

krishjp commented Apr 6, 2026

Hi @llcnt and @gsprochette! Here is an updated draft PR to replace #584.
I'm looking at the last few Codacy issues that were brought up, but the main codebase changes should be in place. ruff check also flagged some fixes from older commits, so those are included here as well.

@krishjp
Author

krishjp commented Apr 7, 2026

@cursor review

Comment thread src/pruna/engine/save.py Outdated
@krishjp krishjp marked this pull request as ready for review April 7, 2026 15:48
@krishjp krishjp force-pushed the feat/llama-cpp branch 2 times, most recently from 134cf0f to 09789d0, April 7, 2026 16:01
Collaborator

@llcnt llcnt left a comment

Thank you for the improved version of the PR!
We are definitely very close to the final step :)

Comment thread src/pruna/algorithms/llama_cpp.py Outdated
Comment thread src/pruna/algorithms/llama_cpp.py Outdated
Comment thread src/pruna/algorithms/llama_cpp.py Outdated
Comment thread src/pruna/engine/pruna_model.py Outdated
Comment thread src/pruna/engine/load.py
Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/engine/save.py
krishjp and others added 11 commits April 20, 2026 15:50
…device checks for llama-cpp models due to a lack of model.parameters() support
…on 3.13

- addressed functools.partial object compatibility with py 3.13
- integrated enum.member() in SAVE_FUNCTIONS and LOAD_FUNCTIONS
- updated the LlamaCpp algorithm implementation to utilize the standardized
  naming convention.
- cleaned up redundant commented-out logic in the save_pruna_model function.

Verified through restoration of LlamaCpp integration tests and diagnostic
scripts confirming Enum member registration.
…form usage

- standardized LlamaCpp implementation and naming conventions within the engine
- implemented cache directory cleanup to prevent shutdown errors on Windows
- added a save() alias to the base model wrapper for improved API consistency
- updated project configuration with Llama.cpp and dependency group
- benchmarked using SmolLM2-135M-Instruct with q4_k_m quantization
- added Int class for integer-based configuration.
- updated get_device and model_checks for llama_cpp.
- implemented secure conversion script caching.
- enabled TestLlamaCpp and removed manual test overrides.
begumcig and others added 4 commits April 21, 2026 08:40
* feat: initial implementation for rapidata

* ci: add rapidata dependency and some cleanup

* Guard optional rapidata metric import and tighten validation

Applied via @cursor push command

* refactor: address PR comments

* feat: add polling and address further PR comments

* refactor: add mixin for setting context

* ci: add evaluation as an umbrella dep

* refactor: address PR comments

* ci: separate rapidata matrix

* fix: minor issues

* ci: make tests import safe

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
* build: bump max python to 3.13

* build: isolate realesrgan in an extra because no 3.13 basicsr wheels are available
@krishjp
Author

krishjp commented Apr 21, 2026

Hi @llcnt, I made some updates to address your comments. Take a look when you get a chance and let me know if you spot anything that still needs changes. Cheers!

@krishjp krishjp requested a review from llcnt April 21, 2026 21:07
@llcnt
Collaborator

llcnt commented Apr 22, 2026

@cursor review

Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread pyproject.toml Outdated
Collaborator

@llcnt llcnt left a comment

Only a few small changes to make, and we are good to go :)

  • check the cursor comments
  • change the tag of the convert file (use a newer one and test it)
  • make sure the user has added the tokenizer to the smash_config before smashing the model

Thanks !!

Comment thread src/pruna/algorithms/llama_cpp.py Outdated
from pruna.logging.logger import pruna_logger

# SHA256 hash for the pinned version (b3600) of convert_hf_to_gguf.py
LLAMA_CPP_CONVERSION_SCRIPT_URL = "https://raw.githubusercontent.com/ggml-org/llama.cpp/b3600/convert_hf_to_gguf.py"
Collaborator

Can we pin a newer tag? The tag b3600 is very old and does not work on newer models (e.g. Qwen3).

"""Save HF model and convert it to GGUF format."""
with tempfile.TemporaryDirectory(dir=str(temp_dir)) as hf_model_dir:
model.save_pretrained(hf_model_dir)
if hasattr(smash_config, "tokenizer") and smash_config.tokenizer:
Collaborator

Could we add an else condition here to tell the user that they must have run smash_config.add_tokenizer("model_id") beforehand, otherwise it will fail?
Or we could even run smash_config.add_tokenizer("model_id") automatically if the tokenizer is not already present in the smash_config?

Author

Got it! I added a ValueError for that scenario and an auto-add_tokenizer flow to catch this.
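
For context, the guard looks roughly like this. A minimal sketch of the behaviour described above; the attribute lookups are assumptions rather than the exact code:

```python
# Ensure a tokenizer is available before the HF-to-GGUF conversion step.
if getattr(smash_config, "tokenizer", None) is None:
    model_id = getattr(getattr(model, "config", None), "_name_or_path", None)
    if model_id:
        # Auto-resolve the tokenizer from the model's own id.
        smash_config.add_tokenizer(model_id)
    else:
        raise ValueError(
            "llama_cpp quantization requires a tokenizer; run "
            "smash_config.add_tokenizer('<model_id>') before smashing the model."
        )
```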

krishjp and others added 2 commits April 28, 2026 10:47
…acks

- Pinned convert_hf_to_gguf.py to tag b8958
- Added automated model tokenizer resolution logic
- Introduced .get() operators to SmashConfig wrappers
- Refactored pyproject.toml
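
The pinned-script handling in these commits boils down to downloading convert_hf_to_gguf.py from the pinned tag once and verifying its checksum before executing it. A rough sketch under those assumptions; the cache path, helper name, and placeholder hash are illustrative, not the exact implementation:

```python
import hashlib
import urllib.request
from pathlib import Path

CONVERSION_SCRIPT_URL = "https://raw.githubusercontent.com/ggml-org/llama.cpp/b8958/convert_hf_to_gguf.py"
EXPECTED_SHA256 = "<pinned hash for tag b8958>"  # placeholder, not the real digest

def fetch_conversion_script(cache_dir: Path) -> Path:
    """Download the pinned conversion script once; refuse to run it on hash mismatch."""
    target = cache_dir / "convert_hf_to_gguf.py"
    if not target.exists():
        with urllib.request.urlopen(CONVERSION_SCRIPT_URL) as response:
            target.write_bytes(response.read())
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != EXPECTED_SHA256:
        target.unlink(missing_ok=True)
        raise RuntimeError("convert_hf_to_gguf.py checksum mismatch; refusing to execute it")
    return target
```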
Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/engine/save.py
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 0711b04.

Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/engine/save.py
Collaborator

@llcnt llcnt left a comment

minor comments added: resolve them and we are good to gooooo! :)

from pruna.engine.utils import verify_sha256
from pruna.logging.logger import pruna_logger

# SHA256 hash for the pinned version (b3600) of convert_hf_to_gguf.py
Collaborator

Very minor comment: update with the new tag b8958 ;)

Comment thread src/pruna/algorithms/llama_cpp.py
actual_key = self._prefix + key
return self._base_config[actual_key]

def get(self, key: str, default: Any = None) -> Any:
Collaborator

get() is defined twice: here and at line 559. I guess some duplicated code.

Comment thread pyproject.toml
"vllm>=0.16.0",
"ray",
]
llamacpp = [
Collaborator

Hi @krishjp, thank you for your contribution :) I just left a comment on another PR that, like yours, rightfully adds an extra to pyproject.toml. There is some config to add to the workflows so the tests can run, so I'll link the comment here; you can follow it by replacing all "kvpress" occurrences with "llamacpp" :)
https://github.com/PrunaAI/pruna/pull/623/changes#r3160069151

As stated in that comment, please let me know about any difficulty concerning these steps :) (pinging @begumcig, who is also working on this)

Collaborator

@gsprochette gsprochette May 2, 2026

Hello @krishjp, just wanted to let you know that we made the process of adding an extra simpler in #653, so there are now fewer steps. You can find the instructions for what you need to do in #654: the line you have here corresponds to the first sentence ("Add a new extra [...]"), so you should still:

  • "Define the dependency group [...]"
  • "[...] set required_install on the algorithm class [...]"
  • "Finally, register a matching requires_<extra> [...]"

Thanks in advance, and please let us know if these steps are unclear so we can improve the documentation :)

Collaborator

Also @krishjp, do not forget to add the extra to the CI matrix as done here ;)



Development

Successfully merging this pull request may close these issues.

[FEATURE] Integrate llama.cpp as a Quantizer

4 participants