
feat: integrate Llama.cpp and enhance engine stability for cross-platform usage#616

Open
krishjp wants to merge 19 commits into PrunaAI:main from krishjp:feat/llama-cpp

Conversation

@krishjp

@krishjp krishjp commented Apr 6, 2026

Description

This PR integrates the Llama.cpp quantizer engine into Pruna, enabling GGUF-based quantization. In addition to the new feature, this PR addresses critical compatibility issues for Python 3.13 and improves cross-platform robustness on Windows.

Key Changes:

  • Engine Support: Integrated llama-cpp-python as a new quantizer backend, supporting various GGUF quantization methods (e.g., q4_k_m).
  • Python 3.13 Compatibility: Fixed a KeyError in SAVE_FUNCTIONS and LOAD_FUNCTIONS by explicitly wrapping callable members with enum.member() (with a backward-compatible fallback for older Python versions); see the sketch below this list.
  • Stability: Implemented safer cache-directory cleanup in SmashConfig to prevent an AttributeError during interpreter shutdown on Windows.
  • Consistency: Added a save() alias to PrunaModel to match save_pretrained() and ensured consistent attribute delegation for non-torch backends.
  • Dependencies: Added the llamacpp optional dependency group and updated the full extra in pyproject.toml.
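
For illustration, the enum.member() change looks roughly like this. This is a minimal sketch, not Pruna's actual registry: the member name and the save helper are placeholders.

```python
import enum
import sys
from functools import partial

def save_torch_model(model, path, device="cpu"):  # placeholder save helper
    ...

# On Python 3.13, a bare functools.partial assigned in the Enum body is no
# longer registered as a member, so name lookup raised KeyError. Wrapping it
# with enum.member() (available since Python 3.11) forces member registration;
# older interpreters keep the previous behaviour via the identity fallback.
_member = enum.member if sys.version_info >= (3, 11) else (lambda value: value)

class SAVE_FUNCTIONS(enum.Enum):
    pytorch = _member(partial(save_torch_model, device="cpu"))

# Name lookup works again on 3.13, and the wrapped callable sits in .value:
SAVE_FUNCTIONS["pytorch"].value(model=None, path="model.bin")
```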

Related Issue

Fixes #377

Related PRs

#583 - takes a more general look at the enum modification

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Integration Tests: Verified that the TestLlamaCpp suite (llama_cpp.py) passes on Windows with Python 3.12 and 3.13.
  • Diagnostic Scripts: Confirmed correct Enum member registration for the engine save/load functions.
  • Local Benchmarking Script: Smashed SmolLM2-135M-Instruct with llama.cpp q4_k_m quantization (see the sketch below this list).
    • Compression: 4.88x reduction in model size.
    • Speedup: 4.14x faster inference (tokens/sec) on CPU.
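
For reference, the benchmarking flow follows the standard Pruna workflow. A minimal sketch, assuming the quantizer is registered under the name "llama_cpp"; the GGUF-level hyperparameter key used here is hypothetical:

```python
from pruna import SmashConfig, smash
from transformers import AutoModelForCausalLM

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)

smash_config = SmashConfig()
smash_config.add_tokenizer(model_id)             # needed for the GGUF conversion step
smash_config["quantizer"] = "llama_cpp"          # name assumed from this PR
smash_config["llama_cpp_quant_type"] = "q4_k_m"  # hypothetical hyperparameter key

smashed_model = smash(model=model, smash_config=smash_config)
smashed_model.save_pretrained("smollm2-135m-q4_k_m")  # or the new save() alias
```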

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

The TypeError occasionally observed during llama-cpp-python shutdown is a known upstream issue in their __del__ implementation during interpreter termination and does not affect the performance or correctness of the Smash/Save operations.
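
For reference, the shutdown-safe cleanup mentioned under Stability follows roughly this pattern. This is a minimal sketch; the method and attribute names are assumptions, not the exact SmashConfig code:

```python
import shutil

def _cleanup_cache_dir(self) -> None:
    """Best-effort removal of the temporary cache directory."""
    cache_dir = getattr(self, "cache_dir", None)
    # During interpreter shutdown on Windows, module globals may already be
    # torn down, so guard every lookup instead of letting AttributeError escape.
    if cache_dir is None or shutil is None:
        return
    try:
        shutil.rmtree(cache_dir, ignore_errors=True)
    except (AttributeError, OSError):
        pass
```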

@codacy-production

codacy-production Bot commented Apr 6, 2026

Not up to standards ⛔

🔴 Issues: 1 new issue — 1 critical (Security). Alert: exceeds the threshold of 0 issues of at least minor severity.

🟢 Metrics: Complexity 152

View in Codacy


@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

@krishjp krishjp changed the title from "Feat/llama cpp" to "feat: integrate Llama.cpp and enhance engine stability for cross-platform usage" Apr 6, 2026
@krishjp
Author

krishjp commented Apr 6, 2026

Hi @llcnt and @gsprochette! Here is an updated draft PR to replace #584.
I'm looking at the last few Codacy issues that were brought up, but the main codebase changes should be in place. ruff check also flagged some fixes from older commits, so those are included here as well.

@krishjp
Author

krishjp commented Apr 7, 2026

@cursor review

Comment thread src/pruna/engine/save.py Outdated
@krishjp krishjp marked this pull request as ready for review April 7, 2026 15:48
@krishjp krishjp force-pushed the feat/llama-cpp branch 2 times, most recently from 134cf0f to 09789d0, April 7, 2026 16:01
Collaborator

@llcnt llcnt left a comment

Thank you for the improved version of the PR!
We are definitely very close to the final step :)

Comment thread src/pruna/algorithms/llama_cpp.py Outdated
Comment thread src/pruna/algorithms/llama_cpp.py Outdated
Comment thread src/pruna/algorithms/llama_cpp.py Outdated
Comment thread src/pruna/engine/pruna_model.py Outdated
Comment thread src/pruna/engine/load.py
Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/engine/save.py
krishjp and others added 11 commits April 20, 2026 15:50
…device checks for llama-cpp models due to a lack of model.parameters() support
…on 3.13

- addressed functools.partial object compatibility with py 3.13
- integrated enum.member() in SAVE_FUNCTIONS and LOAD_FUNCTIONS
- updated the LlamaCpp algorithm implementation to utilize the standardized
  naming convention.
- cleaned up redundant commented-out logic in the save_pruna_model function.

Verified through restoration of LlamaCpp integration tests and diagnostic
scripts confirming Enum member registration.
…form usage

- standardized LlamaCpp implementation and naming conventions within the engine
- implemented cache directory cleanup to prevent shutdown errors on Windows
- added a save() alias to the base model wrapper for improved API consistency
- updated project configuration with Llama.cpp and dependency group
- benchmarked using SmolLM2-135M-Instruct with q4_k_m quantization
- added Int class for integer-based configuration.
- updated get_device and model_checks for llama_cpp.
- implemented secure conversion script caching.
- enabled TestLlamaCpp and removed manual test overrides.
begumcig and others added 4 commits April 21, 2026 08:40
* feat: initial implementation for rapidata

* ci: add rapidata dependency and some cleanup

* Guard optional rapidata metric import and tighten validation

Applied via @cursor push command

* refactor: address PR comments

* feat: add polling and address further PR comments

* refactor: add mixin for setting context

* ci: add evaluation as an umbrella dep

* refactor: address PR comments

* ci: separate rapidata matrix

* fix: minor issues

* ci: make tests import safe

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
* build: bump max python to 3.13

* build: isolate realesrgan in an extra because no 3.13 basicsr wheels are available
@krishjp
Author

krishjp commented Apr 21, 2026

Hi @llcnt, I made some updates to address your comments. Take a look when you get a chance and let me know if you spot anything that still needs changes. Cheers!

@krishjp krishjp requested a review from llcnt April 21, 2026 21:07
@llcnt
Collaborator

llcnt commented Apr 22, 2026

@cursor review

Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread pyproject.toml Outdated
Collaborator

@llcnt llcnt left a comment

Only a few small changes to make, and we are good to go :)

  • check the cursor comments
  • change the tag of the convert file (use a newer one and test it)
  • make sure the user has added the tokenizer to the smash_config before smashing the model

Thanks !!

Comment thread src/pruna/algorithms/llama_cpp.py Outdated
from pruna.logging.logger import pruna_logger

# SHA256 hash for the pinned version (b3600) of convert_hf_to_gguf.py
LLAMA_CPP_CONVERSION_SCRIPT_URL = "https://raw.githubusercontent.com/ggml-org/llama.cpp/b3600/convert_hf_to_gguf.py"
Collaborator

Can we pin a newer tag? The tag b3600 is very old and does not work on newer models (e.g. Qwen3).

"""Save HF model and convert it to GGUF format."""
with tempfile.TemporaryDirectory(dir=str(temp_dir)) as hf_model_dir:
model.save_pretrained(hf_model_dir)
if hasattr(smash_config, "tokenizer") and smash_config.tokenizer:
Collaborator

Could we add an else condition here to tell the user that they must have run smash_config.add_tokenizer("model_id") beforehand, otherwise it will fail?
Or we could even run smash_config.add_tokenizer("model_id") automatically if the tokenizer is not already present in the smash_config?

Author

Got it! I added a ValueError for that scenario and an auto-add_tokenizer flow to catch this.
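
For context, the guard looks roughly like this. A minimal sketch of the behaviour described above; the attribute lookups are assumptions rather than the exact code:

```python
# Ensure a tokenizer is available before the HF-to-GGUF conversion step.
if getattr(smash_config, "tokenizer", None) is None:
    model_id = getattr(getattr(model, "config", None), "_name_or_path", None)
    if model_id:
        # Auto-resolve the tokenizer from the model's own id.
        smash_config.add_tokenizer(model_id)
    else:
        raise ValueError(
            "llama_cpp quantization requires a tokenizer; run "
            "smash_config.add_tokenizer('<model_id>') before smashing the model."
        )
```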

krishjp and others added 2 commits April 28, 2026 10:47
…acks

- Pinned convert_hf_to_gguf.py to tag b8958
- Added automated model tokenizer resolution logic
- Introduced .get() operators to SmashConfig wrappers
- Refactored pyproject.toml
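
The pinned-script handling in these commits boils down to downloading convert_hf_to_gguf.py from the pinned tag once and verifying its checksum before executing it. A rough sketch under those assumptions; the cache path, helper name, and placeholder hash are illustrative, not the exact implementation:

```python
import hashlib
import urllib.request
from pathlib import Path

CONVERSION_SCRIPT_URL = "https://raw.githubusercontent.com/ggml-org/llama.cpp/b8958/convert_hf_to_gguf.py"
EXPECTED_SHA256 = "<pinned hash for tag b8958>"  # placeholder, not the real digest

def fetch_conversion_script(cache_dir: Path) -> Path:
    """Download the pinned conversion script once; refuse to run it on hash mismatch."""
    target = cache_dir / "convert_hf_to_gguf.py"
    if not target.exists():
        with urllib.request.urlopen(CONVERSION_SCRIPT_URL) as response:
            target.write_bytes(response.read())
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != EXPECTED_SHA256:
        target.unlink(missing_ok=True)
        raise RuntimeError("convert_hf_to_gguf.py checksum mismatch; refusing to execute it")
    return target
```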
Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/engine/save.py
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 0711b04.

Comment thread src/pruna/algorithms/llama_cpp.py
Comment thread src/pruna/engine/save.py
Collaborator

@llcnt llcnt left a comment

minor comments added: resolve them and we are good to gooooo! :)

from pruna.engine.utils import verify_sha256
from pruna.logging.logger import pruna_logger

# SHA256 hash for the pinned version (b3600) of convert_hf_to_gguf.py
Collaborator

Very minor comment: update with the new tag b8958 ;)

Comment thread src/pruna/algorithms/llama_cpp.py
actual_key = self._prefix + key
return self._base_config[actual_key]

def get(self, key: str, default: Any = None) -> Any:
Collaborator

get() is defined twice: here and at line 559. I guess some duplicated code.

Comment thread pyproject.toml
"vllm>=0.16.0",
"ray",
]
llamacpp = [
Collaborator

Hi @krishjp, thank you for your contribution :) I just left a comment on another PR that, like yours, rightfully adds an extra to pyproject.toml. There is some config to add to the workflows so the tests can run, so I'll link the comment here; you can follow it by replacing all "kvpress" occurrences with "llamacpp" :)
https://github.com/PrunaAI/pruna/pull/623/changes#r3160069151

As stated in that comment, please let me know about any difficulty concerning these steps :) (pinging @begumcig, who is also working on this)

Collaborator

@gsprochette gsprochette May 2, 2026

Hello @krishjp, just wanted to let you know that we made the process of adding an extra simpler in #653, so there are now fewer steps. You can find the instructions for what you need to do in #654: the line you have here corresponds to the first sentence ("Add a new extra [...]"), so you should still:

  • "Define the dependency group [...]"
  • "[...] set required_install on the algorithm class [...]"
  • "Finally, register a matching requires_<extra> [...]"

Thanks in advance, and please let us know if these steps are unclear so we can improve the documentation :)

Collaborator

Also @krishjp, do not forget to add the extra to the CI matrix as done here ;)



Development

Successfully merging this pull request may close these issues.

[FEATURE] Integrate llama.cpp as a Quantizer

4 participants