Native llama.cpp bindings for Bare.
Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.
- CMake 3.25+
- C/C++ compiler (clang, gcc, or MSVC)
- Node.js (for npm/cmake-bare)
- Bare runtime
Clone with submodules:
```sh
git clone --recursive https://github.com/user/bare-llama.cpp
cd bare-llama.cpp
```

Or if already cloned:
```sh
git submodule update --init --recursive
```

Install dependencies and build:
```sh
npm install
npm run prebuild
```

Or manually:
```sh
bare-make generate
bare-make build
bare-make install
```

This creates `prebuilds/<platform>-<arch>/bare-llama.bare`.
For a debug build:
```sh
bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build
```

To disable GPU acceleration:
```sh
bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build
```

```js
const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')
// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
  nGpuLayers: 99 // Offload layers to GPU (0 = CPU only)
})
// Create context
const ctx = new LlamaContext(model, {
  contextSize: 2048, // Max context length
  batchSize: 512 // Batch size for prompt processing
})
// Create sampler
const sampler = new LlamaSampler({
  temp: 0.7, // Temperature (0 = greedy)
  topK: 40, // Top-K sampling
  topP: 0.95 // Top-P (nucleus) sampling
})
// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)
// Cleanup
sampler.free()
ctx.free()
model.free()
```

See `examples/` for more:

- `basic.js` - Simple generation using the high-level API
- `lowlevel.js` - Token-by-token generation with full control
Run examples with:
```sh
bare examples/basic.js -- /path/to/model.gguf
```

`new LlamaModel(path, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `nGpuLayers` | number | 0 | Number of layers to offload to GPU |
Methods:
- `tokenize(text, addBos?)` - Convert text to tokens (Int32Array)
- `detokenize(tokens)` - Convert tokens back to text
- `isEogToken(token)` - Check if token is end-of-generation
- `free()` - Release model resources
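For example, a quick round trip through `tokenize` and `detokenize`. This is a minimal sketch; the model path is a placeholder and `addBos` is passed as `true` here:

```js
const { LlamaModel } = require('bare-llama')

const model = new LlamaModel('./model.gguf', { nGpuLayers: 0 })

// Text -> tokens (Int32Array), with a BOS token prepended
const tokens = model.tokenize('Hello, world!', true)
console.log(tokens.length, 'tokens')

// Tokens -> text
console.log(model.detokenize(tokens))

model.free()
```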
`new LlamaContext(model, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `contextSize` | number | 512 | Maximum context length |
| `batchSize` | number | 512 | Batch size for processing |
Methods:
- `decode(tokens)` - Process tokens through the model
- `free()` - Release context resources
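For example, a minimal sketch of feeding a prompt through the context before sampling, reusing the `model` and `ctx` from the usage example above:

```js
// Tokenize the prompt and run it through the model so a sampler
// can pick the next token at the last position.
const prompt = model.tokenize('The meaning of life is', true)
ctx.decode(prompt)
```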
`new LlamaSampler(options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `temp` | number | 0 | Temperature (0 = greedy sampling) |
| `topK` | number | 40 | Top-K sampling parameter |
| `topP` | number | 0.95 | Top-P (nucleus) sampling parameter |
Methods:
- `sample(ctx, idx)` - Sample next token (`-1` for last position)
- `accept(token)` - Accept token into sampler state
- `free()` - Release sampler resources
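Putting the low-level pieces together, here is a minimal token-by-token generation sketch in the spirit of `examples/lowlevel.js`. It assumes `decode()` can also be called with a single-token `Int32Array` to advance the context one step at a time; the model path is a placeholder:

```js
const { LlamaModel, LlamaContext, LlamaSampler } = require('bare-llama')

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048, batchSize: 512 })
const sampler = new LlamaSampler({ temp: 0.7, topK: 40, topP: 0.95 })

// Process the prompt
ctx.decode(model.tokenize('The meaning of life is', true))

let output = ''

for (let i = 0; i < 128; i++) {
  const token = sampler.sample(ctx, -1) // sample at the last position
  if (model.isEogToken(token)) break // stop at end-of-generation

  sampler.accept(token)
  output += model.detokenize(new Int32Array([token]))

  ctx.decode(new Int32Array([token])) // feed the new token back in
}

console.log(output)

sampler.free()
ctx.free()
model.free()
```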
`generate(model, ctx, sampler, prompt, maxTokens?)`

Convenience function for simple text generation. Returns the generated text (not including the prompt).
This addon works with GGUF format models. Example models to try:
- `TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf` - Small, fast
- `Mistral-7B-Instruct-v0.2.Q4_K_M.gguf` - Good quality
- `Llama-2-13B-chat.Q4_K_M.gguf` - Higher quality
| Platform | Architecture | GPU Support |
|---|---|---|
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64 | CUDA (if available) |
MIT