
quickfill.nvim


Quick code infill suggestions that combine a local llama.cpp server with your active LSP servers.


Features

  • Local AI Inference: Uses llama.cpp for low-latency, on-device inference; no data leaves your machine.
  • LSP-Backed Context: Leverages your existing LSP servers for rich context (completions & signatures); see the quick check after this list.
  • Prompt Caching: Caches suggestions for repeated contexts to reduce latency.
  • Cross-file Context Chunks: Automatically extracts and includes relevant code snippets from your project files.
  • Git-Aware: Respects .gitignore for context extraction.
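
quickfill can only draw LSP context from servers that are already attached to the current buffer. If suggestions look thin, a quick check with plain Neovim API (nothing plugin-specific) shows what is attached:

-- list the LSP clients attached to the current buffer;
-- quickfill can only use completions/signatures from these
for _, client in ipairs(vim.lsp.get_clients({ bufnr = 0 })) do
    print(client.name)
end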

Installation

vim.pack.add({ "https://github.com/davkk/quickfill.nvim" })

-- no need to call setup!

-- the plugin uses `<Plug>` mappings for flexibility
-- you can map them to your preferred keys like this:
vim.keymap.set("i", "<C-y>", "<Plug>(quickfill-accept)")         -- accept full suggestion
vim.keymap.set("i", "<C-k>", "<Plug>(quickfill-accept-word)")    -- accept next word
vim.keymap.set("i", "<C-x>", "<Plug>(quickfill-trigger)")        -- trigger fresh infill request
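
The mappings above apply in every insert-mode buffer. If you'd rather scope them to a few filetypes, here is a minimal sketch using a FileType autocmd (the filetype list is only an example):

-- map quickfill keys buffer-locally, only for selected filetypes
vim.api.nvim_create_autocmd("FileType", {
    pattern = { "lua", "python", "rust" },    -- adjust to your languages
    callback = function(args)
        local opts = { buffer = args.buf }    -- buffer-local mappings
        vim.keymap.set("i", "<C-y>", "<Plug>(quickfill-accept)", opts)
        vim.keymap.set("i", "<C-k>", "<Plug>(quickfill-accept-word)", opts)
        vim.keymap.set("i", "<C-x>", "<Plug>(quickfill-trigger)", opts)
    end,
})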

Configuration

Customize behavior via vim.g.quickfill.

Any option you leave unset keeps the default shown below:

vim.g.quickfill = {
    url = "http://localhost:8012",          -- llama.cpp server URL

    n_predict = 8,                          -- max tokens to predict
    top_k = 30,                             -- top-k sampling
    top_p = 0.4,                            -- top-p sampling
    repeat_penalty = 1.5,                   -- repeat penalty

    stop_chars = { "\n", "\r", "\r\n" },    -- stop characters
    stop_on_trigger_char = true,            -- stop on trigger chars defined by LSP server

    n_prefix = 16,                          -- prefix context lines
    n_suffix = 8,                           -- suffix context lines

    max_cache_entries = 32,                 -- max cache entries

    extra_chunks = false,                   -- enable extra project chunks
    max_extra_chunks = 4,                   -- max extra chunks
    chunk_lines = 16,                       -- lines per chunk

    lsp_completion = true,                  -- enable LSP completions
    max_lsp_completion_items = 15,          -- max LSP completion items

    lsp_signature_help = false,             -- enable signature help
}
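
You don't need to repeat the whole table. Since unset options keep their defaults, a small partial table is usually enough, for example to turn on the opt-in features (option names taken from the defaults above):

vim.g.quickfill = {
    n_predict = 16,             -- allow slightly longer suggestions
    extra_chunks = true,        -- include cross-file context chunks
    lsp_signature_help = true,  -- also send LSP signature help
}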

Local Inference Server Setup

Before using the plugin, make sure you have a llama.cpp server running.

Here's an example command to start one:

llama-server \
    -hf bartowski/Qwen2.5-Coder-0.5B-GGUF:Q4_0 \
    --n-gpu-layers 99 \
    --threads 8 \
    --ctx-size 0 \
    --flash-attn on \
    --mlock \
    --cache-reuse 256 \
    --verbose \
    --host localhost \
    --port 8012

This starts the server on http://localhost:8012 with optimized settings for the Qwen2.5-Coder-0.5B model. Adjust the host and port as needed.
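
If suggestions never show up, it's worth confirming the server is reachable from Neovim first. Here is a minimal check, assuming llama-server's standard /health endpoint and the URL above:

-- ping the llama.cpp server and report the result
vim.system(
    { "curl", "-s", "http://localhost:8012/health" },  -- same URL as `url` in vim.g.quickfill
    { text = true },
    function(out)
        vim.schedule(function()
            if out.code == 0 and out.stdout ~= "" then
                vim.notify("llama.cpp server says: " .. vim.trim(out.stdout))
            else
                vim.notify("llama.cpp server not reachable", vim.log.levels.WARN)
            end
        end)
    end
)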

Commands

  • Start the plugin with :AI start (or just :AI)
  • Stop it with :AI stop
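
If you want quickfill active in every session without typing the command, a minimal sketch is to start it on the first InsertEnter (assuming the :AI command is available once the plugin is loaded):

-- start quickfill automatically the first time you enter insert mode
vim.api.nvim_create_autocmd("InsertEnter", {
    once = true,                 -- only needs to run once per session
    callback = function()
        vim.cmd("AI start")      -- same as typing :AI start manually
    end,
})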
