Description
Error Details
- Model: Qwen2.5-Coder-32B (LAN)
- Provider: OpenAI-compatible (local llama.cpp server)
- Status Code: 400
- Client: Continue
- Server: llama.cpp (llama-server)
Error Output
400 the request exceeds the available context size, try increasing it
Server Configuration
llama-server is started with the following flags:
/opt/llm/llama.cpp/build/bin/llama-server \
--model /mnt/models/qwen2.5-coder-32b-instruct-q4_k_m/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
--host <REDACTED> \
--port <REDACTED> \
--ctx-size 16384 \
--n-predict 1024 \
--threads 16 \
--parallel 1 \
--n-gpu-layers 80
Notes:
- --parallel 1 (single request at a time)
- Context size is explicitly set to 16384
- Completion cap is 1024 tokens
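For reference, a rough sketch of the token budget these flags imply, assuming the server requires the prompt plus the reserved completion to fit within the slot context (the variable names are illustrative, not llama.cpp internals):

```typescript
// Illustrative budget arithmetic only.
const nCtxSlot = 16384;     // --ctx-size with --parallel 1 (one slot)
const nPredict = 1024;      // --n-predict, also requested as max_tokens
const promptTokens = 15953; // task.n_tokens from the failing request (see log excerpt below)

// If the completion is reserved up front, the effective prompt budget is:
const promptBudget = nCtxSlot - nPredict; // 15360 tokens

console.log(promptTokens <= promptBudget); // false -> 400 context-overflow
```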
Observed Behavior
- Prompt 1 completes successfully.
- Prompt 2 fails immediately with a 400 context-overflow error.
- Server logs show prompt tokens continuing to increase across prompts, rather than starting fresh.
From llama.cpp logs (excerpt):
slot update_slots: new prompt, n_ctx_slot = 16384
task.n_tokens = 15953
send_error: the request exceeds the available context size
This suggests the prompt plus the reserved completion must fit within the 16384-token slot, and that Continue is sending cumulative context rather than a fresh prompt.
Once a prompt completes successfully, shouldn't the next prompt:
- Start with a fresh token budget (aside from intentional chat history)
- Not reuse or accumulate prior prompt tokens unless explicitly required
In this case, Prompt 2 is a new task and should not exceed context limits immediately.
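For context, here is a minimal sketch of the kind of per-request trimming described above. This is an assumption about what a client could do, not Continue's actual implementation; ChatMessage, countTokens, and the 4-characters-per-token estimate are all illustrative.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude token estimate (~4 characters per token); a real client would use the
// model's tokenizer.
const countTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep only as much recent history as fits in (contextSize - maxTokens),
// dropping the oldest messages first.
function trimToBudget(
  messages: ChatMessage[],
  contextSize: number, // e.g. 16384 to match --ctx-size
  maxTokens: number    // e.g. 1024 to match --n-predict / max_tokens
): ChatMessage[] {
  const budget = contextSize - maxTokens;
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = countTokens(messages[i].content);
    if (used + cost > budget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```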
Reproduction Steps
Prompt 1 (works)
npm run start
Output:
Error: Cannot find module '/home/jessica/repositories/python/AI-Test/backend/index.js'
(Qwen resolves issue successfully)
- Backend starts
- Frontend starts
Prompt 2 (fails)
Backend now running, and frontend started, now we need to build the frontend
Result:
400 the request exceeds the available context size
No additional files or large inputs were manually added between prompts.
Additional Notes
- This appears to be a client-side issue, as llama.cpp correctly enforces the context window and reports the overflow.
- The issue reproduces consistently after several successful prompts, suggesting context accumulation across requests.
- Reducing --n-predict mitigates but does not eliminate the issue.
- The same server behaves correctly with other OpenAI-compatible clients when context is reset per request (see the fresh-request sketch below).
It appears Continue may be:
- Retaining full prior prompt context across requests, and/or
- Sending a max_tokens value that causes excessive prompt reservation,

resulting in prompt token counts that exceed the effective server context window even on short follow-up prompts.
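To illustrate what "reset per request" means here, this is a hedged sketch of a fresh call to the same OpenAI-compatible endpoint with no accumulated history; the model name and the redacted host/port are placeholders.

```typescript
// A single short prompt with max_tokens matching --n-predict stays well inside
// the 16384-token window when no prior conversation is attached.
async function freshRequest(prompt: string) {
  const response = await fetch("http://<REDACTED>:<REDACTED>/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5-coder-32b-instruct", // placeholder model name
      messages: [{ role: "user", content: prompt }], // no accumulated history
      max_tokens: 1024,
    }),
  });
  return response.json();
}
```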
Happy to provide additional logs or test with a debug build if helpful.