
Context window not reset between prompts → 400 “request exceeds available context size” with local llama-server #9797


Description


Error Details

  • Model: Qwen2.5-Coder-32B (LAN)
  • Provider: OpenAI-compatible (local llama.cpp server)
  • Status Code: 400
  • Client: Continue
  • Server: llama.cpp llama-server

Error Output

400 the request exceeds the available context size, try increasing it


Server Configuration

llama-server is started with the following flags:

/opt/llm/llama.cpp/build/bin/llama-server \
  --model /mnt/models/qwen2.5-coder-32b-instruct-q4_k_m/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
  --host <REDACTED> \
  --port <REDACTED> \
  --ctx-size 16384 \
  --n-predict 1024 \
  --threads 16 \
  --parallel 1 \
  --n-gpu-layers 80

Notes:

  • --parallel 1 (single request at a time)
  • Context size is explicitly set to 16384
  • Completion cap is 1024 tokens
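
For reference, here is the rough per-slot budget these flags imply. This is my own arithmetic, not llama.cpp output, and the exact accounting may differ by version:

CTX_SIZE=16384; PARALLEL=1; N_PREDICT=1024
N_CTX_SLOT=$(( CTX_SIZE / PARALLEL ))      # 16384 tokens for the single slot
MAX_PROMPT=$(( N_CTX_SLOT - N_PREDICT ))   # ~15360 tokens left for the prompt
echo "slot context: ${N_CTX_SLOT}, approx. max prompt tokens: ${MAX_PROMPT}"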

Observed Behavior

  • Prompt 1 completes successfully.
  • Prompt 2 fails immediately with a 400 context-overflow error.
  • Server logs show prompt tokens continuing to increase across prompts, rather than starting fresh.

From llama.cpp logs (excerpt):

slot update_slots: new prompt, n_ctx_slot = 16384
task.n_tokens = 15953
send_error: the request exceeds the available context size

This indicates the effective per-request limit is 16384 tokens (prompt plus completion budget) and that Continue is sending cumulative context rather than a fresh prompt: the second prompt itself is short, yet it arrives as 15953 prompt tokens, leaving no room for the 1024-token completion.

Once a prompt completes successfully, shouldn't the next prompt:

  • Start with a fresh token budget (aside from intentional chat history)
  • Not reuse or accumulate prior prompt tokens unless explicitly required

In this case, Prompt 2 is a new task and should not exceed context limits immediately.

Reproduction Steps

Prompt 1 (works)

npm run start

Output:

Error: Cannot find module '/home/jessica/repositories/python/AI-Test/backend/index.js'

(Qwen resolves the issue successfully)

  • Backend starts
  • Frontend starts

Prompt 2 (fails)

Backend now running, and frontend started, now we need to build the frontend

Result:

400 the request exceeds the available context size

No additional files or large inputs were manually added between prompts.
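
For comparison, a fresh short request sent directly to the same server's OpenAI-compatible endpoint succeeds. A check along these lines is what I mean (host/port redacted as above; the model name is only illustrative, since llama-server serves the single model it was started with):

curl -s http://<REDACTED>:<REDACTED>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-32b-instruct",
        "messages": [{"role": "user", "content": "Backend is running; how do we build the frontend?"}],
        "max_tokens": 256
      }'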

Additional Notes

  • This appears to be a client-side issue: llama.cpp correctly enforces the context window and reports the overflow.
  • The issue reproduces consistently after several successful prompts, suggesting context accumulation across requests.
  • Reducing --n-predict mitigates but does not eliminate the issue.
  • The same server behaves correctly with other OpenAI-compatible clients when context is reset per request (e.g. the direct curl request shown above).

It appears Continue may be:

  • Retaining full prior prompt context across requests, and/or
  • Sending a max_tokens value that causes excessive prompt reservation

Either of these would push the prompt token count past the effective server context window, even on short follow-up prompts.
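
If it helps narrow things down, one client-side knob I can experiment with is the model entry's context and completion limits. Assuming Continue's config.json supports contextLength and completionOptions.maxTokens for an OpenAI-compatible provider (my understanding of the schema; please correct me if that's wrong), the entry would look roughly like this:

{
  "title": "Qwen2.5-Coder-32B (LAN)",
  "provider": "openai",
  "model": "qwen2.5-coder-32b-instruct",
  "apiBase": "http://<REDACTED>:<REDACTED>/v1",
  "contextLength": 16384,
  "completionOptions": { "maxTokens": 1024 }
}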

Happy to provide additional logs or test with a debug build if helpful.
