
Context window not reset between prompts → 400 “request exceeds available context size” with local llama-server #9797


Description


Error Details

  • Model: Qwen2.5-Coder-32B (LAN)
  • Provider: OpenAI-compatible (local llama.cpp server)
  • Status Code: 400
  • Client: Continue
  • Server: llama.cpp llama-server

Error Output

400 the request exceeds the available context size, try increasing it


Server Configuration

llama-server is started with the following flags:

/opt/llm/llama.cpp/build/bin/llama-server \
  --model /mnt/models/qwen2.5-coder-32b-instruct-q4_k_m/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
  --host <REDACTED> \
  --port <REDACTED> \
  --ctx-size 16384 \
  --n-predict 1024 \
  --threads 16 \
  --parallel 1 \
  --n-gpu-layers 80

Notes:

  • --parallel 1 (single request at a time)
  • Context size is explicitly set to 16384
  • Completion cap is 1024 tokens
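
For reference, here is the rough per-slot budget these flags imply. This is my own arithmetic, not llama.cpp output, and the exact accounting may differ by version:

CTX_SIZE=16384; PARALLEL=1; N_PREDICT=1024
N_CTX_SLOT=$(( CTX_SIZE / PARALLEL ))      # 16384 tokens for the single slot
MAX_PROMPT=$(( N_CTX_SLOT - N_PREDICT ))   # ~15360 tokens left for the prompt
echo "slot context: ${N_CTX_SLOT}, approx. max prompt tokens: ${MAX_PROMPT}"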

Observed Behavior

  • Prompt 1 completes successfully.
  • Prompt 2 fails immediately with a 400 context-overflow error.
  • Server logs show prompt tokens continuing to increase across prompts, rather than starting fresh.

From llama.cpp logs (excerpt):

slot update_slots: new prompt, n_ctx_slot = 16384
task.n_tokens = 15953
send_error: the request exceeds the available context size

This indicates the effective per-request limit is 16384 tokens (prompt plus completion budget) and that Continue is sending cumulative context rather than a fresh prompt: the second prompt itself is short, yet it arrives as 15953 prompt tokens, leaving no room for the 1024-token completion.

Once a prompt completes successfully, shouldn't the next prompt:

  • Start with a fresh token budget (aside from intentional chat history)
  • Not reuse or accumulate prior prompt tokens unless explicitly required

In this case, Prompt 2 is a new task and should not exceed context limits immediately.

Reproduction Steps

Prompt 1 (works)

npm run start

Output:

Error: Cannot find module '/home/jessica/repositories/python/AI-Test/backend/index.js'

(Qwen resolves the issue successfully)

  • Backend starts
  • Frontend starts

Prompt 2 (fails)

Backend now running, and frontend started, now we need to build the frontend

Result:

400 the request exceeds the available context size

No additional files or large inputs were manually added between prompts.
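
For comparison, a fresh short request sent directly to the same server's OpenAI-compatible endpoint succeeds. A check along these lines is what I mean (host/port redacted as above; the model name is only illustrative, since llama-server serves the single model it was started with):

curl -s http://<REDACTED>:<REDACTED>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-32b-instruct",
        "messages": [{"role": "user", "content": "Backend is running; how do we build the frontend?"}],
        "max_tokens": 256
      }'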

Additional Notes

  • This appears to be a client-side issue: llama.cpp correctly enforces the context window and reports the overflow.
  • The issue reproduces consistently after several successful prompts, suggesting context accumulation across requests.
  • Reducing --n-predict mitigates but does not eliminate the issue.
  • The same server behaves correctly with other OpenAI-compatible clients when context is reset per request (e.g. the direct curl request shown above).

It appears Continue may be:

  • Retaining full prior prompt context across requests, and/or
  • Sending a max_tokens value that causes excessive prompt reservation

Either of these would push the prompt token count past the effective server context window, even on short follow-up prompts.
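
If it helps narrow things down, one client-side knob I can experiment with is the model entry's context and completion limits. Assuming Continue's config.json supports contextLength and completionOptions.maxTokens for an OpenAI-compatible provider (my understanding of the schema; please correct me if that's wrong), the entry would look roughly like this:

{
  "title": "Qwen2.5-Coder-32B (LAN)",
  "provider": "openai",
  "model": "qwen2.5-coder-32b-instruct",
  "apiBase": "http://<REDACTED>:<REDACTED>/v1",
  "contextLength": 16384,
  "completionOptions": { "maxTokens": 1024 }
}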

Happy to provide additional logs or test with a debug build if helpful.
