LocalCode is a Python middleware server that sits between OpenCode (client) and various LLM backends (currently GLM 4.7). It provides request interception, intelligent caching, and request transformation to optimize local LLM execution speed.
```
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  OpenCode   │────▶│  LocalCode   │────▶│   Backend    │
│   Client    │     │  Middleware  │     │  (GLM 4.7)   │
└─────────────┘     └──────────────┘     └──────────────┘
```
- OpenAI-compatible API (`/v1/chat/completions`)
- Request/response logging with pretty printing
- Streaming support (native async generators)
- Tool call detection and markers
- Health check endpoint
- KV cache hashing based on system prompt + message history
- State save/load to disk (bypass prefill latency)
- LRU cache for session states
- Cache key computation: `SHA256(system_prompt + messages + tools)`
- Speculative decoding orchestration (draft/target model pairs)
- KV cache state management (slot hot-swap)
- Continuous batching for multi-user throughput
- Flash attention optimization
- Grammar constraint enforcement (GBNF)
- Native function calling support
- FIM (Fill-In-The-Middle) completions
- Control vector runtime steering
- Multimodal support (vision projectors)
- Reranking for codebase search
- Multi-model routing (coder vs. general vs. vision)
- Request type classification
- Dynamic model selection based on request content
- Adaptive caching strategies
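The routing features above are future work; the following is only a sketch of what request-type classification and model selection might look like. Every name, model identifier, and heuristic here is hypothetical.

```python
from typing import Any, Dict

# Hypothetical model registry; real identifiers would come from configuration.
MODELS: Dict[str, str] = {
    "coder": "coder-model",
    "general": "general-model",
    "vision": "vision-model",
}


def classify_request(request: Dict[str, Any]) -> str:
    """Rough heuristic classification of an incoming chat request."""
    messages = request.get("messages", [])
    # OpenAI-style multimodal messages carry content as a list of parts.
    if any(isinstance(m.get("content"), list) for m in messages):
        return "vision"
    text = " ".join(str(m.get("content", "")) for m in messages).lower()
    if "```" in text or "def " in text or "traceback" in text:
        return "coder"
    return "general"


def select_model(request: Dict[str, Any]) -> str:
    """Map the classified request type to a backend model identifier."""
    return MODELS[classify_request(request)]
```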
Core Server (main.py)
- FastAPI server listening on port 4242
- OpenAI-compatible `/v1/chat/completions` endpoint
- `/health` health check endpoint
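A minimal sketch of the server shape described above, not the actual `main.py`. The endpoint paths, port, and constants mirror this document; the upstream request path and forwarding details are assumptions.

```python
from typing import Any, Dict

import httpx
import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse

GLM_API_URL = "https://api.z.ai/api/coding/paas/v4"
API_KEY = "dummy"
PORT = 4242

app = FastAPI()


@app.get("/health")
async def health() -> Dict[str, str]:
    # Matches the health payload documented below.
    return {"status": "ok", "provider": "LocalCode", "model": "GLM-4.7"}


@app.post("/v1/chat/completions")
async def chat_completions(body: Dict[str, Any]) -> JSONResponse:
    # Non-streaming passthrough: forward the OpenAI-style payload to the backend.
    async with httpx.AsyncClient(timeout=120.0) as client:
        upstream = await client.post(
            f"{GLM_API_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=body,
        )
    return JSONResponse(content=upstream.json(), status_code=upstream.status_code)


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT)
```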
Request Logging
- `log_request()` - Pretty prints incoming requests
- Model, stream status, message count
- Tool definitions logging
- Message role detection (system, user, assistant, tool)
- Content previewing (truncated at 100-150 chars)
Response Logging
- `log_response()` - Pretty prints GLM responses
- Content preview, tool calls, finish reason
- Token usage tracking (prompt, completion, total)
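A sketch of the pretty-printing helpers described in the two sections above. The function names `log_request()` and `log_response()` come from this document; the exact console format and the `_preview` helper are illustrative.

```python
from typing import Any, Dict


def _preview(text: str, limit: int = 150) -> str:
    """Truncate long content for readable console output."""
    return text if len(text) <= limit else text[:limit] + "..."


def log_request(request: Dict[str, Any]) -> None:
    """Pretty print an incoming OpenAI-style chat request."""
    print(f"[Request] model={request.get('model')} stream={request.get('stream', False)}")
    print(f"  messages={len(request.get('messages', []))} tools={len(request.get('tools', []))}")
    for msg in request.get("messages", []):
        content = msg.get("content") or ""
        print(f"  [{msg.get('role')}] {_preview(str(content))}")


def log_response(response: Dict[str, Any]) -> None:
    """Pretty print a backend response: content preview, tool calls, usage."""
    choice = response.get("choices", [{}])[0]
    message = choice.get("message", {})
    print(f"[Response] finish_reason={choice.get('finish_reason')}")
    print(f"  content: {_preview(str(message.get('content') or ''))}")
    if message.get("tool_calls"):
        print(f"  tool_calls: {len(message['tool_calls'])}")
    usage = response.get("usage", {})
    print(f"  tokens: prompt={usage.get('prompt_tokens')} "
          f"completion={usage.get('completion_tokens')} total={usage.get('total_tokens')}")
```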
Tool Detection
- `has_tool_calls()` - Detects tools in request
- `transform_request_to_glm()` - Passes through tool definitions
- Tool call markers: `[Tool Request]`, `[Tool Call]`, `[Tool Call Complete]`
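A sketch of the tool-handling helpers named above. The function names and marker strings come from this document; the body logic is an assumption.

```python
from typing import Any, Dict

TOOL_REQUEST = "[Tool Request]"
TOOL_CALL = "[Tool Call]"
TOOL_CALL_COMPLETE = "[Tool Call Complete]"


def has_tool_calls(request: Dict[str, Any]) -> bool:
    """True when the client supplied tool definitions with the request."""
    return bool(request.get("tools"))


def transform_request_to_glm(request: Dict[str, Any]) -> Dict[str, Any]:
    """Pass tool definitions through unchanged, logging a marker when present."""
    glm_request = dict(request)
    if has_tool_calls(request):
        print(f"{TOOL_REQUEST} {len(request['tools'])} tool(s) forwarded")
    return glm_request
```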
Streaming
- `stream_generator()` - Native Python async generators
- SSE (Server-Sent Events) format
- `data:` prefix and `[DONE]` marker handling
- Chunk-by-chunk logging
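A sketch of the streaming path: an async generator that relays backend SSE chunks (`data: ...` lines terminated by `[DONE]`). The function name comes from this document; the upstream URL, headers, and relay logic are assumptions.

```python
from typing import Any, AsyncGenerator, Dict

import httpx

GLM_API_URL = "https://api.z.ai/api/coding/paas/v4"
API_KEY = "dummy"


async def stream_generator(request: Dict[str, Any]) -> AsyncGenerator[str, None]:
    """Relay upstream SSE lines to the client chunk by chunk."""
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            f"{GLM_API_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=request,
        ) as upstream:
            async for line in upstream.aiter_lines():
                if not line:
                    continue
                # The backend already emits the "data: {...}" prefix and the
                # final "data: [DONE]" marker, so lines are forwarded as-is.
                yield line + "\n\n"
```

On the FastAPI side this generator would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.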
Functional Design
- Pure functions (no OOP)
- Type hints with `Dict[str, Any]`, `AsyncGenerator`
- Separated concerns: logging, transformation, forwarding
None - focusing on Phase 1 completion.
Phase 2: State Management
- Add SHA256 hashing function for cache keys
- Implement cache directory structure
- Add `save_state()` and `load_state()` functions
- Implement LRU eviction policy
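A minimal sketch of these Phase 2 pieces, kept in the project's pure-function style: SHA256 cache keys over the messages (which include the system prompt) plus tools, and disk-backed `save_state()` / `load_state()` with LRU eviction. The function names come from this document; the cache directory, capacity, and index structure are hypothetical.

```python
import hashlib
import json
from collections import OrderedDict
from pathlib import Path
from typing import Any, Dict, Optional

CACHE_DIR = Path(".localcode_cache")   # hypothetical location
MAX_CACHED_SESSIONS = 32               # hypothetical LRU capacity


def cache_key(request: Dict[str, Any]) -> str:
    """SHA256 over system prompt + message history + tools (system prompt travels inside messages)."""
    material = json.dumps(
        {"messages": request.get("messages", []), "tools": request.get("tools", [])},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def save_state(index: "OrderedDict[str, Path]", key: str, state: bytes) -> None:
    """Persist a state blob to disk and evict least-recently-used entries."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{key}.bin"
    path.write_bytes(state)
    index[key] = path
    index.move_to_end(key)
    while len(index) > MAX_CACHED_SESSIONS:
        _, evicted = index.popitem(last=False)
        evicted.unlink(missing_ok=True)


def load_state(index: "OrderedDict[str, Path]", key: str) -> Optional[bytes]:
    """Return a cached state blob if present, refreshing its LRU position."""
    path = index.get(key)
    if path is None or not path.exists():
        return None
    index.move_to_end(key)
    return path.read_bytes()
```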
Phase 3: llama.cpp Integration
- Add llama.cpp backend support alongside GLM 4.7
- Implement slot management API
- Add grammar constraint enforcement
- Implement speculative decoding orchestration
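A hedged sketch of what the llama.cpp side of Phase 3 could look like: calling a local `llama-server` `/completion` endpoint with a GBNF grammar and a slot id, and saving a slot's KV cache to disk. These endpoints and fields exist in recent llama.cpp server builds, but exact parameter names should be verified against the server README for the target version; the URL, grammar, and function names here are illustrative.

```python
import httpx

LLAMA_URL = "http://127.0.0.1:8080"

# GBNF grammar constraining the model to a yes/no answer.
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'


async def constrained_completion(prompt: str, slot: int = 0) -> str:
    """Request a grammar-constrained completion pinned to a given slot."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(
            f"{LLAMA_URL}/completion",
            json={"prompt": prompt, "grammar": YES_NO_GRAMMAR, "id_slot": slot},
        )
        resp.raise_for_status()
        return resp.json()["content"]


async def save_slot(slot: int, filename: str) -> None:
    """Persist the slot's KV cache to disk (server must run with --slot-save-path)."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{LLAMA_URL}/slots/{slot}?action=save",
            json={"filename": filename},
        )
        resp.raise_for_status()
```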
- Pure functions instead of classes
- Explicit data flow (no hidden state)
- Testable in isolation
- Easier to reason about
- Native Python async generators for streaming
- Non-blocking I/O for multiple concurrent requests
- `AsyncGenerator[str, None]` for SSE streams
- Standard OpenAI request/response format
- Allows drop-in replacement for existing providers
- Tool calling support via OpenAI format
- Streaming via SSE
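Because LocalCode speaks the OpenAI wire format, an unmodified OpenAI client can be pointed at it. A usage sketch with the official `openai` package; the localhost URL and port come from this document, and the dummy API key matches the configured value.

```python
from openai import OpenAI

# Point the standard OpenAI client at the LocalCode middleware.
client = OpenAI(base_url="http://localhost:4242/v1", api_key="dummy")

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Write a haiku about middleware."}],
)
print(response.choices[0].message.content)
```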
Constants in `main.py`:

```python
GLM_API_URL = "https://api.z.ai/api/coding/paas/v4"
PORT = 4242
API_KEY = "dummy"
```

```
LocalCode/
├── main.py           # Main FastAPI server
├── pyproject.toml    # Poetry configuration
├── poetry.lock       # Dependency lockfile
└── ARCHITECTURE.md   # This file
```
OpenAI-compatible chat completions endpoint.
Supported Parameters:
- `model` - Model identifier (default: "glm-4.7")
- `messages` - Array of message objects
- `stream` - Enable streaming (default: false)
- `temperature` - Sampling temperature
- `max_tokens` - Maximum tokens to generate
- `top_p` - Nucleus sampling parameter
- `tools` - Array of tool definitions (passed through)
Response Format:
- Non-streaming: Full JSON response
- Streaming: SSE stream with `data:` prefix
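A sketch of consuming the streaming variant of this endpoint: POST with `"stream": true` and read Server-Sent Events until the `[DONE]` marker. The URL and port come from this document; the parsing is deliberately minimal.

```python
import json

import httpx

payload = {
    "model": "glm-4.7",
    "stream": True,
    "messages": [{"role": "user", "content": "Say hello."}],
}

with httpx.stream(
    "POST", "http://localhost:4242/v1/chat/completions", json=payload, timeout=None
) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # Each SSE chunk carries an incremental delta in OpenAI format.
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```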
Health check endpoint.
Response:
```json
{
  "status": "ok",
  "provider": "LocalCode",
  "model": "GLM-4.7"
}
```

Core:
- `fastapi` - Web framework
- `httpx` - Async HTTP client
- `uvicorn[standard]` - ASGI server
Development:
- `pytest` - Testing framework
- `pytest-asyncio` - Async test support
- KV cache state persistence
- Session state hot-swapping
- Request hash-based caching
- llama.cpp backend integration
- Speculative decoding
- Grammar constraints
- Native function calling
- Multi-model intelligent routing
- Control vector runtime steering
- Multimodal support
- Reranking optimization
- `middleware.md` - llama.cpp & OpenCode advanced integration research
- `README.md` - Installation and usage guide
- `pyproject.toml` - Package configuration