LocalCode Middleware Architecture

Overview

LocalCode is a Python middleware server that sits between OpenCode (the client) and various LLM backends (currently GLM 4.7). It provides request interception, intelligent caching, and request transformation to speed up local LLM execution.

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  OpenCode   │────▶│  LocalCode   │────▶│   Backend    │
│   Client    │     │  Middleware  │     │  (GLM 4.7)   │
└─────────────┘     └──────────────┘     └──────────────┘

End Goals

Phase 1: Request Interception & Logging ✅

  • OpenAI-compatible API (/v1/chat/completions)
  • Request/response logging with pretty printing
  • Streaming support (native async generators)
  • Tool call detection and markers
  • Health check endpoint

Phase 2: State Management (Planned)

  • KV cache hashing based on system prompt + message history
  • State save/load to disk (bypass prefill latency)
  • LRU cache for session states
  • Cache key computation: SHA256(system_prompt + messages + tools)
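
A minimal sketch of the planned cache key computation (the function name and serialization choices below are illustrative, not the final implementation):

import hashlib
import json
from typing import Any, Dict, List, Optional

def compute_cache_key(
    system_prompt: str,
    messages: List[Dict[str, Any]],
    tools: Optional[List[Dict[str, Any]]] = None,
) -> str:
    """SHA256 over system prompt + message history + tool definitions."""
    payload = json.dumps(
        {"system": system_prompt, "messages": messages, "tools": tools or []},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()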

Phase 3: llama.cpp Integration (Planned)

  • Speculative decoding orchestration (draft/target model pairs)
  • KV cache state management (slot hot-swap)
  • Continuous batching for multi-user throughput
  • Flash attention optimization
  • Grammar constraint enforcement (GBNF)
  • Native function calling support
  • FIM (Fill-In-The-Middle) completions
  • Control vector runtime steering
  • Multimodal support (vision projectors)
  • Reranking for codebase search

Phase 4: Intelligent Routing (Planned)

  • Multi-model routing (coder vs. general vs. vision)
  • Request type classification
  • Dynamic model selection based on request content
  • Adaptive caching strategies
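
One plausible shape for the request classifier, as a sketch; the rules and model names here are placeholders, not decided behavior:

from typing import Any, Dict

def select_model(body: Dict[str, Any]) -> str:
    """Pick a backend model from coarse request features (placeholder rules)."""
    messages = body.get("messages", [])
    # Vision: any message carrying image content parts
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and any(part.get("type") == "image_url" for part in content):
            return "vision-model"    # placeholder name
    # Coder: code fences in the text or tool definitions present
    text = " ".join(m["content"] for m in messages if isinstance(m.get("content"), str))
    if "```" in text or body.get("tools"):
        return "coder-model"         # placeholder name
    return "general-model"           # placeholder name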

Current Implementation Status

✅ Completed Features

Core Server (main.py)

  • FastAPI server listening on port 4242
  • OpenAI-compatible /v1/chat/completions endpoint
  • /health health check endpoint
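
A condensed sketch of how such a server is shaped (handler bodies are placeholders here; the actual main.py carries the full logging, transformation, and forwarding logic):

import uvicorn
from fastapi import FastAPI, Request

PORT = 4242  # see the Configuration section below

app = FastAPI(title="LocalCode")

@app.get("/health")
async def health() -> dict:
    # Static payload; see the /health response example below
    return {"status": "ok", "provider": "LocalCode", "model": "GLM-4.7"}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Real server: log_request(body), transform_request_to_glm(body),
    # forward to the backend, then log_response(...) and return or stream.
    raise NotImplementedError  # placeholder in this sketch

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT)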

Request Logging

  • log_request() - Pretty prints incoming requests
  • Model, stream status, and message count
  • Tool definitions logging
  • Message role detection (system, user, assistant, tool)
  • Content previewing (truncated at 100-150 chars)
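
A sketch of the kind of request logging and preview truncation described above (the helper names and cutoff value are illustrative):

from typing import Any, Dict

PREVIEW_LEN = 150  # illustrative; the real logger truncates at 100-150 chars

def preview(text: str, limit: int = PREVIEW_LEN) -> str:
    """Shorten long message content for log output."""
    return text if len(text) <= limit else text[:limit] + "..."

def log_request(body: Dict[str, Any]) -> None:
    print(f"model={body.get('model')} stream={body.get('stream', False)} "
          f"messages={len(body.get('messages', []))}")
    for msg in body.get("messages", []):
        content = msg.get("content") or ""
        if isinstance(content, list):  # multimodal-style content parts
            content = " ".join(part.get("text", "") for part in content)
        print(f"  [{msg.get('role')}] {preview(content)}")
    if body.get("tools"):
        # OpenAI-format tool definitions
        names = [t.get("function", {}).get("name") for t in body["tools"]]
        print(f"  tools: {names}")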

Response Logging

  • log_response() - Pretty prints GLM responses
  • Content preview, tool calls, finish reason
  • Token usage tracking (prompt, completion, total)

Tool Detection

  • has_tool_calls() - Detects tools in request
  • transform_request_to_glm() - Passes through tool definitions
  • Tool call markers: [Tool Request], [Tool Call], [Tool Call Complete]
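
Roughly what detection and pass-through can look like (a sketch; the real functions may differ in detail):

from typing import Any, Dict

def has_tool_calls(body: Dict[str, Any]) -> bool:
    """A request is tool-enabled if it carries any tool definitions."""
    return bool(body.get("tools"))

def transform_request_to_glm(body: Dict[str, Any]) -> Dict[str, Any]:
    """Pass the OpenAI-format request through; tool definitions are forwarded untouched."""
    glm_request = dict(body)
    glm_request.setdefault("model", "glm-4.7")
    return glm_request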

Streaming

  • stream_generator() - Native Python async generators
  • SSE (Server-Sent Events) format
  • data: prefix and [DONE] marker handling
  • Chunk-by-chunk logging
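
A sketch of the SSE forwarding path using httpx (GLM_API_URL and API_KEY are taken from the Configuration section; the exact upstream path is an assumption, and the real generator also logs each chunk):

from typing import Any, AsyncGenerator, Dict

import httpx

GLM_API_URL = "https://api.z.ai/api/coding/paas/v4"
API_KEY = "dummy"

async def stream_generator(body: Dict[str, Any]) -> AsyncGenerator[str, None]:
    """Forward a streaming chat completion and re-emit SSE lines to the client."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", f"{GLM_API_URL}/chat/completions", json=body, headers=headers
        ) as upstream:
            async for line in upstream.aiter_lines():
                if not line.startswith("data: "):
                    continue
                yield line + "\n\n"  # keep the SSE "data: " prefix
                if line.strip() == "data: [DONE]":
                    return

On the FastAPI side, a generator like this plugs into a StreamingResponse with media_type="text/event-stream".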

Functional Design

  • Pure functions (no OOP)
  • Type hints with Dict[str, Any], AsyncGenerator
  • Separated concerns: logging, transformation, forwarding

🚧 In Progress

None - focusing on Phase 1 completion.

📋 Next Steps

Phase 2: State Management

  1. Add SHA256 hashing function for cache keys
  2. Implement cache directory structure
  3. Add save_state() and load_state() functions
  4. Implement LRU eviction policy
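
A sketch of the disk-backed state store that steps 2-4 describe (the cache directory, entry limit, and function names are placeholders):

from collections import OrderedDict
from pathlib import Path
from typing import Optional

CACHE_DIR = Path("cache/states")   # illustrative location
MAX_ENTRIES = 32                   # illustrative LRU bound

_index: "OrderedDict[str, Path]" = OrderedDict()

def save_state(cache_key: str, state: bytes) -> None:
    """Persist a serialized session state and evict the least recently used entry."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{cache_key}.bin"
    path.write_bytes(state)
    _index[cache_key] = path
    _index.move_to_end(cache_key)
    while len(_index) > MAX_ENTRIES:
        _, old_path = _index.popitem(last=False)
        old_path.unlink(missing_ok=True)

def load_state(cache_key: str) -> Optional[bytes]:
    """Return a previously saved state, or None on a cache miss."""
    path = _index.get(cache_key, CACHE_DIR / f"{cache_key}.bin")
    if path.exists():
        _index[cache_key] = path
        _index.move_to_end(cache_key)
        return path.read_bytes()
    return None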

Phase 3: llama.cpp Integration

  1. Add llama.cpp backend support alongside GLM 4.7
  2. Implement slot management API
  3. Add grammar constraint enforcement
  4. Implement speculative decoding orchestration
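
If the llama.cpp server's slot save/restore endpoints are used for step 2, a call could look roughly like this (this assumes the server exposes slot actions and was started with a slot save path; verify the exact contract against the llama.cpp server documentation):

import httpx

LLAMA_SERVER = "http://localhost:8080"  # assumed local llama.cpp server address

async def save_slot(slot_id: int, filename: str) -> dict:
    """Ask the llama.cpp server to persist a slot's KV cache to a file on disk."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{LLAMA_SERVER}/slots/{slot_id}",
            params={"action": "save"},
            json={"filename": filename},
        )
        resp.raise_for_status()
        return resp.json()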

Architecture Decisions

Functional Programming over OOP

  • Pure functions instead of classes
  • Explicit data flow (no hidden state)
  • Testable in isolation
  • Easier to reason about

Async/Await Pattern

  • Native Python async generators for streaming
  • Non-blocking I/O for multiple concurrent requests
  • AsyncGenerator[str, None] for SSE streams

OpenAI-Compatible API

  • Standard OpenAI request/response format
  • Allows drop-in replacement for existing providers
  • Tool calling support via OpenAI format
  • Streaming via SSE

Configuration

Constants in main.py:

GLM_API_URL = "https://api.z.ai/api/coding/paas/v4"
PORT = 4242
API_KEY = "dummy"

File Structure

LocalCode/
├── main.py              # Main FastAPI server
├── pyproject.toml       # Poetry configuration
├── poetry.lock          # Dependency lockfile
└── ARCHITECTURE.md      # This file

API Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint.

Supported Parameters:

  • model - Model identifier (default: "glm-4.7")
  • messages - Array of message objects
  • stream - Enable streaming (default: false)
  • temperature - Sampling temperature
  • max_tokens - Maximum tokens to generate
  • top_p - Nucleus sampling parameter
  • tools - Array of tool definitions (passed through)

Response Format:

  • Non-streaming: Full JSON response
  • Streaming: SSE stream with data: prefix
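
For reference, a non-streaming call using the parameters above might look like this (values are illustrative):

import httpx

request_body = {
    "model": "glm-4.7",
    "stream": False,
    "temperature": 0.2,
    "max_tokens": 512,
    "top_p": 0.9,
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a function that reverses a string."},
    ],
}

resp = httpx.post("http://localhost:4242/v1/chat/completions", json=request_body, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])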

GET /health

Health check endpoint.

Response:

{
  "status": "ok",
  "provider": "LocalCode",
  "model": "GLM-4.7"
}

Dependencies

Core:

  • fastapi - Web framework
  • httpx - Async HTTP client
  • uvicorn[standard] - ASGI server

Development:

  • pytest - Testing framework
  • pytest-asyncio - Async test support
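
A minimal async test along these lines could exercise the health endpoint (this assumes main.py exposes the FastAPI instance as app; adjust the import to the real module layout):

import pytest
from httpx import ASGITransport, AsyncClient

from main import app  # assumption: main.py exposes the FastAPI instance as `app`

@pytest.mark.asyncio
async def test_health() -> None:
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.get("/health")
    assert resp.status_code == 200
    assert resp.json()["status"] == "ok"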

Future Roadmap

Short Term (Phase 2)

  • KV cache state persistence
  • Session state hot-swapping
  • Request hash-based caching

Medium Term (Phase 3)

  • llama.cpp backend integration
  • Speculative decoding
  • Grammar constraints
  • Native function calling

Long Term (Phase 4)

  • Multi-model intelligent routing
  • Control vector runtime steering
  • Multimodal support
  • Reranking optimization

Related Documentation

  • middleware.md - llama.cpp & OpenCode advanced integration research
  • README.md - Installation and usage guide
  • pyproject.toml - Package configuration