Expose one or more pydantic_ai.Agent instances as fake OpenAI-compatible chat models.
It is designed so an OpenAI-compatible client can talk to what looks like a normal model, while the backend is actually a PydanticAI agent that can:
- use its own server-side PydanticAI tools
- surface client-provided OpenAI tools through PydanticAI deferred tools
- honor OpenAI-style structured output via response_format
- expose multiple fake model IDs from one app
- keep one long-lived agent instance per fake model for the whole app lifetime
Install fakellm directly from GitHub. The fakellm name on PyPI is already
taken by a different package, so do not install this project from PyPI.
For the CLI, install it as a uv-managed tool with Python 3.14:
uv tool install --python 3.14 git+https://github.com/idiap/FakeLLM.git
To use fakellm as a dependency in another uv project, add the Git source explicitly:
uv add "fakellm @ git+https://github.com/idiap/FakeLLM.git"
Start the server by pointing the CLI at an agent in module:attribute form:
fakellm mypackage.my_agent:agent --host 127.0.0.1 --port 8000
You can also point it at a FakeModels registry:
fakellm examples.multi_model_agents:MODELS --host 127.0.0.1 --port 8000
Use --prefix if you want a different API base path:
fakellm mypackage.my_agent:agent --host 127.0.0.1 --port 8000 --prefix /proxy/custom/v1
Protect the OpenAI-compatible routes with API keys by pointing FAKELLM_CONFIG
at a JSON config file:
{
  "api_keys": [
    {
      "name": "crush-local",
      "key": "replace-with-a-secret",
      "model_id": "fake-pydanticai"
    }
  ]
}
FAKELLM_CONFIG=./config.json fakellm mypackage.my_agent:agent
You can also pass the path directly:
fakellm mypackage.my_agent:agent --config ./config.json
When API keys are configured, /v1/models and /v1/chat/completions require
Authorization: Bearer <key> or X-API-Key: <key>. Each key is scoped to its
configured model_id, and fakellm logs requests with the associated name.
Omit model_id to let one key access every model in that entrypoint. As a
shortcut, set FAKELLM_API_KEY to protect the whole entrypoint with a single
shared key. /health remains public for liveness checks.
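The scoping rule described above can be sketched in a few lines of plain Python. This is an illustration of the documented behavior, not fakellm's actual implementation: KeyEntry, extract_key, and authorize are hypothetical names, and the constant-time comparison is an assumption about good practice rather than something the project states.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeyEntry:
    """Hypothetical stand-in for one entry of the api_keys config list."""

    name: str
    key: str
    model_id: Optional[str] = None  # None: the key may use every model


def extract_key(headers: dict[str, str]) -> Optional[str]:
    """Read the key from an Authorization bearer header or from X-API-Key."""
    lowered = {k.lower(): v for k, v in headers.items()}
    auth = lowered.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[len("bearer "):].strip()
    return lowered.get("x-api-key")


def authorize(
    keys: list[KeyEntry], headers: dict[str, str], model_id: str
) -> Optional[str]:
    """Return the name of the matching key, or None if access is denied."""
    presented = extract_key(headers)
    if presented is None:
        return None
    for entry in keys:
        same_key = hmac.compare_digest(entry.key.encode(), presented.encode())
        # A key without model_id is unscoped; otherwise the model must match.
        if same_key and entry.model_id in (None, model_id):
            return entry.name
    return None
```

With a key scoped to fake-pydanticai, a request for that model is accepted and the key's name is available for logging, while a request for any other model is rejected.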
The same app also exposes an OpenAI Responses-compatible endpoint at
/v1/responses. It uses the same hosted fake model IDs, hidden server-side
PydanticAI tools, client-provided function tools, structured output handling,
API-key auth, and request-context deps as /v1/chat/completions:
curl http://127.0.0.1:8000/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{"model":"fake-pydanticai","input":"Say hello from Responses."}'
Client-visible tools use the Responses function-tool shape. fakellm remains
stateless, so include the prior function_call item and the matching
function_call_output item when returning tool results:
{
  "model": "fake-pydanticai",
  "input": [
    {"type": "message", "role": "user", "content": "Check Paris weather"},
    {
      "type": "function_call",
      "call_id": "weather-call",
      "name": "get_weather",
      "arguments": "{\"city\":\"Paris\"}"
    },
    {
      "type": "function_call_output",
      "call_id": "weather-call",
      "output": {"temperature_c": 21}
    }
  ],
  "tools": [
    {
      "type": "function",
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  ]
}
Responses streaming supports optional reasoning/progress events with
responses_reasoning on create_app():
- "disabled": no reasoning events, the default
- "agent": forward streamed PydanticAI ThinkingPart deltas from the backend agent
- "custom": let AgenticWorkflow code call await request.reasoning.emit(...)
- "premade": loop configured text chunks concurrently while the request runs
from fakellm import AgenticWorkflow, ResponsesReasoningConfig, WorkflowRequest, create_app
async def workflow(request: WorkflowRequest) -> str:
    if request.reasoning is not None:
        await request.reasoning.emit("checking context ")
    return "done"
app = create_app(
    AgenticWorkflow(workflow),
    model_name="responses-workflow",
    responses_reasoning=ResponsesReasoningConfig(mode="custom"),
)
You can also turn trusted request headers into PydanticAI deps without putting those values in the chat messages or tool schemas. This is useful behind an internal proxy that authenticates with a fakellm API key and forwards a user id:
from dataclasses import dataclass
from fastapi import HTTPException
from pydantic_ai import Agent, RunContext
from fakellm import RequestContext, create_app
@dataclass(frozen=True)
class UserDeps:
    user_id: str
def request_context_factory(context: RequestContext) -> UserDeps:
    user_id = context.header("x-user-id")
    if user_id is None:
        raise HTTPException(status_code=400, detail="Missing X-User-ID header.")
    return UserDeps(user_id=user_id)
agent = Agent("openai:gpt-4.1-mini", deps_type=UserDeps)
@agent.tool
async def lookup_internal_profile(ctx: RunContext[UserDeps]) -> dict[str, str]:
    return {"user_id": ctx.deps.user_id}
app = create_app(
    agent,
    model_name="internal-assistant",
    api_keys=[{"name": "internal-proxy", "key": "secret", "model_id": "internal-assistant"}],
    request_context_factory=request_context_factory,
)
RequestContext also exposes the parsed chat completion request. Use
context.parameter("model") for any top-level request field and
context.extra_parameter("user") for unmodeled/custom fields that fakellm does
not otherwise interpret, such as OpenAI's user value or application-specific
metadata:
def request_context_factory(context: RequestContext) -> UserDeps:
    user_id = context.extra_parameter("user")
    if not isinstance(user_id, str):
        raise HTTPException(status_code=400, detail="Missing user parameter.")
    return UserDeps(user_id=user_id)
For bot-style deployments, ContextPipeline turns those request values into a
typed deps object declaratively:
from dataclasses import dataclass
from fakellm import ContextPipeline
@dataclass(frozen=True)
class BotDeps:
    mattermost_user_id: str
    mattermost_username: str | None
context = (
    ContextPipeline(BotDeps)
    .from_extra("safety_identifier", as_="mattermost_user_id")
    .resolve(
        "mattermost_username",
        lookup_mattermost_username,
        from_="mattermost_user_id",
        optional=True,
    )
    .cache_per_request()
)
WorkflowRequest has convenience helpers for common bot workflows:
latest_user_text(), conversation_text(), auth_name, context, and
emit_progress(). For Responses streams, async with request.span(...): emits
progress text and records elapsed time on the span object.
The lower-level adapter APIs remain available, and fakellm also ships a small bot-oriented layer for common multi-agent server concerns:
from fakellm import (
    ApiKeyAuth,
    ContextPipeline,
    DependencyErrorPolicy,
    FakeModels,
    ManagedMCP,
    OpenAICompatibleBackend,
    Route,
    RouterWorkflow,
    create_bot_app,
)
backend = OpenAICompatibleBackend.from_env(
    endpoint="LLM_ENDPOINT",
    model="LLM_MODEL",
    api_key="LLM_KEY",
)
root = RouterWorkflow(
    model=backend,
    routes=[
        Route("biss", build_biss_agent, "BISS support and documentation"),
        Route("rooms", build_rooms_agent, "room search and booking"),
    ],
)
rooms_mcp = ManagedMCP.http(
    id="rooms",
    command=["python", "-m", "rooms.rooms_mcp"],
    url_env="ROOMS_MCP_URL",
)
app = create_bot_app(
    FakeModels()
    .add("root", root, lifecycle="per_request", dependency_policy="degrade")
    .add("rooms", build_rooms_agent, lifecycle="per_request", dependency_policy="degrade"),
    context=context,
    auth=ApiKeyAuth.from_env(),
    managed_mcp=[rooms_mcp],
    dependency_errors=DependencyErrorPolicy(
        message="A required upstream service is temporarily unavailable.",
    ),
    progress=True,
)
FakeModels.add() and FakeModel support three lifecycle policies:
- startup: start once during FastAPI lifespan, the default
- lazy: start on first use and reuse
- per_request: build/enter/close for each request, useful for request-scoped toolsets
Set dependency_policy="degrade" to turn connection-like dependency failures
into an assistant fallback response instead of failing the whole app startup or
request. /ready is available on every create_app() app and includes model
lifecycle records plus any custom DependencyHealth checks passed through
health=[...].
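To make the difference between the three lifecycle policies concrete, here is a toy holder that mimics when a factory would be called under each one. It is a sketch of the semantics described above, not fakellm code: LifecycleSlot and its methods are hypothetical names, and real per_request handling also enters and closes the instance around each request, which this sketch leaves out.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LifecycleSlot(Generic[T]):
    """Illustrative holder for one fake model; not fakellm's implementation."""

    def __init__(self, factory: Callable[[], T], lifecycle: str = "startup") -> None:
        if lifecycle not in {"startup", "lazy", "per_request"}:
            raise ValueError(f"unknown lifecycle: {lifecycle!r}")
        self._factory = factory
        self._lifecycle = lifecycle
        self._instance: Optional[T] = None

    def on_app_startup(self) -> None:
        # Only "startup" builds eagerly, while the app lifespan is starting.
        if self._lifecycle == "startup":
            self._instance = self._factory()

    def get_for_request(self) -> T:
        if self._lifecycle == "per_request":
            # A fresh instance for every request, never cached.
            return self._factory()
        if self._instance is None:
            # "lazy" builds here, on first use, and reuses the result afterwards.
            self._instance = self._factory()
        return self._instance
```

A startup slot calls its factory once before any request arrives, a lazy slot calls it once on the first request, and a per_request slot calls it on every request.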
For MCP-heavy bots, MCPToolset, ManagedMCP, EnvSecret, and ContextValue
cover the common pieces: external-or-managed MCP URL configuration, app lifespan
startup/shutdown, context-aware headers, env-driven timeouts, and readiness
checks. ToolCallPolicy provides a small middleware helper for policies like
requiring a resolved username or filling a missing description from title.
DependencyErrorPolicy can convert uncaught connection-like failures at the app
boundary into OpenAI-compatible JSON error responses.
For a complete configured deployment, use fakellm deploy with a YAML file:
fakellm deploy --config ./myconfig.yaml
host: 127.0.0.1
port: 8000
prefix: /v1
api_keys:
  - name: local-client
    key: replace-with-a-secret
    model_id: assistant
    # Or omit model_id to allow the key to use every model in this entrypoint.
    # You can also set FAKELLM_API_KEY instead of writing an api_keys block.
backends:
  local-llm:
    type: openai-compatible
    base_url: http://127.0.0.1:11434/v1
    api_key: ${LLM_KEY:-}
    model: llama3.1
mcps:
  filesystem:
    transport: stdio
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "."]
  remote-search:
    transport: http
    url: https://mcp.example.com/mcp
    headers:
      Authorization: Bearer ${MCP_TOKEN}
models:
  assistant:
    backend: local-llm
    instructions: You are a concise assistant with access to configured MCP tools.
    code_mode: true
    mcps: [filesystem, remote-search]
backends, mcps, and models can be written either as mappings, as shown
above, or as lists with name/model_id fields. MCP and FastMCP support are
included in fakellm's main dependencies.
Use transport: http or transport: streamable-http for streamable HTTP MCP
servers, and transport: sse for older SSE MCP servers.
Set code_mode: true on a model to enable PydanticAI Harness CodeMode for its
configured tools.
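The deployment YAML above uses shell-style placeholders such as ${LLM_KEY:-} and ${MCP_TOKEN}. The sketch below shows how that syntax is conventionally read: ${NAME} takes the variable's value, and ${NAME:-default} falls back to the default when the variable is unset or empty. The expand function is a hypothetical helper, and raising an error for a missing variable without a default is an assumption; fakellm's own parser may behave differently.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(text: str, env: Mapping[str, str]) -> str:
    """Expand shell-style placeholders in text using the given environment."""

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is not None:
            # ":-" semantics: use the default when unset or empty.
            return value if value else default
        if value is None:
            # Assumption: a missing variable without a default is an error.
            raise KeyError(f"environment variable {name} is not set")
        return value

    return _PLACEHOLDER.sub(substitute, text)
```

Under this reading, api_key: ${LLM_KEY:-} resolves to an empty string when LLM_KEY is not set, which suits local backends that need no key, while Authorization: Bearer ${MCP_TOKEN} requires MCP_TOKEN to be present.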
You can also expose multiple prefixes from one process with entrypoints:
backends:
  local-llm:
    base_url: http://127.0.0.1:11434/v1
    model: llama3.1
entrypoints:
  /alpha/v1:
    api_keys:
      - name: alpha-client
        key: alpha-secret
        model_id: alpha
    models:
      alpha:
        backend: local-llm
  /beta/v1:
    api_keys:
      - name: beta-client
        key: beta-secret
        model_id: beta
    models:
      beta:
        backend: local-llm
The default test suite uses in-memory model doubles and never needs a live LLM.
To smoke-test fakellm against a real OpenAI-compatible backend, create a local
.env file and opt in explicitly:
FAKELLM_RUN_REAL_LLM_TESTS=1
LLM_ENDPOINT=http://127.0.0.1:11434/v1
LLM_MODEL=gemma3:4b
# LLM_KEY=replace-with-a-secret-if-your-backend-needs-one
Then run only the real-backend tests:
uv run pytest -m real_llm tests/test_real_backend.py
Those tests stay skipped unless FAKELLM_RUN_REAL_LLM_TESTS=1 is present. When
enabled, they cover:
- direct /v1/chat/completions and /v1/responses calls through a fakellm app;
- image and file content parts forwarded to a real backend model through both APIs;
- the multi-model OpenAI-compatible smoke helper with hidden PydanticAI tools;
- the examples.multi_model_agents outer-agent flow;
- the examples.dual_subagent_judge_workflow fan-out and judge workflow.
Because these tests depend on a live model following tool-use instructions, they are intended as local release smokes rather than CI defaults.
Build the image from the repository root:
docker build -t fakellm .
The container defaults to fakellm deploy --config /config/fakellm.yaml and
listens on port 8000. Mount your deployment YAML at that path and publish the
port:
docker run --rm \
  -p 8000:8000 \
  -v "$PWD/myconfig.yaml:/config/fakellm.yaml:ro" \
  fakellm
Then check that the service is up:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
If your deploy config uses environment variables such as ${LLM_KEY}, pass
them to Docker:
docker run --rm \
  -p 8000:8000 \
  -v "$PWD/myconfig.yaml:/config/fakellm.yaml:ro" \
  -e LLM_KEY \
  fakellm
You can also override the default command and run the single-agent CLI form:
docker run --rm -p 8000:8000 fakellm \
  examples.multi_model_agents:MODELS \
  --host 0.0.0.0 \
  --port 8000
Or embed it directly:
from fastapi import FastAPI
from fakellm import ApiKey, create_app
from mypackage.my_agent import agent
app: FastAPI = create_app(
    agent,
    model_name="fake-pydanticai",
    prefix="/proxy/custom/v1",
    api_keys=[
        ApiKey(
            name="crush-local",
            key="replace-with-a-secret",
            model_id="fake-pydanticai",
        )
    ],
)
For tests and examples that should exercise the app in memory with FastAPI lifespan enabled:
import asyncio
from fakellm import create_app, live_client
from mypackage.my_agent import agent
app = create_app(agent, model_name="fake-pydanticai")
async def main() -> None:
    async with live_client(app) as client:
        response = await client.get("/v1/models")
asyncio.run(main())
Use live_client() when you want to call the fakellm app without starting a real
HTTP server, for example in tests or in examples where an outer agent talks to
the proxy entirely in-process.
Why it exists:
- it creates an in-memory httpx.AsyncClient for the FastAPI app
- it explicitly runs FastAPI lifespan startup/shutdown, so app-scoped state is initialized and cleaned up correctly
- that matters for fakellm because fake model registries and cached long-lived agents are created for the app lifetime and then released on shutdown
When you need it:
- use live_client(app) for tests
- use it for local examples that wire an OpenAIProvider(http_client=...) directly to the in-memory fakellm app
- do not use it when fakellm is already running behind a real HTTP endpoint; in that case use a normal httpx.AsyncClient or any OpenAI-compatible client against the server URL
User messages can include OpenAI-compatible content parts for text, images, and
files. Image parts use the Chat Completions image_url shape, including remote
URLs and base64 data URLs:
{
  "role": "user",
  "content": [
    {"type": "text", "text": "What text is in this image?"},
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/png;base64,...",
        "detail": "auto"
      }
    }
  ]
}
File parts can pass uploaded OpenAI file IDs or inline base64 file data. Inline
files are forwarded to the inner PydanticAI agent as BinaryContent; file IDs
are forwarded as OpenAI UploadedFile references.
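For inline files, the file_data value is the base64 encoding of the raw file bytes. A minimal sketch of assembling such a message in Python, using only the standard library; inline_file_part is a hypothetical helper name, not part of fakellm:

```python
import base64
import json


def inline_file_part(filename: str, data: bytes) -> dict:
    """Build a Chat Completions file content part with inline base64 data."""
    return {
        "type": "file",
        "file": {
            "filename": filename,
            "file_data": base64.b64encode(data).decode("ascii"),
        },
    }


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize this file."},
        inline_file_part("notes.txt", b"The file contents."),
    ],
}

# Print the message as it would appear inside the request's messages list.
print(json.dumps(message, indent=2))
```

The printed message has the same shape as the JSON example for file parts in this section.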
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Summarize this file."},
    {
      "type": "file",
      "file": {
        "filename": "notes.txt",
        "file_data": "VGhlIGZpbGUgY29udGVudHMu"
      }
    }
  ]
}
For multiple fake models:
from fastapi import FastAPI
from fakellm import FakeModels, create_app
from mypackage.multi_models import build_code_agent, build_weather_agent
app: FastAPI = create_app(
    FakeModels()
    .add("demo-weather", build_weather_agent)
    .add("demo-code", build_code_agent),
    prefix="/proxy/custom/v1",
)
FakeModels.add() accepts either:
- an Agent
- a factory returning an Agent
- an AgenticWorkflow
- a factory returning an AgenticWorkflow
If you already have a plain mapping from fake model ID to an Agent or factory, that still works.
You can also host a custom workflow directly, without making the backend itself a
pydantic_ai.Agent. Wrap an async function or object with AgenticWorkflow and
return either a final value or a WorkflowResponse with client-visible tool
calls:
from fakellm import (
    AgenticWorkflow,
    WorkflowRequest,
    WorkflowResponse,
    WorkflowToolCall,
    create_app,
)
async def workflow(request: WorkflowRequest) -> WorkflowResponse | str:
    if request.tool_results:
        weather = request.tool_results["weather-call"]
        return f"The client says Paris is {weather['temperature_c']}C."
    return WorkflowResponse(
        tool_calls=[
            WorkflowToolCall(
                name="get_weather",
                arguments={"city": "Paris"},
                id="weather-call",
            )
        ]
    )
app = create_app(
    AgenticWorkflow(workflow),
    model_name="workflow-assistant",
)
WorkflowRequest includes the original OpenAI-compatible messages, normalized
PydanticAI message history for callers that want it, client-provided tool
results, externally supplied tool definitions, the resolved output type, and any
deps returned by request_context_factory.
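From the client's side, answering a tool call over /v1/chat/completions means resending the conversation with the assistant's tool call and a matching tool message appended. The sketch below builds that message list in the standard Chat Completions shape. with_tool_result is a hypothetical helper, and JSON-encoding the result so that the workflow sees a parsed object is an assumption based on the workflow example above.

```python
import json
from typing import Any


def with_tool_result(
    history: list[dict[str, Any]],
    tool_call_id: str,
    name: str,
    arguments: dict[str, Any],
    result: Any,
) -> list[dict[str, Any]]:
    """Return history plus the assistant tool call and the client's result."""
    return [
        *history,
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": tool_call_id,
                    "type": "function",
                    # Chat Completions carries arguments as a JSON string.
                    "function": {"name": name, "arguments": json.dumps(arguments)},
                }
            ],
        },
        # The tool message is linked to the call through tool_call_id.
        {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)},
    ]


messages = with_tool_result(
    [{"role": "user", "content": "Check Paris weather"}],
    tool_call_id="weather-call",
    name="get_weather",
    arguments={"city": "Paris"},
    result={"temperature_c": 21},
)
```

Sending these messages as the next request lets the workflow find its result under the weather-call id, since nothing is stored between requests.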
examples/multi_model_agents.py shows the recommended multi-model DX with
FakeModels(), then runs two outer agents in parallel against the two fake
model IDs:
MODELS = FakeModels().add("demo-weather", build_weather_agent).add("demo-code", build_code_agent)
That example uses standard PydanticAI agents and normal @agent.tool_plain tools. Configure LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY before running it.
examples/builtin_tools_agent.py shows the same inner/outer fakellm shape as the CodeMode example, using a hidden server-side Hacker News tool that fetches the raw https://news.ycombinator.com/news DOM instead of the Firebase API. It auto-loads LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY from .env, then asks the outer agent for the current top three Hacker News articles through the fake OpenAI-compatible model.
examples/uvicorn_hacker_news_agent.py exposes a Hacker News-capable agent as a
real OpenAI-compatible HTTP server for clients like Crush. The hidden agent uses
the Hacker News Firebase API to fetch the current top stories, while its backing
LLM comes from LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY. The example
auto-loads those values from a local .env file if present.
LLM_MODEL=your-backend-model
LLM_ENDPOINT=https://example.com/v1
LLM_KEY=...
uv run uvicorn examples.uvicorn_hacker_news_agent:app --host 127.0.0.1 --port 8000
For a local Ollama-compatible backend, omit LLM_KEY and point LLM_ENDPOINT
at your local /v1 base URL. The project-local .crush.json configures Crush to
use this uvicorn server as an openai-compat provider with model ID
fakellm-hacker-news. Once the server is running, start Crush from this repo and
ask for the top three Hacker News stories.
examples/uvicorn_hacker_news_mcp_agent.py exposes the same kind of server, but
the hidden Hacker News capability comes from an in-process FastMCP server attached
to the backend PydanticAI agent as a toolset. The OpenAI-compatible client still
only sees the fake model ID, not the MCP server or its tools.
uv run --with "pydantic-ai-slim[fastmcp]" \
  uvicorn examples.uvicorn_hacker_news_mcp_agent:app --host 127.0.0.1 --port 8000
The model ID for that MCP-backed example is fakellm-hacker-news-mcp.
If you want PydanticAI Harness CodeMode behind fakellm, see examples/code_mode_agent.py. It hosts an inner CodeMode-enabled agent as a fake OpenAI model, then runs an outer PydanticAI agent against that fake model. Per the official docs, CodeMode comes from pydantic-ai-harness and wraps your normal tools into a single run_code tool so the inner model can orchestrate multiple tool calls in Python. Run it with:
LLM_KEY=... \
  LLM_MODEL=chat \
  LLM_ENDPOINT=https://example.com/v1 \
  uv run python examples/code_mode_agent.py
The example also auto-loads these values from a local .env file, so uv run python examples/code_mode_agent.py works if .env defines LLM_KEY, LLM_MODEL, and LLM_ENDPOINT.
examples/openai_compatible_hidden_tool_smoke.py runs a real end-to-end smoke test against any OpenAI-compatible backend and verifies:
- multiple backend-backed inner PydanticAI agents
- a hidden server-side tool call
- the fakellm OpenAI-compatible adapter
- second-layer PydanticAI agents using the adapter as their model endpoint
- concurrent fake model hosting from one app
- a custom prefix
It accepts LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY from the environment, so it works with hosted backends and local Ollama-compatible endpoints.
LLM_KEY=... \
  LLM_MODEL=chat \
  LLM_ENDPOINT=https://example.com/v1 \
  uv run python examples/openai_compatible_hidden_tool_smoke.py \
  --proxy-model-names fake-backend-alpha fake-backend-beta \
  --prefix /proxy/custom/v1
For a local Ollama-compatible endpoint, you can omit LLM_KEY and point LLM_ENDPOINT at your local /v1 base URL.
examples/openai_responses_reasoning.py exercises the /v1/responses endpoint
entirely in memory. It shows a workflow that requests a client-visible function
call, receives a function_call_output, and streams custom reasoning/progress
events with ResponsesReasoningConfig(mode="custom").
uv run python examples/openai_responses_reasoning.py
