HTTP NER (Named Entity Recognition) service powered by LLMs with structured output.
- FastAPI + Python 3.12 + uv
- Client-defined entity labels per request (dynamic JSON Schema)
- Optional character offsets (
start,end) recovered via substring matching
cp .env.example .env
uv sync --extra dev
uv run uvicorn ner_service.main:app --host 0.0.0.0 --port 8000Request:
CONFIG_ID="$(curl -s -X POST http://localhost:8000/v1/configs \
-H 'Content-Type: application/json' \
-d '{
"labels": [
{"name": "PERSON", "description": "People, real or fictional"},
{"name": "LOCATION", "description": "Cities, countries, places"}
]
}' | jq -r .id)"
curl -s -X POST http://localhost:8000/v1/extract \
-H 'Content-Type: application/json' \
-d "{
\"text\": \"Tim Cook visited Berlin last week.\",
\"config_id\": \"${CONFIG_ID}\"
}" | jqResponse:
{
"data": {
"entities": [
{"text": "Tim Cook", "label": "PERSON"},
{"text": "Berlin", "label": "LOCATION"}
],
"model": "llama3.1-8b",
"provider": "cerebras"
},
"meta": {
"request_id": "0b4c6a80-4c4b-4cd0-9b1f-30b66b8e3e22",
"latency_ms": 824.6,
"attempts": 1,
"warnings": []
}
}GET /v1/health— liveness probe.GET /v1/ready— readiness probe; verifies service initialization and local config storage without making an LLM call.GET /v1/providers— current provider and model.GET /metrics— Prometheus metrics endpoint.POST /v1/configs— create a persisted NER config; returns{id, config}.GET /v1/configs— list configs.GET /v1/configs/{id}— get one config.PUT /v1/configs/{id}— replace one config.PATCH /v1/configs/{id}— partially update one config.DELETE /v1/configs/{id}— delete one config.POST /v1/extract— body:{text, config_id?, config?, prompt_payload?}. Exactly one ofconfig_idor inlineconfigis required. Returns{data, meta}.POST /v1/batch/extract— batch extraction for multiple items with per-item success/error envelopes.
NERConfig fields:
{
"labels": [{"name": "PERSON", "description": "People"}],
"model": "llama3.1-8b",
"require_offsets": false,
"case_sensitive": true,
"retries": 3,
"max_tokens": 1024,
"reasoning_effort": null,
"system_prompt": null,
"few_shot_examples": []
}Configs are stored in SQLite (configs.db) by default and survive service restarts. Create a config once and reuse config_id for high-volume extraction to avoid resending labels, prompt, and schema on every request. Inline configs are still supported for one-off calls.
Set require_offsets=true to recover start / end through substring matching. Set case_sensitive=false to match model surfaces case-insensitively; returned entity text uses the source text casing when a match is found.
system_prompt is a full prompt override template. Placeholders support {cfg.field} and {payload.field} with dotted access. cfg.schema is the compact JSON Schema generated from labels; payload is supplied per /extract call:
{
"text": "Tim Cook visited Berlin last week.",
"config_id": "CONFIG_ID",
"prompt_payload": {"number": 42}
}Example template:
follow schema strictly: {cfg.schema}; remember this number: {payload.number}
Use {{ and }} for literal braces.
Few-shot examples are sent as user/assistant examples before the extraction text:
{
"few_shot_examples": [
{
"text": "Ada Lovelace wrote notes.",
"entities": [{"text": "Ada Lovelace", "label": "PERSON"}]
}
]
}Authentication is intentionally out of scope. Add auth in the embedding application or at the gateway layer if a deployment needs it.
Environment variables (also read from .env):
| Variable | Default | Description |
|---|---|---|
CEREBRAS_API_KEY |
— | Required when NER_PROVIDER=cerebras. |
OPENAI_API_KEY |
— | Required when NER_PROVIDER=openai. |
OPENROUTER_API_KEY |
— | Required when NER_PROVIDER=openrouter. |
VLLM_API_KEY |
not-needed |
Bearer token for OpenAI-compatible local vLLM endpoints. |
NER_PROVIDER |
cerebras |
Provider id: cerebras, openai, openrouter, or vllm. |
NER_MODEL |
llama3.1-8b |
Model identifier passed to the provider. |
REQUEST_TIMEOUT_S |
30 |
Per-request upstream timeout. |
TRANSPORT_RETRIES |
2 |
SDK/network retries for upstream transport failures. |
MAX_TOKENS |
1024 |
Default max_completion_tokens passed to the provider. |
OTEL_ENDPOINT |
— | OTLP HTTP trace export endpoint, e.g. http://collector:4318/v1/traces. |
RATE_LIMIT_RPS |
100 |
Token-bucket refill rate for provider calls. |
RATE_LIMIT_BURST |
200 |
Token-bucket burst capacity for provider calls. |
PROVIDER_CONCURRENCY_LIMIT |
50 |
Max concurrent in-flight provider requests. |
MAX_TEXT_LENGTH |
32000 |
Maximum input text length accepted by the service. |
MAX_LABELS |
50 |
Maximum labels per NER config. |
MAX_SYSTEM_PROMPT_LENGTH |
20000 |
Maximum custom system_prompt length. |
MAX_LABEL_DESCRIPTION_LENGTH |
500 |
Maximum label description length. |
MAX_CONFIG_ID_LENGTH |
128 |
Maximum config id length accepted by API paths and payloads. |
CONFIG_DB_PATH |
configs.db |
SQLite config database path. |
CACHE_ENABLED |
true |
Enable in-process extraction result cache. |
CACHE_TTL_SECONDS |
600 |
Extraction result cache TTL. |
CACHE_MAX_SIZE |
10000 |
Maximum in-process cache entries. |
CIRCUIT_BREAKER_FAILURE_THRESHOLD |
5 |
Consecutive upstream failures before opening provider circuit. |
CIRCUIT_BREAKER_RECOVERY_TIMEOUT_S |
30 |
Seconds before trying a half-open provider call. |
CIRCUIT_BREAKER_HALF_OPEN_MAX_CALLS |
1 |
Max test calls while circuit is half-open. |
BATCH_CONCURRENCY |
10 |
Max concurrent items inside /v1/batch/extract. |
TOKEN_PRICING_JSON |
— | Optional JSON model pricing map for estimated cost metrics. |
NERConfig.retries controls model repair attempts when the provider returns invalid structured output. TRANSPORT_RETRIES controls SDK/network retries before the service receives a provider response.
Errors use a single envelope:
{
"error": {
"code": "validation_error",
"message": "request validation failed",
"details": {},
"request_id": "0b4c6a80-4c4b-4cd0-9b1f-30b66b8e3e22"
}
}uv sync --extra dev
uv run ruff check .
uv run ruff format --check .
uv run mypy src
uv run pytest -m "not integration"
CEREBRAS_API_KEY=... uv run pytest -m integrationTracing is initialized in-process. Do not use --reload when you need reliable OpenTelemetry spans; uvicorn reload mode is documented to break instrumentation.
CEREBRAS_API_KEY=... uv run --extra dev python scripts/benchmark_conll.py --model llama3.1-8b --concurrency 40
CEREBRAS_API_KEY=... uv run --extra dev python scripts/benchmark_conll.py --model llama3.1-8b --require-offsets --concurrency 40
CEREBRAS_API_KEY=... uv run --extra dev python scripts/benchmark_conll.py --model gpt-oss-120b --reasoning-effort low --concurrency 40
CEREBRAS_API_KEY=... uv run --extra dev python scripts/benchmark_conll.py --model gpt-oss-120b --reasoning-effort low --require-offsets --concurrency 40Default scoring is exact unique (label, text) pairs. --require-offsets scores exact (label, start, end) triples. The script creates one persisted NER config and then extracts by config_id. It prints micro-P / micro-R / micro-F1, errors, token usage, total time, throughput, and avg/min/max per-example latency.
The CoNLL-2003 test split is cached locally at data/benchmarks/conll2003-test.jsonl after the first load. Later runs reuse that file instead of downloading the dataset again.
Full CoNLL-2003 test baseline results (3453 examples, concurrency 40, max_tokens 1024, retries 3):
| Model | Mode | Reasoning | micro-P | micro-R | micro-F1 | TP | FP | FN | Errors | Total s | Avg s | Min s | Max s | Examples/s | Total tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
llama3.1-8b |
dictionary | N/A | 0.531 | 0.737 | 0.617 | 4115 | 3637 | 1469 | 2 | 68.163 | 0.772 | 0.350 | 5.083 | 50.658 | 1436014 |
llama3.1-8b |
offsets | N/A | 0.509 | 0.726 | 0.598 | 4098 | 3950 | 1550 | 1 | 114.363 | 1.314 | 0.342 | 63.124 | 30.193 | 1445445 |
gpt-oss-120b |
dictionary | low | 0.778 | 0.776 | 0.777 | 4335 | 1240 | 1249 | 0 | 204.446 | 2.343 | 0.298 | 66.743 | 16.890 | 1758947 |
gpt-oss-120b |
offsets | low | 0.771 | 0.774 | 0.773 | 4374 | 1297 | 1274 | 0 | 207.912 | 2.214 | 0.304 | 62.849 | 16.608 | 1815023 |
Cost to run:
llama3.1-8b: ±0.3$ per rungpt-oss-120b: ±0.69$ per run
docker build -t ner-service .
docker run --rm -p 8000:8000 --env-file .env ner-service
curl -fsS http://localhost:8000/v1/health
curl -fsS http://localhost:8000/v1/readydocker compose up -d --build
docker compose ps- NER service:
http://localhost:8000/v1/health - Prometheus:
http://localhost:9090 - Grafana:
http://localhost:3000(admin/admin) - Provisioned Grafana datasource UID:
prometheus - Provisioned dashboard UID:
ner-service-overview
The compose stack scrapes the service /metrics endpoint every 15 seconds and provisions a Grafana dashboard with request rate, provider errors, token volume, and p95 latency panels. Service metrics include extraction latency, provider errors, token counts, circuit breaker events, cache hit/miss counters, structured output retry counters, and optional estimated cost counters when TOKEN_PRICING_JSON is configured.
just generate-client
uv add ./clients/python/ner-clientGenerated package path: clients/python/ner-client.
The package exposes sync and async clients for /v1/configs, /v1/extract, and /v1/batch/extract.
uv run python scripts/profile.py --provider cerebras --model llama3.1-8b --texts-count 4 --concurrency 2 --text-lengths 64,256,1024The profiler emits JSON with per-length p50/p95/p99 latency, throughput, error counts, and aggregated token usage. Add --output path/to/report.json to persist the report for CI artifacts.