docs: add coding-agents + MLOps how-tos for on-prem LLMs#245
docs: add coding-agents + MLOps how-tos for on-prem LLMs#245typhoonzero wants to merge 3 commits into
Conversation
WalkthroughAdds two new how‑to docs: one teaching terminal coding agents (opencode, Codex CLI, Claude Code via proxy) against on‑prem InferenceService with setup, configs, vLLM tuning, and troubleshooting; the other describing MLOps workflows (gateway CRDs, benchmarking loops, fine‑tuning plans, and daily agent operations/guardrails). ChangesCoding Agents On-Premise Integration Guide
MLOps with Coding Agents
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`:
- Line 69: Fix the broken heading anchors by removing the backslash-escaped
braces and using a supported ID or plain auto-slug: replace "## Step 2: Enable
tool calling on the runtime \{`#enable-tool-calling-on-the-runtime`}" (and the
similar heading at the other location) with either a plain heading "## Step 2:
Enable tool calling on the runtime" (rely on auto-generated slug) or an
unescaped ID form "## Step 2: Enable tool calling on the runtime
{`#enable-tool-calling-on-the-runtime`}" so in-page links like
(`#enable-tool-calling-on-the-runtime`) and (`#claude-code`) resolve correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9456e8cf-88c7-4032-862a-69f099664a18
📒 Files selected for processing (1)
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
|
|
||
| A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: **base URL** (ending in `/v1`), **model name** (the `--served-model-name`), and **API key**. | ||
|
|
||
| ## Step 2: Enable tool calling on the runtime \{#enable-tool-calling-on-the-runtime} |
There was a problem hiding this comment.
Fix heading anchor syntax to avoid broken in-page links.
The escaped \{#...} likely won’t create the intended IDs, so links like (#enable-tool-calling-on-the-runtime) and (#claude-code) can break. Use plain heading text (auto-slug) or unescaped supported ID syntax for your MDX flavor.
Also applies to: 133-133
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`
at line 69, Fix the broken heading anchors by removing the backslash-escaped
braces and using a supported ID or plain auto-slug: replace "## Step 2: Enable
tool calling on the runtime \{`#enable-tool-calling-on-the-runtime`}" (and the
similar heading at the other location) with either a plain heading "## Step 2:
Enable tool calling on the runtime" (rely on auto-generated slug) or an
unescaped ID form "## Step 2: Enable tool calling on the runtime
{`#enable-tool-calling-on-the-runtime`}" so in-page links like
(`#enable-tool-calling-on-the-runtime`) and (`#claude-code`) resolve correctly.
Deploying alauda-ai with
|
| Latest commit: |
20ee4d9
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://c1ef65e4.alauda-ai.pages.dev |
| Branch Preview URL: | https://docs-coding-agents-onprem-ll.alauda-ai.pages.dev |
Explains how to point opencode, Codex CLI, and Claude Code at a self-hosted OpenAI-compatible InferenceService, building on the existing deploy and external-access how-tos. Covers enabling tool calling on the vLLM runtime, plus best practices for performance tuning, matching a model to available GPU memory, and getting started with vibe coding and MLOps workflows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21fc546 to
336c3e3
Compare
Follow-on to the on-prem coding agent guide. Covers four agent-driven MLOps workflows: managing InferenceService and LLMInferenceService resources, configuring authentication and rate limiting on Envoy AI Gateway, an iterative agent-driven performance tuning loop, and reusable templates for fine-tuning plans and post-run reports. Links to the existing fine-tuning paths (Workbench Notebook, Training Hub, Kubeflow Trainer v2, LLM Compressor) and to the Envoy AI Gateway install doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restructures the Claude Code subsection into two options: - Option A — point Claude Code directly at an on-prem endpoint that speaks the Anthropic Messages API (native runner or gateway), using ANTHROPIC_BASE_URL + ANTHROPIC_MODEL plus the CLAUDE_CODE_* flags that keep the session on-premise (disable non-essential traffic, 1M context, attribution header, telemetry; cap MAX_OUTPUT_TOKENS). - Option B — keep the existing LiteLLM / claude-code-router path for OpenAI-compatible-only endpoints. The direct-env approach avoids a separate proxy when the endpoint already accepts Claude Code traffic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx (1)
161-161: 💤 Low valueConsider rewording for clarity.
The phrase "very large outputs" could be more precise. Consider "excessively large outputs" or "outputs larger than the on-prem model supports" for better clarity.
Suggested rewording
-- The `CLAUDE_CODE_DISABLE_*` and `CLAUDE_CODE_*=0` flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor. +- The `CLAUDE_CODE_DISABLE_*` and `CLAUDE_CODE_*=0` flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, excessively large outputs) the on-prem model cannot honor.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx` at line 161, Update the sentence that mentions "very large outputs" to a clearer phrase: replace that fragment in the string referencing the flags CLAUDE_CODE_DISABLE_* and CLAUDE_CODE_*=0 with either "excessively large outputs" or "outputs larger than the on-prem model supports" so the intent is explicit that outputs may exceed on‑prem model capacity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`:
- Around line 148-152: Update the inline comments to match documented Claude
Code env var semantics: change the comment for
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to indicate it disables
telemetry/auto-updater/feedback/error reporting; change
CLAUDE_CODE_ATTRIBUTION_HEADER=0 to say it removes the system-prompt attribution
block; change CLAUDE_CODE_DISABLE_1M_CONTEXT=1 to state it disables the
1M‑context model variants in the picker; and change
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 to note it sets the max output-tokens cap.
In
`@docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx`:
- Line 266: The markdown link text "vLLM benchmarking" currently points to
https://docs.vllm.ai/en/latest/serving/usage_stats.html which is the usage-stats
page; update the link so the label and URL match by either changing the URL to
the official benchmarking doc (for example the benchmarking CLI page under vLLM
docs, e.g., .../benchmarking/cli/) or change the link label to "vLLM usage
stats" to reflect the existing URL; locate the markdown anchor "[vLLM
benchmarking](https://docs.vllm.ai/en/latest/serving/usage_stats.html)" in the
file and replace the URL or label accordingly.
---
Nitpick comments:
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`:
- Line 161: Update the sentence that mentions "very large outputs" to a clearer
phrase: replace that fragment in the string referencing the flags
CLAUDE_CODE_DISABLE_* and CLAUDE_CODE_*=0 with either "excessively large
outputs" or "outputs larger than the on-prem model supports" so the intent is
explicit that outputs may exceed on‑prem model capacity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 493f1a30-6237-4b14-b5b1-41e17c39e61f
📒 Files selected for processing (2)
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdxdocs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
| export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 # suppress optional traffic to Anthropic-hosted services | ||
| export CLAUDE_CODE_ATTRIBUTION_HEADER=0 # drop the Anthropic attribution header | ||
| export CLAUDE_CODE_ENABLE_TELEMETRY=0 # disable telemetry | ||
| export CLAUDE_CODE_DISABLE_1M_CONTEXT=1 # disable the 1M-context feature; most on-prem models can't serve it | ||
| export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 # cap to what the on-prem model and runtime support |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What environment variables does Claude Code support for disabling telemetry and controlling traffic to Anthropic services?
💡 Result:
Claude Code supports several environment variables to manage telemetry and control network traffic to Anthropic services [1][2]. Telemetry and Non-Essential Traffic Controls To opt out of telemetry and reduce non-essential network traffic, the following environment variables can be set to 1 [1][3]: DISABLE_TELEMETRY: Opts out of telemetry. This disables Statsig-based telemetry event logging and feature-flag fetching [1][4][5]. DO_NOT_TRACK: Equivalent to DISABLE_TELEMETRY; Claude Code recognizes this cross-tool convention [1]. CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: A comprehensive setting that is equivalent to simultaneously setting DISABLE_TELEMETRY, DISABLE_AUTOUPDATER, DISABLE_FEEDBACK_COMMAND, and DISABLE_ERROR_REPORTING [1]. DISABLE_GROWTHBOOK: Specifically disables the fetching of remote feature flags, causing the application to use built-in code defaults instead [1][4]. CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY: Disables session quality surveys [1]. Managing Telemetry and Observability While the above variables disable Anthropic’s built-in telemetry, Claude Code also supports user-managed OpenTelemetry (OTel) for custom observability [1][5]. CLAUDE_CODE_ENABLE_TELEMETRY: Set to 1 to enable OpenTelemetry data collection [1][6]. This is required before configuring custom OTel exporters (e.g., using standard OTEL_EXPORTER_OTLP_* variables) [1][6]. Controlling Traffic to Anthropic Services Beyond disabling telemetry, you can control how and where Claude Code sends traffic to Anthropic services [7][2]: ANTHROPIC_BASE_URL: Overrides the default Anthropic API base URL, allowing you to route traffic through an alternative gateway or proxy [8][2]. ANTHROPIC_API_KEY: Used for authentication; setting this to an empty value is sometimes used to force authentication fallbacks (e.g., via alternative tokens or proxies) [8][2]. ANTHROPIC_UNIX_SOCKET: Configures the client to connect to the Anthropic API via a Unix domain socket instead of a standard network connection [2]. Configuration Methods These environment variables can be applied at the process level (e.g., in your shell profile) or via the env field in your.claude/settings.json file [9][2][10]. Note that environment variables set in settings.json apply to every session and subprocess spawned by Claude Code [9][10]. Top Results: [1], [4], [3], [7], [2]
Citations:
- 1: https://code.claude.com/docs/en/env-vars
- 2: https://sanbuphy-claude-code-source-code.mintlify.app/configuration/environment-variables
- 3: https://gist.github.com/jedisct1/9627644cda1c3929affe9b1ce8eaf714
- 4: [DOCS] Document feature-flag and bootstrap cache behavior when opt-out variables are enabled anthropics/claude-code#62382
- 5: [DOCS] Telemetry Configuration Ambiguity: Confusion between Anthropic internal metrics (Statsig) and User-managed OpenTelemetry anthropics/claude-code#19117
- 6: https://www.mintlify.com/VineeTagarwaL-code/claude-code/configuration/environment-variables
- 7: https://code.claude.com/docs/en/corporate-proxy
- 8: https://www.blogarama.com/internet-blogs/1309494-tech-basket-blog/76355821-use-claude-code-openrouter-free-models-may-2026
- 9: https://code.claude.com/docs/en/settings
- 10: https://code.claude.com/docs/en/configuration
Align the snippet’s inline comments with documented Claude Code env var behavior.
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1isn’t just “suppress optional traffic”; it’s documented as equivalent to disabling telemetry/auto-updater/feedback/error reporting.CLAUDE_CODE_ATTRIBUTION_HEADER=0removes the system-prompt attribution block (not a generic “Anthropic attribution header”).CLAUDE_CODE_DISABLE_1M_CONTEXT=1disables the 1M-context model variants shown in the picker.CLAUDE_CODE_MAX_OUTPUT_TOKENSis the max output-tokens cap (not “on-prem model/runtime support”).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`
around lines 148 - 152, Update the inline comments to match documented Claude
Code env var semantics: change the comment for
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to indicate it disables
telemetry/auto-updater/feedback/error reporting; change
CLAUDE_CODE_ATTRIBUTION_HEADER=0 to say it removes the system-prompt attribution
block; change CLAUDE_CODE_DISABLE_1M_CONTEXT=1 to state it disables the
1M‑context model variants in the picker; and change
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 to note it sets the max output-tokens cap.
| - [Envoy AI Gateway upstream documentation](https://aigateway.envoyproxy.io/) | ||
| - [Envoy Gateway upstream documentation](https://gateway.envoyproxy.io/) | ||
| - [KServe LLMInferenceService](https://kserve.github.io/website/) | ||
| - [vLLM benchmarking](https://docs.vllm.ai/en/latest/serving/usage_stats.html) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What is the current official vLLM documentation page for benchmarking (e.g., vllm bench serve), and does https://docs.vllm.ai/en/latest/serving/usage_stats.html` document benchmarking?`
💡 Result:
The official vLLM documentation for benchmarking is located at https://docs.vllm.ai/en/latest/benchmarking/cli/ [1]. This page serves as the primary guide for running benchmark tests and includes information on the vllm bench CLI tools [1][2]. Specifically, detailed CLI reference documentation for the vllm bench serve command can be found at https://docs.vllm.ai/en/stable/cli/bench/serve/ [3]. Regarding your second question, the page https://docs.vllm.ai/en/latest/serving/usage_stats.html does not document benchmarking [4][5]. Instead, that page (or its equivalent in other versions) documents vLLM's anonymous usage data collection, which includes information on what data is collected, how to preview it, and how to opt out of the collection [4][5]. Benchmarking is handled under separate sections dedicated to performance testing and evaluation [1][2].
Citations:
- 1: https://docs.vllm.ai/en/latest/benchmarking/cli/
- 2: https://docs.vllm.ai/en/v0.20.1/benchmarking/
- 3: https://docs.vllm.ai/en/stable/cli/bench/serve/
- 4: https://docs.vllm.ai/en/v0.18.2/usage/usage_stats/
- 5: https://docs.vllm.ai/en/v0.13.0/usage/usage_stats/
Fix vLLM link label/URL mismatch.
“vLLM benchmarking” currently links to https://docs.vllm.ai/en/latest/serving/usage_stats.html, which is about anonymous “usage stats” collection (not benchmarking). Update the link URL to the official benchmarking docs (e.g., .../benchmarking/cli/) or rename the label to match the usage-stats page.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx`
at line 266, The markdown link text "vLLM benchmarking" currently points to
https://docs.vllm.ai/en/latest/serving/usage_stats.html which is the usage-stats
page; update the link so the label and URL match by either changing the URL to
the official benchmarking doc (for example the benchmarking CLI page under vLLM
docs, e.g., .../benchmarking/cli/) or change the link label to "vLLM usage
stats" to reflect the existing URL; locate the markdown anchor "[vLLM
benchmarking](https://docs.vllm.ai/en/latest/serving/usage_stats.html)" in the
file and replace the URL or label accordingly.
Summary
Adds two stacked how-tos under
model_inference/inference_service/how_to, building on each other:1.
coding_agents_with_inference_service.mdx— Use Coding Agents with On-Premise Inference ServicesConnect terminal coding agents to a self-hosted, OpenAI-compatible
InferenceServiceso source code and prompts never leave the cluster:curlsmoke test for validating the endpoint (base URL / model name / API key).2.
mlops_with_coding_agents.mdx— Run MLOps with Coding Agents and On-Premise LLMsOnce the agent is wired up, the same agent drives day-to-day MLOps:
InferenceService/LLMInferenceService— agent loop (draft →kubectl apply --dry-run=server→ apply → poll → smoke-test) with concrete starter prompts.AIGatewayRoute,AIServiceBackend,BackendSecurityPolicy,SecurityPolicy, andBackendTrafficPolicy. Cross-links the existing envoy_ai_gateway intro/install docs.Test plan
doom lintpasses on both files (0 errors, 0 warnings)yarn lintpasses🤖 Generated with Claude Code
Summary by CodeRabbit