Skip to content

docs: add coding-agents + MLOps how-tos for on-prem LLMs#245

Open
typhoonzero wants to merge 3 commits into
masterfrom
docs/coding-agents-onprem-llm
Open

docs: add coding-agents + MLOps how-tos for on-prem LLMs#245
typhoonzero wants to merge 3 commits into
masterfrom
docs/coding-agents-onprem-llm

Conversation

@typhoonzero
Copy link
Copy Markdown
Contributor

@typhoonzero typhoonzero commented May 29, 2026

Summary

Adds two stacked how-tos under model_inference/inference_service/how_to, building on each other:

1. coding_agents_with_inference_service.mdx — Use Coding Agents with On-Premise Inference Services

Connect terminal coding agents to a self-hosted, OpenAI-compatible InferenceService so source code and prompts never leave the cluster:

  • Builds on and links to the existing Create Inference Service using CLI and Configure External Access how-tos rather than repeating them.
  • Adds a curl smoke test for validating the endpoint (base URL / model name / API key).
  • Shows how to enable tool (function) calling on the vLLM runtime — the prerequisite that lets agents actually edit files and run commands.
  • Per-agent connection config for opencode, Codex CLI, and Claude Code (the latter via an Anthropic→OpenAI translation proxy such as LiteLLM or claude-code-router).
  • Best practices: a GPU-memory→model-size table, performance tuning for agent traffic (prefix caching, GPU memory utilization, chunked prefill, CUDA graphs, quantization, autoscaling/cold-start trade-offs), and getting-started guidance for vibe coding and MLOps workflows.

2. mlops_with_coding_agents.mdx — Run MLOps with Coding Agents and On-Premise LLMs

Once the agent is wired up, the same agent drives day-to-day MLOps:

  • Manage InferenceService / LLMInferenceService — agent loop (draft → kubectl apply --dry-run=server → apply → poll → smoke-test) with concrete starter prompts.
  • Configure Envoy AI Gateway — auth and rate limits via AIGatewayRoute, AIServiceBackend, BackendSecurityPolicy, SecurityPolicy, and BackendTrafficPolicy. Cross-links the existing envoy_ai_gateway intro/install docs.
  • Agent-driven performance tuning loop — five steps: SLOs → reproducible benchmark → one change per iteration → measure → stop on SLO or hardware ceiling. Cross-links into doc 1's tuning section instead of duplicating the flag list.
  • Fine-tuning plans and reports — a tool-selection table (Notebook / Training Hub / Kubeflow Trainer v2 / LLM Compressor) plus two ready-to-commit markdown templates: a pre-run plan and a post-run report designed for the agent to fill in from MLflow runs, training logs, and eval outputs.
  • Adds a "daily MLOps loop" walk-through and guardrails (read-only first, server-side dry-run by default, one change per iteration, no fabricated metrics, no hosted-provider fallback).

Test plan

  • doom lint passes on both files (0 errors, 0 warnings)
  • Pre-commit yarn lint passes
  • All internal cross-links to existing docs (inference_service how-tos and troubleshooting, envoy_ai_gateway, kubeflow, workbench, llm-compressor, infrastructure_management/hardware_profile) resolve
  • Reviewer: verify rendered pages and the two markdown templates render cleanly inside fenced code blocks

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guide for using terminal-based coding agents with self-hosted inference service instances, including setup instructions, agent configuration, and performance tuning tips.
    • Added documentation covering MLOps workflows powered by coding agents on on-premise infrastructure, with guidance on resource management, performance optimization, and operational best practices.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

Walkthrough

Adds two new how‑to docs: one teaching terminal coding agents (opencode, Codex CLI, Claude Code via proxy) against on‑prem InferenceService with setup, configs, vLLM tuning, and troubleshooting; the other describing MLOps workflows (gateway CRDs, benchmarking loops, fine‑tuning plans, and daily agent operations/guardrails).

Changes

Coding Agents On-Premise Integration Guide

Layer / File(s) Summary
Page metadata and prerequisites
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
Frontmatter, localization, intro framing, warning about evolving agent config formats, and prerequisites (OpenAI-compatible endpoint, network access, tool-calling parser, agent CLI).
Integration architecture and endpoint setup
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
End-to-end integration flow diagram, per-agent API compatibility mapping, Step 1 curl smoke-test, and Step 2 vLLM tool-calling enablement with parser/chat-template guidance.
Agent-specific integrations
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
Configuration details for opencode (env var wiring, model key), Codex CLI (config.toml, base URL, wire_api = "chat"), and Claude Code (Anthropic→OpenAI proxy options with env var routing for models).
Model selection and vLLM performance tuning
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
Best-practice model selection (GPU memory, parser/tool-calling, context budgeting, quantization) and vLLM tuning recommendations (prefix caching, KV cache sizing, chunked prefill, CUDA graphs, batching, dtype, tensor parallelism, speculative decoding, autoscaling, timeouts) plus vibe-coding tips and MLOps starter tasks.
Troubleshooting and references
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
Troubleshooting checklist mapping symptoms to setup areas and curated references to deployment, vLLM tool-calling, and agent/proxy docs.

MLOps with Coding Agents

Layer / File(s) Summary
Page frontmatter and top-level guidance
docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
Frontmatter, scoped operational safety warning, and framing of four supported MLOps workflows for agent-driven operations.
Gateway CRDs and performance tuning loop
docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
Agent-managed Envoy AI Gateway CRD workflow (auth, rate-limiting) with smoke-test steps and a reproducible benchmark-and-iterate performance tuning loop with SLO-style objectives.
Fine-tuning planning and reporting templates
docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
Reusable fine-tuning job plan and report templates, guidance on capturing training/eval artifacts, example prompts, and fields to mark as TODO when data is missing.
Daily MLOps loop and guardrails
docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx
Daily end-to-end agent loop, guardrails (read-only first, --dry-run=server, prevent metric fabrication, keep ops on-prem), and references to prerequisite docs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 In a burrow of docs I nibble and write,
Agents and gateways snug under soft light.
curl, parsers, and proxies all lined in a row—
Deploy, test, tune, and watch the models glow.
hops off to deploy with a carrot-sized CLI

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main addition: two how-to documentation pages for using coding agents with on-premises LLMs.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/coding-agents-onprem-llm

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`:
- Line 69: Fix the broken heading anchors by removing the backslash-escaped
braces and using a supported ID or plain auto-slug: replace "## Step 2: Enable
tool calling on the runtime \{`#enable-tool-calling-on-the-runtime`}" (and the
similar heading at the other location) with either a plain heading "## Step 2:
Enable tool calling on the runtime" (rely on auto-generated slug) or an
unescaped ID form "## Step 2: Enable tool calling on the runtime
{`#enable-tool-calling-on-the-runtime`}" so in-page links like
(`#enable-tool-calling-on-the-runtime`) and (`#claude-code`) resolve correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9456e8cf-88c7-4032-862a-69f099664a18

📥 Commits

Reviewing files that changed from the base of the PR and between fff230c and 21fc546.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx


A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: **base URL** (ending in `/v1`), **model name** (the `--served-model-name`), and **API key**.

## Step 2: Enable tool calling on the runtime \{#enable-tool-calling-on-the-runtime}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix heading anchor syntax to avoid broken in-page links.

The escaped \{#...} likely won’t create the intended IDs, so links like (#enable-tool-calling-on-the-runtime) and (#claude-code) can break. Use plain heading text (auto-slug) or unescaped supported ID syntax for your MDX flavor.

Also applies to: 133-133

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`
at line 69, Fix the broken heading anchors by removing the backslash-escaped
braces and using a supported ID or plain auto-slug: replace "## Step 2: Enable
tool calling on the runtime \{`#enable-tool-calling-on-the-runtime`}" (and the
similar heading at the other location) with either a plain heading "## Step 2:
Enable tool calling on the runtime" (rely on auto-generated slug) or an
unescaped ID form "## Step 2: Enable tool calling on the runtime
{`#enable-tool-calling-on-the-runtime`}" so in-page links like
(`#enable-tool-calling-on-the-runtime`) and (`#claude-code`) resolve correctly.

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 29, 2026

Deploying alauda-ai with  Cloudflare Pages  Cloudflare Pages

Latest commit: 20ee4d9
Status: ✅  Deploy successful!
Preview URL: https://c1ef65e4.alauda-ai.pages.dev
Branch Preview URL: https://docs-coding-agents-onprem-ll.alauda-ai.pages.dev

View logs

Explains how to point opencode, Codex CLI, and Claude Code at a
self-hosted OpenAI-compatible InferenceService, building on the existing
deploy and external-access how-tos. Covers enabling tool calling on the
vLLM runtime, plus best practices for performance tuning, matching a
model to available GPU memory, and getting started with vibe coding and
MLOps workflows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@typhoonzero typhoonzero force-pushed the docs/coding-agents-onprem-llm branch from 21fc546 to 336c3e3 Compare May 29, 2026 03:58
Follow-on to the on-prem coding agent guide. Covers four agent-driven
MLOps workflows: managing InferenceService and LLMInferenceService
resources, configuring authentication and rate limiting on Envoy AI
Gateway, an iterative agent-driven performance tuning loop, and reusable
templates for fine-tuning plans and post-run reports. Links to the
existing fine-tuning paths (Workbench Notebook, Training Hub, Kubeflow
Trainer v2, LLM Compressor) and to the Envoy AI Gateway install doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@typhoonzero typhoonzero changed the title docs: add guide for using coding agents with on-prem inference services docs: add coding-agents + MLOps how-tos for on-prem LLMs May 29, 2026
Restructures the Claude Code subsection into two options:
- Option A — point Claude Code directly at an on-prem endpoint that
  speaks the Anthropic Messages API (native runner or gateway), using
  ANTHROPIC_BASE_URL + ANTHROPIC_MODEL plus the CLAUDE_CODE_* flags that
  keep the session on-premise (disable non-essential traffic, 1M
  context, attribution header, telemetry; cap MAX_OUTPUT_TOKENS).
- Option B — keep the existing LiteLLM / claude-code-router path for
  OpenAI-compatible-only endpoints.

The direct-env approach avoids a separate proxy when the endpoint
already accepts Claude Code traffic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx (1)

161-161: 💤 Low value

Consider rewording for clarity.

The phrase "very large outputs" could be more precise. Consider "excessively large outputs" or "outputs larger than the on-prem model supports" for better clarity.

Suggested rewording
-- The `CLAUDE_CODE_DISABLE_*` and `CLAUDE_CODE_*=0` flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor.
+- The `CLAUDE_CODE_DISABLE_*` and `CLAUDE_CODE_*=0` flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, excessively large outputs) the on-prem model cannot honor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`
at line 161, Update the sentence that mentions "very large outputs" to a clearer
phrase: replace that fragment in the string referencing the flags
CLAUDE_CODE_DISABLE_* and CLAUDE_CODE_*=0 with either "excessively large
outputs" or "outputs larger than the on-prem model supports" so the intent is
explicit that outputs may exceed on‑prem model capacity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`:
- Around line 148-152: Update the inline comments to match documented Claude
Code env var semantics: change the comment for
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to indicate it disables
telemetry/auto-updater/feedback/error reporting; change
CLAUDE_CODE_ATTRIBUTION_HEADER=0 to say it removes the system-prompt attribution
block; change CLAUDE_CODE_DISABLE_1M_CONTEXT=1 to state it disables the
1M‑context model variants in the picker; and change
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 to note it sets the max output-tokens cap.

In
`@docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx`:
- Line 266: The markdown link text "vLLM benchmarking" currently points to
https://docs.vllm.ai/en/latest/serving/usage_stats.html which is the usage-stats
page; update the link so the label and URL match by either changing the URL to
the official benchmarking doc (for example the benchmarking CLI page under vLLM
docs, e.g., .../benchmarking/cli/) or change the link label to "vLLM usage
stats" to reflect the existing URL; locate the markdown anchor "[vLLM
benchmarking](https://docs.vllm.ai/en/latest/serving/usage_stats.html)" in the
file and replace the URL or label accordingly.

---

Nitpick comments:
In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`:
- Line 161: Update the sentence that mentions "very large outputs" to a clearer
phrase: replace that fragment in the string referencing the flags
CLAUDE_CODE_DISABLE_* and CLAUDE_CODE_*=0 with either "excessively large
outputs" or "outputs larger than the on-prem model supports" so the intent is
explicit that outputs may exceed on‑prem model capacity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 493f1a30-6237-4b14-b5b1-41e17c39e61f

📥 Commits

Reviewing files that changed from the base of the PR and between 21fc546 and 20ee4d9.

📒 Files selected for processing (2)
  • docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx
  • docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx

Comment on lines +148 to +152
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 # suppress optional traffic to Anthropic-hosted services
export CLAUDE_CODE_ATTRIBUTION_HEADER=0 # drop the Anthropic attribution header
export CLAUDE_CODE_ENABLE_TELEMETRY=0 # disable telemetry
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1 # disable the 1M-context feature; most on-prem models can't serve it
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 # cap to what the on-prem model and runtime support
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What environment variables does Claude Code support for disabling telemetry and controlling traffic to Anthropic services?

💡 Result:

Claude Code supports several environment variables to manage telemetry and control network traffic to Anthropic services [1][2]. Telemetry and Non-Essential Traffic Controls To opt out of telemetry and reduce non-essential network traffic, the following environment variables can be set to 1 [1][3]: DISABLE_TELEMETRY: Opts out of telemetry. This disables Statsig-based telemetry event logging and feature-flag fetching [1][4][5]. DO_NOT_TRACK: Equivalent to DISABLE_TELEMETRY; Claude Code recognizes this cross-tool convention [1]. CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: A comprehensive setting that is equivalent to simultaneously setting DISABLE_TELEMETRY, DISABLE_AUTOUPDATER, DISABLE_FEEDBACK_COMMAND, and DISABLE_ERROR_REPORTING [1]. DISABLE_GROWTHBOOK: Specifically disables the fetching of remote feature flags, causing the application to use built-in code defaults instead [1][4]. CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY: Disables session quality surveys [1]. Managing Telemetry and Observability While the above variables disable Anthropic’s built-in telemetry, Claude Code also supports user-managed OpenTelemetry (OTel) for custom observability [1][5]. CLAUDE_CODE_ENABLE_TELEMETRY: Set to 1 to enable OpenTelemetry data collection [1][6]. This is required before configuring custom OTel exporters (e.g., using standard OTEL_EXPORTER_OTLP_* variables) [1][6]. Controlling Traffic to Anthropic Services Beyond disabling telemetry, you can control how and where Claude Code sends traffic to Anthropic services [7][2]: ANTHROPIC_BASE_URL: Overrides the default Anthropic API base URL, allowing you to route traffic through an alternative gateway or proxy [8][2]. ANTHROPIC_API_KEY: Used for authentication; setting this to an empty value is sometimes used to force authentication fallbacks (e.g., via alternative tokens or proxies) [8][2]. ANTHROPIC_UNIX_SOCKET: Configures the client to connect to the Anthropic API via a Unix domain socket instead of a standard network connection [2]. Configuration Methods These environment variables can be applied at the process level (e.g., in your shell profile) or via the env field in your.claude/settings.json file [9][2][10]. Note that environment variables set in settings.json apply to every session and subprocess spawned by Claude Code [9][10]. Top Results: [1], [4], [3], [7], [2]

Citations:


Align the snippet’s inline comments with documented Claude Code env var behavior.

  • CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 isn’t just “suppress optional traffic”; it’s documented as equivalent to disabling telemetry/auto-updater/feedback/error reporting.
  • CLAUDE_CODE_ATTRIBUTION_HEADER=0 removes the system-prompt attribution block (not a generic “Anthropic attribution header”).
  • CLAUDE_CODE_DISABLE_1M_CONTEXT=1 disables the 1M-context model variants shown in the picker.
  • CLAUDE_CODE_MAX_OUTPUT_TOKENS is the max output-tokens cap (not “on-prem model/runtime support”).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/model_inference/inference_service/how_to/coding_agents_with_inference_service.mdx`
around lines 148 - 152, Update the inline comments to match documented Claude
Code env var semantics: change the comment for
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 to indicate it disables
telemetry/auto-updater/feedback/error reporting; change
CLAUDE_CODE_ATTRIBUTION_HEADER=0 to say it removes the system-prompt attribution
block; change CLAUDE_CODE_DISABLE_1M_CONTEXT=1 to state it disables the
1M‑context model variants in the picker; and change
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 to note it sets the max output-tokens cap.

- [Envoy AI Gateway upstream documentation](https://aigateway.envoyproxy.io/)
- [Envoy Gateway upstream documentation](https://gateway.envoyproxy.io/)
- [KServe LLMInferenceService](https://kserve.github.io/website/)
- [vLLM benchmarking](https://docs.vllm.ai/en/latest/serving/usage_stats.html)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the current official vLLM documentation page for benchmarking (e.g., vllm bench serve), and does https://docs.vllm.ai/en/latest/serving/usage_stats.html` document benchmarking?`

💡 Result:

The official vLLM documentation for benchmarking is located at https://docs.vllm.ai/en/latest/benchmarking/cli/ [1]. This page serves as the primary guide for running benchmark tests and includes information on the vllm bench CLI tools [1][2]. Specifically, detailed CLI reference documentation for the vllm bench serve command can be found at https://docs.vllm.ai/en/stable/cli/bench/serve/ [3]. Regarding your second question, the page https://docs.vllm.ai/en/latest/serving/usage_stats.html does not document benchmarking [4][5]. Instead, that page (or its equivalent in other versions) documents vLLM's anonymous usage data collection, which includes information on what data is collected, how to preview it, and how to opt out of the collection [4][5]. Benchmarking is handled under separate sections dedicated to performance testing and evaluation [1][2].

Citations:


Fix vLLM link label/URL mismatch.

“vLLM benchmarking” currently links to https://docs.vllm.ai/en/latest/serving/usage_stats.html, which is about anonymous “usage stats” collection (not benchmarking). Update the link URL to the official benchmarking docs (e.g., .../benchmarking/cli/) or rename the label to match the usage-stats page.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/model_inference/inference_service/how_to/mlops_with_coding_agents.mdx`
at line 266, The markdown link text "vLLM benchmarking" currently points to
https://docs.vllm.ai/en/latest/serving/usage_stats.html which is the usage-stats
page; update the link so the label and URL match by either changing the URL to
the official benchmarking doc (for example the benchmarking CLI page under vLLM
docs, e.g., .../benchmarking/cli/) or change the link label to "vLLM usage
stats" to reflect the existing URL; locate the markdown anchor "[vLLM
benchmarking](https://docs.vllm.ai/en/latest/serving/usage_stats.html)" in the
file and replace the URL or label accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant