
Rewrite Kubernetes Autoscaling: clearer Enable workflow + in-place vertical resize coverage#36531

Open
Danny-Driscoll wants to merge 3 commits into master from ddriscoll/autoscaling-docs-rewrite

Conversation

@Danny-Driscoll
Contributor

What does this PR do?

End-to-end editorial pass on the Kubernetes Autoscaling page focused on making the workload-enablement guidance clearer and more actionable, and bringing the doc up to date with the in-place vertical pod resize work that landed in Cluster Agent 7.78.

Major changes

Rewritten "Enable Autoscaling for a workload"

Replaced the previous two-bullet section with a decision-led structure. Readers now see a comparison table of three deployment paths and follow the one that matches how they ship workloads:

  • Path A: Datadog UI setup wizard — fastest path; links to the in-app Setup page at /orchestration/scaling/setup and names all four templates (Optimize cost / Optimize balance / Optimize performance / Customize). Best for single workloads, demos, or first rollouts.
  • Path B: author a `DatadogPodAutoscaler` manifest and apply it with `kubectl apply`; points down to the existing tabbed example configurations. Best for existing `kubectl`-driven workflows.
  • Path C: manage as infrastructure as code — links to the existing ArgoCD and Terraform guides. Best for fleet teams.

In-place vertical pod resize (Cluster Agent 7.78)

  • New In-place vertical pod resize (opt-in) | 7.78+ row in the Agent version table under Requirements.
  • The Vertical CPU and Memory example tab now explains the opt-in flag (`autoscaling.workload.in_place_vertical_scaling.enabled`), the Kubernetes prerequisites (the `pods/resize` subresource exposed via the `InPlacePodVerticalScaling` feature gate: beta in 1.33+, alpha opt-in for 1.27 through 1.32), the Infeasible → rollout fallback behavior, and `applyPolicy.update.strategy: TriggerRollout` as the per-workload opt-out.
  • No claim of a future default flip — the doc is accurate against origin/7.79.x today and will be a small targeted edit if/when the default changes.
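Put together, the opt-in described above could look like the following sketch. The cluster-level flag lives in the Cluster Agent configuration and the per-workload opt-out lives in the `DatadogPodAutoscaler` spec; key paths are taken from this PR description, so verify them against the Cluster Agent 7.78+ configuration reference before use.

```yaml
# Cluster Agent configuration (cluster-level opt-in; sketch).
# Requires the pods/resize subresource, i.e. the InPlacePodVerticalScaling
# feature gate: beta in Kubernetes 1.33+, alpha opt-in for 1.27-1.32.
autoscaling:
  workload:
    in_place_vertical_scaling:
      enabled: true   # opt-in as of Cluster Agent 7.78; default remains off
---
# Per-workload opt-out in a DatadogPodAutoscaler spec: always trigger a
# full rollout instead of resizing pods in place.
spec:
  applyPolicy:
    update:
      strategy: TriggerRollout
```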

Tighter "when to pick this" intro for each example tab

All four DPA YAML examples are unchanged, but each tab now opens with a 2-sentence framing of (1) when to pick the template and (2) the single most important spec value driving its behavior.

New "Manage workloads at scale" section

A short bullet-driven section on day-two operations: changing the scaling template, switching owner: Remote vs Local, pausing with applyPolicy.mode: Preview, watching the rollout in the Workload Scaling list view, and removing autoscaling cleanly.
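As a concrete illustration of the pause operation mentioned above, the relevant `DatadogPodAutoscaler` fragment might look like this sketch, based on the field names cited in this PR; exact field placement should be checked against the CRD schema:

```yaml
spec:
  owner: Local      # spec is self-managed; Datadog recommendations are advisory
  applyPolicy:
    mode: Preview   # compute and surface recommendations without applying them
```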

Reference section

The detailed "How vertical recommendations are calculated" methodology block (added earlier on this branch) moves from under Requirements down to a new top-level Reference section near the bottom of the page. Content unchanged — placement only.

Smaller cleanups

  • The 1000-workloads-per-cluster cap is now a **Note:** block instead of a buried sentence.
  • The Identify section now ends with an explicit forward link to Enable Autoscaling for a workload.

Notes for reviewers

  • No new shortcodes, partials, or images.
  • No changes to the Operator/Helm Setup tabs or the Idle cost & savings section.
  • The vale baseline (0 errors, 4 warnings, 3 suggestions) is the same as before this rewrite — all remaining items are pre-existing in untouched copy.
  • This branch supersedes the closed draft PR #34515 (Add vertical scaling recommendations documentation to autoscaling page); the two earlier commits from that work (the methodology block and the Vale fixes) are preserved and travel with this PR.

Made with Cursor

@Danny-Driscoll Danny-Driscoll requested a review from a team as a code owner May 7, 2026 02:13
Danny-Driscoll and others added 3 commits May 6, 2026 22:52
Adds a new 'How vertical recommendations are calculated' section under
'How it works' that explains how Datadog computes CPU and memory
request/limit recommendations, including:
- Memory: p95-based requests with decaying weights, peak-based limits
- CPU: p95/p99-based requests and limits
- Handling of request-equals-limit configurations
- OOMKill detection and response
- Key design principles (8-day lookback, safety margins, etc.)
…ize coverage

Restructure the Kubernetes Autoscaling page to give readers a clearer path
from "I have a workload to optimize" to "it's running in production":

- "Enable Autoscaling for a workload" is now a decision-led section with
  three paths (Datadog UI setup wizard / hand-written manifest / IaC),
  including a comparison table and links to the in-app Setup wizard and
  the ArgoCD + Terraform guides.
- The four DPA example tabs each get a tighter "when to pick this" intro
  pointing at the single defining spec value.
- Vertical CPU and Memory tab documents in-place pod resizing introduced
  in Cluster Agent 7.78. The flag is opt-in
  (autoscaling.workload.in_place_vertical_scaling.enabled) and requires
  the Kubernetes pods/resize subresource. TriggerRollout is documented
  as the per-workload opt-out.
- New "In-place vertical pod resize (opt-in) | 7.78+" row in the Agent
  version table.
- New "Manage workloads at scale" section covers day-two operations:
  template changes, owner Remote/Local, applyPolicy.mode: Preview,
  monitoring, and clean removal.
- "How vertical recommendations are calculated" moved to a new top-level
  Reference section near the bottom (placement only, content unchanged).
- Smaller cleanups: 1000-workload-per-cluster cap is now a Note block;
  Identify section links forward to Enable Autoscaling.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Danny-Driscoll Danny-Driscoll force-pushed the ddriscoll/autoscaling-docs-rewrite branch from c0118db to 7765e01 Compare May 7, 2026 03:06
@github-actions
Contributor

github-actions Bot commented May 7, 2026

Preview links (active after the build_preview check completes)

Modified Files

After you identify a workload to optimize, inspect its {{< ui >}}Scaling Recommendation{{< /ui >}}. Click {{< ui >}}Configure Recommendation{{< /ui >}} to add constraints or adjust target utilization levels before enabling.

When you are ready to proceed with enabling Autoscaling for a workload, you have two options for deployment:
There are three ways to enable autoscaling for a workload. Pick the path that matches how you ship workloads today.

IMHO there are only 2 ways: the Datadog UI, and authoring a DatadogPodAutoscaler.
Whether the DPA is then applied through kubectl or some other tool (more often than not, Helm directly or Helm wrapped by ArgoCD) doesn't change the workflow.

| Path | Best for | Where you start | Ongoing management |
|------|----------|-----------------|--------------------|
| **A. Datadog UI setup wizard** | A single workload, demos, or your first rollout | [Setup page][11] in the Datadog UI | Edit the workload's `DatadogPodAutoscaler` from the UI or your cluster |
| **B. Author a `DatadogPodAutoscaler` manifest** | Existing `kubectl`-driven workflows | Hand-written YAML applied with `kubectl apply` | Edit the manifest and reapply, or let Datadog manage the spec by setting `owner: Remote` |
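A minimal Path B manifest could look like the sketch below, assuming a Deployment named `web` in namespace `default` (both hypothetical); the `apiVersion` and exact schema should be verified against the DPA CRD shipped with your Cluster Agent:

```yaml
apiVersion: datadoghq.com/v2alpha2   # verify against your installed CRD version
kind: DatadogPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  owner: Local   # spec is managed from this manifest, not from the Datadog UI
# applied with: kubectl apply -f datadog-pod-autoscaler.yaml
```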

The owner part is largely incorrect. The only value that should be documented in the doc is owner: Local. The customer should never set owner: Remote nor attempt to change owner values.

After a workload is autoscaled, day-two operations are managed through a combination of the `DatadogPodAutoscaler` resource and the Datadog UI:

- **Change the scaling template.** Edit the workload's `DatadogPodAutoscaler` spec (CPU target, replica bounds, scale-up and scale-down rules) directly, or pick a different template from the [Workload Scaling list view][8]. Changes take effect on the next reconcile.
- **Switch between Datadog-managed and self-managed specs.** Set `owner: Remote` to let Datadog refresh the recommendation in place, or `owner: Local` to pin the spec and treat Datadog's recommendations as advisory. Per-workload overrides on a `Local` resource always win.

Entirely incorrect, to be removed


| | How it's computed |
|---|---|
| **Request and limit recommendation** | Based on the **maximum peak memory usage** observed over the last 8 days. A **5% safety margin** is added. If an **OOMKill** is detected, an additional **20% bump** is applied to help prevent future out-of-memory events. |
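The memory rule in the table above can be written compactly as (notation mine):

```latex
\mathrm{rec}_{\mathrm{mem}}
  = \Big(\max_{t \,\in\, [\,T - 8\,\mathrm{d},\; T\,]} \mathrm{usage}_{\mathrm{mem}}(t)\Big)
    \times 1.05
    \times
    \begin{cases}
      1.20 & \text{if an OOMKill was detected} \\
      1    & \text{otherwise}
    \end{cases}
```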

We should highlight that it's configurable


| | How it's computed |
|---|---|
| **Request recommendation** | Based on the **p95** of CPU usage relative to the current request over the last 8 days. A **10% safety margin** is then added. |

It's currently p90 for the request and p95 for the limit.


| | How it's computed |
|---|---|
| **Request and limit recommendation** | Based on the **p99** of CPU usage relative to the current request over the last 8 days. A **5% safety margin** is then added. |

p95 as well

#### Key design principles

- **8-day lookback window**: All recommendations consider usage data from the past 8 days, providing enough history to capture weekly traffic patterns while remaining responsive to changes.
- **Decaying weights**: For memory request recommendations (when request ≠ limit), older samples are weighted less heavily, so the recommendation adapts faster to recent usage shifts.

It's for both CPU and Memory.
