
Rewrite Kubernetes Autoscaling: clearer Enable workflow + in-place vertical resize coverage#36531

Open
Danny-Driscoll wants to merge 3 commits into master from ddriscoll/autoscaling-docs-rewrite

Conversation

@Danny-Driscoll
Contributor

What does this PR do?

End-to-end editorial pass on the Kubernetes Autoscaling page focused on making the workload-enablement guidance clearer and more actionable, and bringing the doc up to date with the in-place vertical pod resize work that landed in Cluster Agent 7.78.

Major changes

Rewritten "Enable Autoscaling for a workload"

Replaced the previous two-bullet section with a decision-led structure. Readers now see a comparison table of three deployment paths and follow the one that matches how they ship workloads:

  • Path A: Datadog UI setup wizard — fastest path; links to the in-app Setup page at /orchestration/scaling/setup and names all four templates (Optimize cost / Optimize balance / Optimize performance / Customize). Best for single workloads, demos, or first rollouts.
  • Path B: author a `DatadogPodAutoscaler` manifest and apply it with `kubectl apply`; points down to the existing tabbed example configurations. Best for existing `kubectl`-driven workflows.
  • Path C: manage as infrastructure as code — links to the existing ArgoCD and Terraform guides. Best for fleet teams.

In-place vertical pod resize (Cluster Agent 7.78)

  • New In-place vertical pod resize (opt-in) | 7.78+ row in the Agent version table under Requirements.
  • The Vertical CPU and Memory example tab now explains the opt-in flag (`autoscaling.workload.in_place_vertical_scaling.enabled`), the Kubernetes prerequisites (the `pods/resize` subresource exposed via the `InPlacePodVerticalScaling` feature gate: beta in 1.33+, alpha opt-in for 1.27 through 1.32), the Infeasible → rollout fallback behavior, and `applyPolicy.update.strategy: TriggerRollout` as the per-workload opt-out.
  • No claim of a future default flip — the doc is accurate against origin/7.79.x today and will be a small targeted edit if/when the default changes.
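Put together, the opt-in described above could look like the following sketch. The cluster-level flag lives in the Cluster Agent configuration and the per-workload opt-out lives in the `DatadogPodAutoscaler` spec; key paths are taken from this PR description, so verify them against the Cluster Agent 7.78+ configuration reference before use.

```yaml
# Cluster Agent configuration (cluster-level opt-in; sketch).
# Requires the pods/resize subresource, i.e. the InPlacePodVerticalScaling
# feature gate: beta in Kubernetes 1.33+, alpha opt-in for 1.27-1.32.
autoscaling:
  workload:
    in_place_vertical_scaling:
      enabled: true   # opt-in as of Cluster Agent 7.78; default remains off
---
# Per-workload opt-out in a DatadogPodAutoscaler spec: always trigger a
# full rollout instead of resizing pods in place.
spec:
  applyPolicy:
    update:
      strategy: TriggerRollout
```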

Tighter "when to pick this" intro for each example tab

All four DPA YAML examples are unchanged, but each tab now opens with a 2-sentence framing of (1) when to pick the template and (2) the single most important spec value driving its behavior.

New "Manage workloads at scale" section

A short bullet-driven section on day-two operations: changing the scaling template, switching owner: Remote vs Local, pausing with applyPolicy.mode: Preview, watching the rollout in the Workload Scaling list view, and removing autoscaling cleanly.
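As a concrete illustration of the pause operation mentioned above, the relevant `DatadogPodAutoscaler` fragment might look like this sketch, based on the field names cited in this PR; exact field placement should be checked against the CRD schema:

```yaml
spec:
  owner: Local      # spec is self-managed; Datadog recommendations are advisory
  applyPolicy:
    mode: Preview   # compute and surface recommendations without applying them
```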

Reference section

The detailed "How vertical recommendations are calculated" methodology block (added earlier on this branch) moves from under Requirements down to a new top-level Reference section near the bottom of the page. Content unchanged — placement only.

Smaller cleanups

  • The 1000-workloads-per-cluster cap is now a **Note:** block instead of a buried sentence.
  • The Identify section now ends with an explicit forward link to Enable Autoscaling for a workload.

Notes for reviewers

  • No new shortcodes, partials, or images.
  • No changes to the Operator/Helm Setup tabs or the Idle cost & savings section.
  • The vale baseline (0 errors, 4 warnings, 3 suggestions) is the same as before this rewrite — all remaining items are pre-existing in untouched copy.
  • This branch supersedes the closed draft PR #34515 (Add vertical scaling recommendations documentation to autoscaling page); the two earlier commits from that work (the methodology block and the Vale fixes) are preserved and travel with this PR.

Made with Cursor

@Danny-Driscoll Danny-Driscoll requested a review from a team as a code owner May 7, 2026 02:13
Danny-Driscoll and others added 3 commits May 6, 2026 22:52
Adds a new 'How vertical recommendations are calculated' section under
'How it works' that explains how Datadog computes CPU and memory
request/limit recommendations, including:
- Memory: p95-based requests with decaying weights, peak-based limits
- CPU: p95/p99-based requests and limits
- Handling of request-equals-limit configurations
- OOMKill detection and response
- Key design principles (8-day lookback, safety margins, etc.)
…ize coverage

Restructure the Kubernetes Autoscaling page to give readers a clearer path
from "I have a workload to optimize" to "it's running in production":

- "Enable Autoscaling for a workload" is now a decision-led section with
  three paths (Datadog UI setup wizard / hand-written manifest / IaC),
  including a comparison table and links to the in-app Setup wizard and
  the ArgoCD + Terraform guides.
- The four DPA example tabs each get a tighter "when to pick this" intro
  pointing at the single defining spec value.
- Vertical CPU and Memory tab documents in-place pod resizing introduced
  in Cluster Agent 7.78. The flag is opt-in
  (autoscaling.workload.in_place_vertical_scaling.enabled) and requires
  the Kubernetes pods/resize subresource. TriggerRollout is documented
  as the per-workload opt-out.
- New "In-place vertical pod resize (opt-in) | 7.78+" row in the Agent
  version table.
- New "Manage workloads at scale" section covers day-two operations:
  template changes, owner Remote/Local, applyPolicy.mode: Preview,
  monitoring, and clean removal.
- "How vertical recommendations are calculated" moved to a new top-level
  Reference section near the bottom (placement only, content unchanged).
- Smaller cleanups: 1000-workload-per-cluster cap is now a Note block;
  Identify section links forward to Enable Autoscaling.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Danny-Driscoll Danny-Driscoll force-pushed the ddriscoll/autoscaling-docs-rewrite branch from c0118db to 7765e01 Compare May 7, 2026 03:06
@github-actions
Contributor

github-actions Bot commented May 7, 2026

Preview links (active after the build_preview check completes)

Modified Files

After you identify a workload to optimize, inspect its {{< ui >}}Scaling Recommendation{{< /ui >}}. Click {{< ui >}}Configure Recommendation{{< /ui >}} to add constraints or adjust target utilization levels before enabling.

When you are ready to proceed with enabling Autoscaling for a workload, you have two options for deployment:
There are three ways to enable autoscaling for a workload. Pick the path that matches how you ship workloads today.

IMHO there are only 2 ways: the Datadog UI, and authoring a DatadogPodAutoscaler.
Whether the DPA is then applied through kubectl or some other tool (more often than not, Helm directly or Helm wrapped by ArgoCD) doesn't change the workflow.

| Path | Best for | Where you start | Ongoing management |
|------|----------|-----------------|--------------------|
| **A. Datadog UI setup wizard** | A single workload, demos, or your first rollout | [Setup page][11] in the Datadog UI | Edit the workload's `DatadogPodAutoscaler` from the UI or your cluster |
| **B. Author a `DatadogPodAutoscaler` manifest** | Existing `kubectl`-driven workflows | Hand-written YAML applied with `kubectl apply` | Edit the manifest and reapply, or let Datadog manage the spec by setting `owner: Remote` |
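A minimal Path B manifest could look like the sketch below, assuming a Deployment named `web` in namespace `default` (both hypothetical); the `apiVersion` and exact schema should be verified against the DPA CRD shipped with your Cluster Agent:

```yaml
apiVersion: datadoghq.com/v2alpha2   # verify against your installed CRD version
kind: DatadogPodAutoscaler
metadata:
  name: web
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  owner: Local   # spec is managed from this manifest, not from the Datadog UI
# applied with: kubectl apply -f datadog-pod-autoscaler.yaml
```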

The owner part is largely incorrect. The only value that should be documented in the doc is owner: Local. The customer should never set owner: Remote nor attempt to change owner values.

After a workload is autoscaled, day-two operations are managed through a combination of the `DatadogPodAutoscaler` resource and the Datadog UI:

- **Change the scaling template.** Edit the workload's `DatadogPodAutoscaler` spec (CPU target, replica bounds, scale-up and scale-down rules) directly, or pick a different template from the [Workload Scaling list view][8]. Changes take effect on the next reconcile.
- **Switch between Datadog-managed and self-managed specs.** Set `owner: Remote` to let Datadog refresh the recommendation in place, or `owner: Local` to pin the spec and treat Datadog's recommendations as advisory. Per-workload overrides on a `Local` resource always win.

Entirely incorrect, to be removed


| | How it's computed |
|---|---|
| **Request and limit recommendation** | Based on the **maximum peak memory usage** observed over the last 8 days. A **5% safety margin** is added. If an **OOMKill** is detected, an additional **20% bump** is applied to help prevent future out-of-memory events. |
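The memory rule in the table above can be written compactly as (notation mine):

```latex
\mathrm{rec}_{\mathrm{mem}}
  = \Big(\max_{t \,\in\, [\,T - 8\,\mathrm{d},\; T\,]} \mathrm{usage}_{\mathrm{mem}}(t)\Big)
    \times 1.05
    \times
    \begin{cases}
      1.20 & \text{if an OOMKill was detected} \\
      1    & \text{otherwise}
    \end{cases}
```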

We should highlight that it's configurable


| | How it's computed |
|---|---|
| **Request recommendation** | Based on the **p95** of CPU usage relative to the current request over the last 8 days. A **10% safety margin** is then added. |

It's currently p90 for the request and p95 for the limit.


| | How it's computed |
|---|---|
| **Request and limit recommendation** | Based on the **p99** of CPU usage relative to the current request over the last 8 days. A **5% safety margin** is then added. |

p95 as well

#### Key design principles

- **8-day lookback window**: All recommendations consider usage data from the past 8 days, providing enough history to capture weekly traffic patterns while remaining responsive to changes.
- **Decaying weights**: For memory request recommendations (when request ≠ limit), older samples are weighted less heavily, so the recommendation adapts faster to recent usage shifts.

It's for both CPU and Memory.
