Rewrite Kubernetes Autoscaling: clearer Enable workflow + in-place vertical resize coverage #36531

Danny-Driscoll wants to merge 3 commits into master from
Conversation
Adds a new "How vertical recommendations are calculated" section under "How it works" that explains how Datadog computes CPU and memory request/limit recommendations, including:

- Memory: p95-based requests with decaying weights, peak-based limits
- CPU: p95/p99-based requests and limits
- Handling of request-equals-limit configurations
- OOMKill detection and response
- Key design principles (8-day lookback, safety margins, etc.)
…ize coverage

Restructure the Kubernetes Autoscaling page to give readers a clearer path from "I have a workload to optimize" to "it's running in production":

- "Enable Autoscaling for a workload" is now a decision-led section with three paths (Datadog UI setup wizard / hand-written manifest / IaC), including a comparison table and links to the in-app Setup wizard and the ArgoCD + Terraform guides.
- The four DPA example tabs each get a tighter "when to pick this" intro pointing at the single defining spec value.
- The Vertical CPU and Memory tab documents in-place pod resizing introduced in Cluster Agent 7.78. The flag is opt-in (`autoscaling.workload.in_place_vertical_scaling.enabled`) and requires the Kubernetes `pods/resize` subresource. `TriggerRollout` is documented as the per-workload opt-out.
- New "In-place vertical pod resize (opt-in) | 7.78+" row in the Agent version table.
- New "Manage workloads at scale" section covers day-two operations: template changes, `owner` Remote/Local, `applyPolicy.mode: Preview`, monitoring, and clean removal.
- "How vertical recommendations are calculated" moved to a new top-level Reference section near the bottom (placement only, content unchanged).
- Smaller cleanups: the 1000-workload-per-cluster cap is now a Note block; the Identify section links forward to Enable Autoscaling.

Co-authored-by: Cursor <cursoragent@cursor.com>
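For reference, the two knobs named in the commit message might look as follows. This is a hypothetical sketch: the dotted flag path and `applyPolicy.update.strategy: TriggerRollout` come from this PR, but the surrounding file layout and DPA spec structure are assumptions to be verified against the published CRD examples.

```yaml
# Cluster Agent configuration (for example, datadog-cluster.yaml):
# cluster-wide opt-in to in-place vertical pod resizing.
autoscaling:
  workload:
    in_place_vertical_scaling:
      enabled: true
---
# Per-workload opt-out on the DatadogPodAutoscaler: force a rollout
# instead of an in-place resize. Fields around applyPolicy are assumed.
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: my-workload-dpa
spec:
  applyPolicy:
    update:
      strategy: TriggerRollout
```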
Force-pushed from c0118db to 7765e01.
After you identify a workload to optimize, inspect its {{< ui >}}Scaling Recommendation{{< /ui >}}. Click {{< ui >}}Configure Recommendation{{< /ui >}} to add constraints or adjust target utilization levels before enabling.

- When you are ready to proceed with enabling Autoscaling for a workload, you have two options for deployment:
+ There are three ways to enable autoscaling for a workload. Pick the path that matches how you ship workloads today.
IMHO there are only 2 ways: the Datadog UI and authoring a DatadogPodAutoscaler.
Whether the DPA is applied through kubectl or any other tool doesn't change the path (more often than not it's Helm directly, or Helm wrapped by ArgoCD).
| Path | Best for | Where you start | Ongoing management |
|------|----------|-----------------|--------------------|
| **A. Datadog UI setup wizard** | A single workload, demos, or your first rollout | [Setup page][11] in the Datadog UI | Edit the workload's `DatadogPodAutoscaler` from the UI or your cluster |
| **B. Author a `DatadogPodAutoscaler` manifest** | Existing `kubectl`-driven workflows | Hand-written YAML applied with `kubectl apply` | Edit the manifest and reapply, or let Datadog manage the spec by setting `owner: Remote` |
The owner part is largely incorrect. The only value that should be documented in the doc is owner: Local. The customer should never set owner: Remote nor attempt to change owner values.
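Following the review above, a minimal hand-written manifest for path B would use `owner: Local`. This is a sketch, not taken from the PR diff: the `v1alpha2` field names (`targetRef`, `objectives`, `podResource`) are assumptions to be checked against the published CRD reference.

```yaml
# Hypothetical minimal DatadogPodAutoscaler for path B.
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: my-workload-dpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-workload
  owner: Local   # per the review: the only value the doc should show
  objectives:
    - type: PodResource
      podResource:
        name: cpu
        value:
          type: Utilization
          utilization: 80   # target 80% CPU utilization
```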
After a workload is autoscaled, day-two operations are managed through a combination of the `DatadogPodAutoscaler` resource and the Datadog UI:

- **Change the scaling template.** Edit the workload's `DatadogPodAutoscaler` spec (CPU target, replica bounds, scale-up and scale-down rules) directly, or pick a different template from the [Workload Scaling list view][8]. Changes take effect on the next reconcile.
- **Switch between Datadog-managed and self-managed specs.** Set `owner: Remote` to let Datadog refresh the recommendation in place, or `owner: Local` to pin the spec and treat Datadog's recommendations as advisory. Per-workload overrides on a `Local` resource always win.
Entirely incorrect, to be removed
| | How it's computed |
|---|---|
| **Request and limit recommendation** | Based on the **maximum peak memory usage** observed over the last 8 days. A **5% safety margin** is added. If an **OOMKill** is detected, an additional **20% bump** is applied to help prevent future out-of-memory events. |
We should highlight that it's configurable
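The memory rule in the table above is simple arithmetic, sketched here for clarity. This is an illustration of the described rule only, not Datadog's implementation, and (per the review) the margins should be treated as configurable, which is why they are parameters.

```python
def memory_request_recommendation(peak_bytes, oom_killed,
                                  safety_margin=0.05, oom_bump=0.20):
    """Sketch: peak memory over the lookback window, plus a 5% safety
    margin, plus a 20% bump when an OOMKill was observed."""
    recommendation = peak_bytes * (1 + safety_margin)
    if oom_killed:
        recommendation *= 1 + oom_bump
    return recommendation

# Example: 2 GiB observed peak, no OOMKill -> only the 5% margin applies.
print(memory_request_recommendation(2 * 1024**3, oom_killed=False))
```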
| | How it's computed |
|---|---|
| **Request recommendation** | Based on the **p95** of CPU usage relative to the current request over the last 8 days. A **10% safety margin** is then added. |
It's currently p90 for req and p95 for lim
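The percentile-plus-margin rule can be sketched as follows. Because the exact percentile is under review (p95 in the diff, p90 per the comment above), it is left as a parameter; this is an illustrative sketch, not Datadog's implementation.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    k = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[k]

def cpu_request_recommendation(usage_samples, pct=95, safety_margin=0.10):
    """Sketch: take the chosen percentile of CPU usage over the lookback
    window, then add the safety margin on top."""
    return percentile(usage_samples, pct) * (1 + safety_margin)
```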
| | How it's computed |
|---|---|
| **Request and limit recommendation** | Based on the **p99** of CPU usage relative to the current request over the last 8 days. A **5% safety margin** is then added. |

#### Key design principles
- **8-day lookback window**: All recommendations consider usage data from the past 8 days, providing enough history to capture weekly traffic patterns while remaining responsive to changes.
- **Decaying weights**: For memory request recommendations (when request ≠ limit), older samples are weighted less heavily, so the recommendation adapts faster to recent usage shifts.
It's for both CPU and Memory.
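The decaying-weights idea from the "Key design principles" bullet can be sketched as a weighted percentile where each sample's weight falls off with age. The half-life value and the weighting scheme here are illustrative placeholders, not Datadog's actual parameters.

```python
def decayed_weights(ages_days, half_life_days=2.0):
    """Older samples weigh less; the half-life is a placeholder value."""
    return [0.5 ** (age / half_life_days) for age in ages_days]

def weighted_percentile(samples, weights, q):
    """Smallest sample value at which cumulative weight reaches q of the total."""
    pairs = sorted(zip(samples, weights))
    total = sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]

# Recent high usage dominates older low usage: prints 500.
samples = [100, 100, 500, 500]   # usage values
ages = [7, 6, 1, 0]              # sample age in days
print(weighted_percentile(samples, decayed_weights(ages), 0.95))
```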
What does this PR do?
End-to-end editorial pass on the Kubernetes Autoscaling page focused on making the workload-enablement guidance clearer and more actionable, and bringing the doc up to date with the in-place vertical pod resize work that landed in Cluster Agent 7.78.
Major changes
Rewritten "Enable Autoscaling for a workload"
Replaced the previous two-bullet section with a decision-led structure. Readers now see a comparison table of three deployment paths and follow the one that matches how they ship workloads:
- `Setup` page at `/orchestration/scaling/setup`, naming all four templates (Optimize cost / Optimize balance / Optimize performance / Customize). Best for single workloads, demos, or first rollouts.
- `kubectl apply` workflow; points down to the existing tabbed example configurations.

In-place vertical pod resize (Cluster Agent 7.78)
- New `In-place vertical pod resize (opt-in) | 7.78+` row in the Agent version table under Requirements.
- Documents the opt-in flag (`autoscaling.workload.in_place_vertical_scaling.enabled`), the Kubernetes prerequisites (`pods/resize` subresource exposed via the `InPlacePodVerticalScaling` feature gate — beta in 1.33+, alpha opt-in for 1.27 to 1.32), the `Infeasible` → rollout fallback behavior, and `applyPolicy.update.strategy: TriggerRollout` as the per-workload opt-out.
- The flag is opt-in on `origin/7.79.x` today and will be a small targeted edit if/when the default changes.

Tighter "when to pick this" intro for each example tab
All four DPA YAML examples are unchanged, but each tab now opens with a 2-sentence framing of (1) when to pick the template and (2) the single most important spec value driving its behavior.
New "Manage workloads at scale" section
A short bullet-driven section on day-two operations: changing the scaling template, switching `owner: Remote` vs `Local`, pausing with `applyPolicy.mode: Preview`, watching the rollout in the Workload Scaling list view, and removing autoscaling cleanly.

Reference section
The detailed "How vertical recommendations are calculated" methodology block (added earlier on this branch) moves from under Requirements down to a new top-level Reference section near the bottom of the page. Content unchanged — placement only.
Smaller cleanups
- The 1000-workload-per-cluster cap is now a **Note:** block instead of a buried sentence.

Notes for reviewers
Made with Cursor