125 changes: 125 additions & 0 deletions docs/encyclopedia/workers/serverless-workers/autoscaling.mdx
@@ -0,0 +1,125 @@
---
id: autoscaling
title: Serverless Worker autoscaling
sidebar_label: Autoscaling
description:
  How Temporal autoscales Serverless Workers on each compute provider, including the scaling signals, algorithm behavior,
  and tuning parameters.
slug: /encyclopedia/workers/serverless-workers/autoscaling
toc_max_heading_level: 4
keywords:
- serverless
- workers
- autoscaling
- lambda
- cloud run
- worker controller instance
tags:
- Workers
- Concepts
- Serverless
---

:::tip SUPPORT, STABILITY, and DEPENDENCY INFO

Serverless Workers are in [Pre-release](/evaluate/development-production-features/release-stages#pre-release) and available to select Temporal Cloud customers.
To request access during Pre-release, create a [support ticket](/cloud/support#support-ticket) or contact your account team.
APIs are experimental and may be subject to backwards-incompatible changes.
[Sign up for updates](https://temporal.io/pages/serverless-workers-updates) to be notified when Serverless Workers reach Public Preview.

:::

The [Worker Controller Instance (WCI)](/serverless-workers#worker-controller-instance) autoscales Serverless Workers
using two signals: sync match failure and Task Queue backlog. The autoscaling algorithm differs by compute provider
because of differences in cold start latency, invocation duration limits, and provider APIs.

## Scaling signals

Both compute providers use the same two signals to drive scaling decisions.

### Sync match failure {#sync-match-failure}

When a Task is submitted, the [Matching Service](/temporal-service/temporal-server#matching-service) attempts to route
it directly to an available Worker. If no Worker is available, the sync match fails, and the Matching Service pushes a
signal to the WCI. Because the Matching Service pushes each match failure as it happens, rather than the WCI polling on
a timer, the WCI can react without waiting for a polling interval.

### Task Queue backlog {#task-queue-backlog}

The WCI monitors Task Queue metadata to detect pending Tasks that have too few Workers to process them. When it detects
such a backlog, the WCI scales up.
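
As a rough illustration of how the two signals differ, the sketch below models the sync match failure as a pushed
callback and the backlog check as a comparison on sampled Task Queue metadata. The `SignalInbox` class, its method
names, and the `tasks_per_worker` threshold are hypothetical and are not part of any Temporal API. The criterion the WCI
actually uses for "not enough Workers" is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SignalInbox:
    """Illustrative collector for the two scaling signals (not a Temporal API)."""

    events: List[str] = field(default_factory=list)

    def on_sync_match_failure(self, task_queue: str) -> None:
        # Push path: called at the moment a sync match fails,
        # so there is no polling interval to wait out.
        self.events.append(f"sync_match_failure:{task_queue}")

    def on_backlog_sample(
        self, task_queue: str, backlog: int, workers: int, tasks_per_worker: int = 1
    ) -> None:
        # Metadata path: record a signal only when pending Tasks exceed what the
        # current Workers can absorb. `tasks_per_worker` is an assumed threshold.
        if backlog > 0 and workers * tasks_per_worker < backlog:
            self.events.append(f"backlog:{task_queue}")
```

Both paths feed the same scaling decision. The difference is only in how the signal arrives: one is pushed per event,
the other is derived from a metadata sample.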

## AWS Lambda {#aws-lambda}

The Lambda algorithm is event-driven and reactive. Sync match failure is the primary control signal, and Task Queue
backlog helps size each scale-out.

When the WCI needs more capacity, it calls the Lambda `Invoke` API (the operation authorized by the
`lambda:InvokeFunction` permission) to start new Workers. Each call is a discrete action ("invoke N more functions"),
not a target state. The WCI does not manage a fleet of instances.
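
The "invoke N more functions" action can be sketched as a loop of asynchronous invocations. This is a minimal
illustration, not the WCI's implementation: `invoke_workers` is a hypothetical helper, the use of the `Event`
(asynchronous) invocation type is an assumption, and `lambda_client` is expected to behave like a boto3 Lambda client,
whose `invoke` method accepts `FunctionName`, `InvocationType`, and `Payload`.

```python
import json
from typing import Any, Dict, Optional


def invoke_workers(
    lambda_client: Any,
    function_name: str,
    count: int,
    payload: Optional[Dict[str, Any]] = None,
) -> int:
    """Fire `count` asynchronous invocations and return how many were accepted."""
    body = json.dumps(payload or {}).encode("utf-8")
    accepted = 0
    for _ in range(count):
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # assumed: asynchronous, do not wait for the Worker
            Payload=body,
        )
        # Lambda answers an accepted asynchronous invocation with HTTP 202.
        if response.get("StatusCode") == 202:
            accepted += 1
    return accepted
```

With boto3 installed, `lambda_client` would typically come from `boto3.client("lambda")`. Note that the helper returns
a count of accepted invocations rather than a fleet size, which mirrors the discrete-action model described above.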

### Scale-out

On sync match failure, the WCI invokes new Lambda functions. Because a Lambda cold start takes from under a second to a
few seconds, reactive-only control does not create meaningful backlog overshoot, and the WCI can scale from zero with
low latency.

### Scale-in

Scale-in is automatic. Each Lambda invocation runs until the Worker has finished processing available Tasks or
approaches the 15-minute execution time limit, then shuts down. There is no drain logic or stabilization window. The WCI
does not need to actively remove capacity.

### Instance model

Each invocation is independent. The Worker starts, creates a fresh client connection, processes multiple Tasks until near
the execution time limit, and then shuts down gracefully. There is no shared state across invocations.
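
The lifecycle of a single invocation can be sketched as a loop that stops either when no Tasks remain or when the
remaining execution time drops below a safety margin. The function name, the `poll_task` callback, and the 30-second
margin are hypothetical. In a Python Lambda handler, the remaining time would typically come from
`context.get_remaining_time_in_millis()`.

```python
from typing import Callable, Optional

Task = Callable[[], None]


def run_until_deadline(
    remaining_ms: Callable[[], int],
    poll_task: Callable[[], Optional[Task]],
    margin_ms: int = 30_000,  # assumed safety margin before the time limit
) -> int:
    """Process Tasks until none are left or the time limit is near."""
    processed = 0
    while remaining_ms() > margin_ms:
        task = poll_task()
        if task is None:
            break  # no Tasks available: exit so capacity scales in on its own
        task()
        processed += 1
    return processed  # the caller then shuts the Worker down gracefully
```

Because the loop owns its own exit conditions, nothing outside the invocation has to remove capacity, which is why
Lambda scale-in needs no drain logic in the controller.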

## GCP Cloud Run {#cloud-run}

The Cloud Run algorithm is a hybrid rate-plus-backlog controller. It extends the base algorithm with a latency-first
fast-path that reacts to sync match failures.

Unlike Lambda, the WCI outputs a target state ("there should be _c_ instances") rather than discrete invocations. The
WCI adjusts Cloud Run's instance count through the Cloud Run admin API.

### Scale-out

The algorithm uses four layers to determine the desired instance count:

1. **Feedforward base capacity.** The WCI estimates the required fleet size from the Task arrival rate, divided by per-instance throughput at the target utilization. Feedforward sizing is critical because Cloud Run cold start is approximately 10-30 seconds. If the WCI waited for backlog to signal under-provisioning, new capacity would still be 10-30 seconds away.
2. **Backlog-drain correction.** If a backlog exists, the WCI adds instances to drain it within the target queue wait time.
3. **Warm-reserve headroom.** The WCI maintains extra capacity above the feedforward estimate to absorb sync match failures without triggering cold starts.
4. **Sync match fast-path.** On any sync match failure, the WCI immediately re-evaluates and scales out if the current fleet is undersized. This event-triggered path bypasses the regular control interval.

The final desired count is the maximum of the regular control-interval calculation and the event-driven fast-path
calculation, clamped to the configured minimum and maximum instance counts, and quantized to the scaling granularity.
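
The layers above can be sketched as one sizing function. This is an illustration under stated assumptions, not the
WCI's actual formula: the field names are hypothetical, the first three layers are assumed to be additive, the fast
path is modeled as a separately computed `fast_path_count`, and quantization is assumed to round up.

```python
import math
from dataclasses import dataclass


@dataclass
class ScaleInputs:
    arrival_rate: float         # Tasks per second arriving on the Task Queue
    per_instance_rate: float    # Tasks per second one instance handles at 100% utilization
    utilization_target: float   # for example 0.75
    backlog: int                # pending Tasks
    queue_wait_target_s: float  # for example 4.0
    warm_reserve: int           # extra instances kept as headroom
    c_min: int
    c_max: int
    granularity: int = 1


def desired_instances(x: ScaleInputs, fast_path_count: int = 0) -> int:
    effective_rate = x.per_instance_rate * x.utilization_target
    feedforward = x.arrival_rate / effective_rate                         # layer 1
    backlog_drain = x.backlog / (x.queue_wait_target_s * effective_rate)  # layer 2
    periodic = math.ceil(feedforward + backlog_drain + x.warm_reserve)    # plus layer 3
    target = max(periodic, fast_path_count)                               # layer 4
    clamped = max(x.c_min, min(x.c_max, target))
    quantized = math.ceil(clamped / x.granularity) * x.granularity
    return min(quantized, x.c_max)  # rounding up must not exceed the maximum
```

For example, with 30 Tasks per second arriving, 10 Tasks per second per instance, a 0.75 utilization target, a backlog
of 60 Tasks, a 4-second queue wait target, and one warm-reserve instance, the sketch yields 4 + 2 + 1 = 7 instances
before clamping.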

### Scale-in

Scale-in is conservative to avoid oscillation:

- **Scale-down stabilization window.** After a scale-down decision, the WCI waits (default 300 seconds) before removing instances. If load increases during this window, the scale-down is canceled.
- **Hold after scale-out.** After scaling out, the WCI holds the new capacity for a minimum period before considering scale-in.
- **Drain logic.** When removing instances, the WCI drains them over a configurable horizon, allowing in-flight Tasks to complete before the instance is terminated.
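
The stabilization window and the hold after scale-out can be sketched as a small gate that decides whether a scale-in
may proceed. The `ScaleInGate` class and its fields are hypothetical, and the choice to start the stabilization window
only after the hold period ends is an assumption. Drain logic is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleInGate:
    stabilization_s: float = 300.0        # scale-down stabilization window
    hold_after_scale_out_s: float = 60.0  # minimum hold after a scale-out
    last_scale_out_at: Optional[float] = None
    pending_since: Optional[float] = None

    def record_scale_out(self, now: float) -> None:
        self.last_scale_out_at = now
        self.pending_since = None  # load increased: cancel any pending scale-down

    def may_scale_in(self, now: float, current: int, desired: int) -> bool:
        if desired >= current:
            self.pending_since = None  # load came back during the window
            return False
        held = (
            self.last_scale_out_at is not None
            and now - self.last_scale_out_at < self.hold_after_scale_out_s
        )
        if held:
            return False  # still holding the capacity added by the last scale-out
        if self.pending_since is None:
            self.pending_since = now  # scale-down decision: start the window
        return now - self.pending_since >= self.stabilization_s
```

Timestamps are passed in as plain seconds so the gate can be exercised without a real clock.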

### Minimum instances

Setting `c_min >= 1` keeps at least one instance warm at all times. With constant traffic, this behaves like an
always-on Worker with elastic scale-up and scale-down. Setting `c_min = 0` enables full scale-to-zero but means the
first Task after an idle period incurs a cold start.

### Tuning parameters

The following parameters control Cloud Run autoscaling behavior. These are starting points for latency-first operation.

| Parameter | Starting value | Description |
| -------------------------- | --------------------------- | ------------------------------------------------------------------------------------------- |
| Control interval | 15s | How often the WCI re-evaluates the desired instance count. |
| Utilization target | 0.70-0.80 | Target per-instance utilization for feedforward sizing. |
| Queue wait target | 3-5s | Target time a Task should wait in the queue before being picked up. |
| Drain horizon | 30-60s | How long the WCI allows for in-flight Tasks to complete when removing an instance. |
| Event cooldown | max(5s, 0.25 x scale-up latency) | Minimum time between event-triggered scale-out evaluations. |
| Scale-down stabilization | 300s | How long the WCI waits after a scale-down decision before removing instances. |
| Hold after scale-out | max(60s, 2 x scale-up latency) | Minimum time to hold new capacity before considering scale-in. |
| Min instances | >= 1 for latency-first | Minimum instance count. Set to 0 for full scale-to-zero. |
| Scaling granularity | 1 | Minimum step size for scaling changes. |
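
Two of the starting values in the table are derived from the scale-up latency rather than fixed. The sketch below
works through that arithmetic. `CloudRunTuning` and its field names are hypothetical and do not correspond to a real
configuration schema, the fixed values are midpoints picked from the ranges above, and "scale-up latency" is assumed to
mean the observed time for a new instance to become ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudRunTuning:
    scale_up_latency_s: float               # assumed: observed time to a ready instance
    control_interval_s: float = 15.0
    utilization_target: float = 0.75        # midpoint of 0.70-0.80
    queue_wait_target_s: float = 4.0        # midpoint of 3-5s
    drain_horizon_s: float = 45.0           # midpoint of 30-60s
    scale_down_stabilization_s: float = 300.0
    min_instances: int = 1
    scaling_granularity: int = 1

    @property
    def event_cooldown_s(self) -> float:
        # max(5s, 0.25 x scale-up latency)
        return max(5.0, 0.25 * self.scale_up_latency_s)

    @property
    def hold_after_scale_out_s(self) -> float:
        # max(60s, 2 x scale-up latency)
        return max(60.0, 2.0 * self.scale_up_latency_s)
```

With a 10-second scale-up latency, both floors apply: the event cooldown is 5 seconds and the hold is 60 seconds. With
a 40-second latency, the latency terms dominate: the cooldown becomes 10 seconds and the hold becomes 80 seconds.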
@@ -10,6 +10,7 @@ keywords:
- serverless
- workers
- lambda
+- cloud run
- compute provider
tags:
- Workers
@@ -68,10 +69,9 @@ The Worker Controller Instance (WCI) is a system Workflow that scales Serverless
One WCI Workflow runs per Worker Deployment Version that has a compute provider configured. The WCI runs in the same
Namespace as your Worker Deployment.

-The WCI responds to two triggers: [sync match failures](#sync-match-failure) and
-[Task Queue backlog](#task-queue-backlog). When either trigger fires, the WCI produces a scaling action, such as
-invoking the configured compute provider (for example, calling AWS Lambda's `InvokeFunction` API) to start new Workers.
-For details on how scaling works, see [Autoscaling](#autoscaling).
+The WCI responds to two triggers: sync match failures and Task Queue backlog. When either trigger fires, the WCI
+produces a scaling action, such as invoking the configured compute provider to start new Workers. For details on how
+scaling works, see [Autoscaling](#autoscaling).

You can list WCI Workflows in your Namespace:

@@ -115,28 +115,15 @@ reuse or shared state across invocations.

## Autoscaling {#autoscaling}

-The [WCI](#worker-controller-instance) automatically scales Serverless Workers based on Task Queue signals. When Tasks
-arrive and no Worker is available, the WCI invokes new Workers. When the Tasks are done, Workers exit and scale to zero.
-
-The WCI uses two signals to decide when to invoke new Workers:
-
-### Sync match failure {#sync-match-failure}
-
-When a Task is submitted, the [Matching Service](/temporal-service/temporal-server#matching-service) attempts to route
-it directly to an available Worker. If no Worker is available, the sync match fails, and the Matching Service pushes a
-signal to the WCI. The WCI then invokes a new Worker. This is the primary scaling path. Because the Matching Service
-pushes match failures to the WCI as they happen rather than the WCI polling on a timer, latency stays low and scaling is
-responsive.
-
-### Task Queue backlog {#task-queue-backlog}
-
-The WCI monitors Task Queue metadata to determine whether pending Tasks exist without enough Workers to process them. If
-there are Tasks on the queue and not enough Workers, the WCI invokes additional Workers.
+The [WCI](#worker-controller-instance) automatically scales Serverless Workers based on Task Queue signals. The
+autoscaling algorithm differs by compute provider because of differences in cold start latency, invocation duration
+limits, and provider APIs. For details on how autoscaling works on each platform, see
+[Serverless Worker autoscaling](/encyclopedia/workers/serverless-workers/autoscaling).

## Scaling with long-lived Workers {#scaling-with-long-lived-workers}

Serverless Workers can share a Task Queue with long-lived Workers. Because Serverless Workers are only invoked on
-[sync match failure](#sync-match-failure), Serverless Workers only pick up Tasks that no long-lived Worker was available
+sync match failure, Serverless Workers only pick up Tasks that no long-lived Worker was available
to handle. In practice, the Serverless Workers act as spillover capacity for the long-lived fleet.

:::caution
@@ -259,6 +246,7 @@ provider because the Worker process manages its own lifecycle.

### Supported providers

-| Provider | Description |
-| ---------- | ----------------------------------------------------------------------------- |
-| AWS Lambda | Temporal assumes an IAM role in your AWS account to invoke a Lambda function. |
+| Provider | Description |
+| ------------- | ----------------------------------------------------------------------------- |
+| AWS Lambda | Temporal assumes an IAM role in your AWS account to invoke a Lambda function. |
+| GCP Cloud Run | Temporal manages Cloud Run instance scaling through the Cloud Run admin API. |
10 changes: 9 additions & 1 deletion sidebars.js
@@ -1534,7 +1534,15 @@ module.exports = {
'encyclopedia/workers/sticky-execution',
'encyclopedia/workers/worker-shutdown',
'encyclopedia/workers/worker-versioning',
-'encyclopedia/workers/serverless-workers',
+{
+  type: 'category',
+  label: 'Serverless Workers',
+  collapsed: true,
+  link: { type: 'doc', id: 'encyclopedia/workers/serverless-workers/index' },
+  items: [
+    'encyclopedia/workers/serverless-workers/autoscaling',
+  ],
+},
],
},
{