180 changes: 180 additions & 0 deletions docs/best-practices/error-handling-strategy.mdx
@@ -0,0 +1,180 @@
---
id: error-handling-strategy
title: Error handling strategy
sidebar_label: Error handling strategy
description: Learn how to categorize failures, decide when to use non-retryable errors, and implement compensation patterns like the Saga pattern in Temporal applications.
toc_max_heading_level: 4
keywords:
- error handling
- non-retryable errors
- saga pattern
- compensation
- rollback
- idempotence
- failure types
tags:
- Best Practices
- Failures
- Error Handling
---

Temporal automatically retries failed Activities and recovers from infrastructure failures through Durable Execution.
But not all failures should be retried.
This page covers how to categorize failures, when to mark errors as non-retryable, and how to implement compensation when retries are not enough.

For background on how Temporal represents and propagates failures, see [Application failures](/encyclopedia/application-failures).

## Categorize failures {#categorize-failures}

When an operation fails, the appropriate response depends on the nature of the failure.
Failures fall into three categories based on whether retrying can resolve them.

### Transient failures

A transient failure is a one-off event that resolves on its own without intervention.
For example, a Worker happens to make a network request at the exact moment an administrator replaces a network cable.
The cause is unlikely to affect future requests.

Transient failures are resolved by retrying the operation shortly after the failure.
Temporal's default Retry Policy handles transient failures automatically.

### Intermittent failures

An intermittent failure is one that recurs but resolves over time.
For example, a service that uses rate limiting will reject requests once the threshold is reached, but will accept requests again after the rate limiter resets.

Intermittent failures require retries spaced out over a longer period.
Configure your [Retry Policy](/encyclopedia/retry-policies) with an appropriate `backoffCoefficient` and `maximumInterval` to avoid overwhelming the failing service.
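
As a rough illustration of how these two settings interact, the sketch below computes the wait before each retry as the initial interval multiplied by the backoff coefficient once per prior retry, capped at the maximum interval. The function name `retry_intervals` and the 10-second cap are assumptions made for this example; this is plain Python, not SDK code.

```python
def retry_intervals(
    initial: float, coefficient: float, maximum: float, retries: int
) -> list[float]:
    """Return the wait, in seconds, before each of the first `retries` retries."""
    # Each wait grows by `coefficient` until it reaches the `maximum` cap.
    return [min(initial * coefficient**n, maximum) for n in range(retries)]


# Hypothetical values: 1 second initial interval, coefficient 2.0, 10 second cap.
print(retry_intervals(1.0, 2.0, 10.0, 6))
# prints [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

A larger coefficient spreads retries out faster, and the cap keeps the wait from growing without bound.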

### Permanent failures

A permanent failure is one that will recur indefinitely until the cause is fixed.
For example, a request that fails due to an invalid email address will continue to fail no matter how many times the operation retries.
The only resolution is to correct the email address.

Permanent failures cannot be resolved through retries.
They require different input data, a code fix, or some external intervention.
Mark these errors as non-retryable to fail fast instead of consuming resources on retries that will not succeed.

## Mark errors as non-retryable {#non-retryable-errors}

When your code detects a permanent failure, mark the error as non-retryable to prevent unnecessary retry attempts.

Use non-retryable errors for situations like:

- **Invalid input data**: A malformed email address, a negative payment amount, or a missing required field.
- **Business rule violations**: A customer outside the service area, an order exceeding credit limits, or an expired promotion code.
- **Authorization failures**: The caller does not have permission to perform the operation.
- **Data validation errors**: A referenced record does not exist, or data fails integrity checks.

There are two ways to mark errors as non-retryable:

**In the Activity (implementer decides):** Set the `non_retryable` flag when throwing an Application Failure.
This enforces the constraint for all callers.
Use this when the Activity implementer knows that the error can never be resolved through retries.

**In the Retry Policy (caller decides):** Add the error type to the Retry Policy's list of non-retryable error types.
This lets different Workflows make different decisions about the same Activity.
Use this when the decision depends on the caller's business logic.
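
The two approaches can be pictured as two checks made before each retry. The following is a simplified model in plain Python, not the SDK's implementation: `AppFailure` and `should_retry` are hypothetical names standing in for an Application Failure and for the retry decision.

```python
class AppFailure(Exception):
    """Hypothetical stand-in for an Application Failure."""

    def __init__(self, message: str, type: str, non_retryable: bool = False):
        super().__init__(message)
        self.type = type
        self.non_retryable = non_retryable


def should_retry(failure: AppFailure, non_retryable_error_types: list[str]) -> bool:
    # Implementer decides: the flag on the failure applies to every caller.
    if failure.non_retryable:
        return False
    # Caller decides: the Retry Policy lists types this caller will not retry.
    return failure.type not in non_retryable_error_types


invalid_email = AppFailure("malformed email", type="InvalidInput", non_retryable=True)
card_declined = AppFailure("card declined", type="CardDeclined")

print(should_retry(invalid_email, []))                # False for all callers
print(should_retry(card_declined, []))                # True for this caller
print(should_retry(card_declined, ["CardDeclined"]))  # False for this caller
```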

Use non-retryable errors sparingly.
In most cases, let the Retry Policy handle retry limits through timeouts and maximum attempts.
Reserve `non_retryable` for cases where retrying is guaranteed to be futile.

For SDK-specific syntax and code examples, see the error handling guide for your language:
- [Python](/develop/python/best-practices/error-handling)
- [Go](/develop/go/best-practices/error-handling)
- [.NET](/develop/dotnet/best-practices/error-handling)
- [Ruby](/develop/ruby/best-practices/error-handling)

## Design Activities for idempotence {#idempotence}

Activities may execute more than once due to retries, so design them to be idempotent: producing the same result whether executed once or multiple times.

This is especially important because of an edge case in distributed systems.
A Worker can execute an Activity, complete it, and then crash before reporting the result to the Temporal Service.
The Activity is retried even though it completed, because the Service has no record of the completion.

Use idempotency keys to prevent duplicate operations.
Combine the Workflow Run ID and Activity ID for a value that is consistent across retries but unique across Workflow Executions.
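
A minimal sketch of the idea, assuming a downstream service that deduplicates requests by key. `PaymentService`, `idempotency_key`, and the ID values are hypothetical; in a real Activity the two IDs would come from the SDK's Activity context.

```python
def idempotency_key(workflow_run_id: str, activity_id: str) -> str:
    """Build a key that stays the same across retries of one Activity."""
    return f"{workflow_run_id}-{activity_id}"


class PaymentService:
    """Hypothetical external service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._charges: dict[str, int] = {}

    def charge(self, key: str, amount_cents: int) -> int:
        # setdefault records the charge only the first time a key is seen.
        return self._charges.setdefault(key, amount_cents)

    @property
    def charge_count(self) -> int:
        return len(self._charges)


service = PaymentService()
key = idempotency_key("run-123", "activity-5")
service.charge(key, 4200)  # first attempt
service.charge(key, 4200)  # retry after the completion was lost
print(service.charge_count)  # prints 1
```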

For a detailed explanation, see [Activity idempotence](/activity-definition#idempotency).

## Implement compensation with the Saga pattern {#saga-pattern}

Some failures cannot be resolved by retrying.
When a multi-step process fails partway through, previous steps may need to be undone.
The Saga pattern provides a structured way to handle this.

### What is the Saga pattern

A saga coordinates a sequence of operations where each operation has a corresponding compensating action that reverses its effects.
If any operation in the sequence fails, the compensating actions for previously completed operations execute in reverse order.

For example, an order fulfillment process might involve three steps:

1. **Reserve inventory** (compensating action: release inventory)
2. **Charge payment** (compensating action: refund payment)
3. **Create shipment** (compensating action: cancel shipment)

If the payment charge fails, the saga runs the compensation for step 1 (release inventory).
If the shipment fails, the saga runs compensations for steps 2 and 1 (refund payment, then release inventory).

### When to use it

Use the Saga pattern when:

- A Workflow involves multiple steps that produce side effects in external systems.
- Each step can be reversed with a compensating action.
- Retrying the failed step is not sufficient because earlier steps have already committed changes.

The Saga pattern is not needed when Temporal's built-in retries can resolve the failure, or when operations are naturally idempotent and do not produce side effects that need to be reversed.

### Designing compensating actions

Each forward action needs a corresponding compensating action.
Keep these guidelines in mind:

- **Make compensating actions idempotent.** Compensations may also be retried, so they must be safe to execute more than once.
- **Add compensations before executing the step.** Register each compensating action before running the corresponding forward action, so the compensation is available if the forward action partially completes and then fails.
- **Run compensations in reverse order.** Undo operations in the opposite order from which they were performed to maintain data consistency.
- **Handle compensation failures.** A compensating action can itself fail. Log the failure and continue executing remaining compensations rather than stopping. This prevents a single compensation failure from leaving the system in a partially rolled-back state.

### Example: order fulfillment

The following pseudocode shows the structure of a Saga implementation in a Workflow:

```
compensations = []

try:
    // Step 1: Reserve inventory
    compensations.add(release_inventory)
    execute reserve_inventory(order)

    // Step 2: Charge payment
    compensations.add(refund_payment)
    execute charge_payment(order)

    // Step 3: Create shipment
    compensations.add(cancel_shipment)
    execute create_shipment(order)

    return success

catch error:
    // Run compensations in reverse order
    for each compensation in reverse(compensations):
        try:
            execute compensation(order)
        catch compensation_error:
            log("Compensation failed", compensation_error)

    raise ApplicationFailure("Order failed", cause: error)
```

In Temporal, compensating actions are implemented as Activities.
Temporal manages the state of the compensation list and handles retries for each compensation Activity, making the Saga pattern more straightforward to implement than in systems without Durable Execution.
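
To see the control flow run end to end, here is a plain-Python sketch of the same structure. It does not use the Temporal SDK: the step functions are hypothetical stand-ins for Activities, `create_shipment` is hard-coded to fail, and a `RuntimeError` stands in for the Application Failure.

```python
from typing import Callable

log: list[str] = []


def reserve_inventory(order: str) -> None:
    log.append(f"reserve {order}")


def release_inventory(order: str) -> None:
    log.append(f"release {order}")


def charge_payment(order: str) -> None:
    log.append(f"charge {order}")


def refund_payment(order: str) -> None:
    log.append(f"refund {order}")


def create_shipment(order: str) -> None:
    raise RuntimeError("carrier unavailable")  # simulated permanent failure


def cancel_shipment(order: str) -> None:
    log.append(f"cancel {order}")


def fulfill_order(order: str) -> None:
    compensations: list[Callable[[str], None]] = []
    try:
        # Register each compensation before running its forward step.
        compensations.append(release_inventory)
        reserve_inventory(order)
        compensations.append(refund_payment)
        charge_payment(order)
        compensations.append(cancel_shipment)
        create_shipment(order)
    except Exception as error:
        # Undo in reverse order; keep going if one compensation fails.
        for compensation in reversed(compensations):
            try:
                compensation(order)
            except Exception as compensation_error:
                log.append(f"compensation failed: {compensation_error}")
        raise RuntimeError("Order failed") from error


try:
    fulfill_order("order-1")
except RuntimeError:
    pass
print(log)
```

Because `cancel_shipment` is registered before `create_shipment` runs, it executes even though the shipment was never created, which is why compensations must be safe to run when the forward step did not complete.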

For SDK-specific implementations with working code examples, see the error handling guide for your language:
- [Python](/develop/python/best-practices/error-handling#implement-saga-pattern)
169 changes: 169 additions & 0 deletions docs/encyclopedia/application-failures.mdx
@@ -0,0 +1,169 @@
---
id: application-failures
title: Application failures
sidebar_label: Application failures
description: Learn what application failures are in Temporal, how they differ from platform failures, and how errors propagate between Activities and Workflows.
toc_max_heading_level: 4
keywords:
- application failures
- platform failures
- ApplicationFailure
- error propagation
- Workflow Task failure
- Workflow Execution failure
- event history
tags:
- Concepts
- Failures
---

Temporal handles many types of failures automatically through Durable Execution.
Temporal recovers from Worker crashes, network interruptions, and infrastructure outages without any intervention.
But some failures require your application to detect and respond to them.
Understanding which failures Temporal handles and which ones your application must handle is fundamental to building reliable Temporal applications.

## Platform failures vs application failures {#platform-vs-application}

Failures fall into two categories based on where they are detected and mitigated: platform failures and application failures.

### Platform failures

Platform failures occur due to issues with the infrastructure: server outages, network interruptions, Worker crashes, or other environmental factors outside of your application's control.
Temporal's Durable Execution handles these failures transparently.
When a Worker crashes mid-execution, another Worker picks up the work and continues from where it left off.
Your application code does not need to account for these failures.

Platform failures are resolved through **forward recovery**: the system retries the failed operation, and if the retry succeeds, the application continues from the point of failure without undoing any previous work.

### Application failures

Application failures are generated by your code.
They indicate an issue with your application logic, such as invalid input data, a business rule violation, or a failed call to an external service.

Application failures typically do not resolve through retries alone.
Recovering from an application failure may require fixing a bug, passing different input data, or performing some external mitigation.

Application failures often involve **backward recovery**: the system undoes some of the work that has already been performed to return to a previous state.
For example, if a payment step fails after inventory has already been reserved, the application may need to release that inventory.

For guidance on categorizing failures and deciding how to handle them, see [Error handling strategy](/best-practices/error-handling-strategy).

## How Temporal represents failures {#failure-representation}

All failures in Temporal are represented as a Failure in the API.
Each SDK exposes failures using the conventions of its language: what is called a Failure in one SDK might be called an Error or Exception in another.

Most SDKs have a base class that other failure types extend.
This provides a common interface and shared behavior across different failure types:

- TypeScript: [TemporalFailure](https://typescript.temporal.io/api/classes/common.TemporalFailure)
- Java: [TemporalFailure](https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/failure/TemporalFailure.html)
- Python: [FailureError](https://python.temporal.io/temporalio.exceptions.FailureError.html)
- Go: Uses specific error types rather than a base class

Temporal categorizes failures into several types:

| Failure type | Description |
| :--- | :--- |
| **Application Failure** | Raised by your code to indicate application-specific errors. This is the only failure type you create directly. |
| **Activity Failure** | Wraps an error from an Activity Execution. The `cause` field contains the underlying error. |
| **Child Workflow Failure** | Wraps an error from a Child Workflow Execution. |
| **Timeout Failure** | Occurs when an Activity or Workflow exceeds its configured timeout. |
| **Cancelled Failure** | Results from cancellation of a Workflow, Activity, or Timer. |
| **Terminated Failure** | Occurs when a Workflow Execution is forcefully terminated. |
| **Server Failure** | Originates from the Temporal Service itself. |

Do not extend the base failure class or any of its children in your code.
The provided classes are designed to work with Temporal's serialization mechanism, which converts failures to Protocol Buffer messages for communication across process and language boundaries.
Custom subclasses can break this serialization and lead to unexpected behavior.

For a complete reference of all failure types and their SDK-specific classes, see [Failures reference](/references/failures).

### Application Failure

Application Failure is the failure type you use to communicate application-specific errors.
It is the only failure type designed to be created and thrown directly by your code.

When you throw an Application Failure, you can set these fields:

- **message**: A human-readable description of the error.
- **type**: A string that categorizes the failure (for example, `"InvalidInput"` or `"InsufficientFunds"`).
- **non_retryable**: A flag that prevents the operation from being retried, regardless of the Retry Policy.
- **details**: Additional data about the failure.

Any non-Temporal error thrown from an Activity is automatically converted to an Application Failure.
During this conversion, the error's type name, message, and call stack are preserved, and `non_retryable` is set to `false`.
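
The conversion can be pictured roughly as follows. This is a simplified model, not the SDK's code: `ApplicationFailureInfo` and `to_application_failure` are hypothetical names, and the call stack is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class ApplicationFailureInfo:
    """Hypothetical record holding a subset of Application Failure fields."""

    message: str
    type: str
    non_retryable: bool


def to_application_failure(error: Exception) -> ApplicationFailureInfo:
    # The error's class name becomes the failure type, its text becomes the
    # message, and the failure stays retryable.
    return ApplicationFailureInfo(
        message=str(error),
        type=type(error).__name__,
        non_retryable=False,
    )


info = to_application_failure(ValueError("amount must be positive"))
print(info.type, info.non_retryable)  # prints ValueError False
```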

### Failure Converters

When Temporal returns a failure, the default Failure Converter copies error messages and stack traces as plain text.
This text is accessible in the Web UI and through the CLI.

If your errors might contain sensitive information, you can encrypt the message and stack trace by configuring a custom Failure Converter with a codec.
See [Failure Converter](/failure-converter) for details.

## Workflow Task failures vs Workflow Execution failures {#task-vs-execution}

When an error occurs in Workflow code, it produces one of two outcomes depending on the error type: a Workflow Task failure or a Workflow Execution failure.
Understanding the difference is important because they have very different implications.

### Workflow Task failures

A Workflow Task failure occurs when the Workflow code throws an error that does not extend the Temporal base failure class.
This includes language-level errors (null reference, division by zero, type errors) and non-determinism errors.

Workflow Task failures are treated as transient problems, typically bugs that can be fixed with a code deployment.
Temporal retries them automatically, giving you the opportunity to fix the code and redeploy without losing the state of existing Workflow Executions.

When a Workflow Task failure is retried:

1. The Worker removes the Workflow Execution from its cache.
2. The Temporal Service schedules a new Workflow Task on the original Task Queue.
3. A Worker picks up the Task and replays the Workflow Execution from Event History to restore the correct state before continuing.

### Workflow Execution failures

A Workflow Execution failure occurs when the Workflow code throws a Temporal failure, such as an Application Failure.
This puts the Workflow Execution into the "Failed" state permanently.
No more attempts are made to progress the execution.

Use Workflow Execution failures for permanent business logic failures where retrying the same code with the same input will not produce a different result.
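
The distinction between the two outcomes can be summarized as a type check on the error thrown by Workflow code. The sketch below is a conceptual model only: `TemporalFailureBase` is a hypothetical stand-in for an SDK's base failure class, and the returned strings are labels chosen for this example.

```python
class TemporalFailureBase(Exception):
    """Hypothetical stand-in for an SDK's base failure class."""


class ApplicationFailure(TemporalFailureBase):
    """Hypothetical stand-in for an Application Failure."""


def workflow_outcome(error: Exception) -> str:
    # Temporal failures end the Workflow Execution in the Failed state.
    if isinstance(error, TemporalFailureBase):
        return "workflow execution failed"
    # Anything else fails only the current Workflow Task, which is retried.
    return "workflow task failed, will retry"


print(workflow_outcome(ApplicationFailure("insufficient funds")))
print(workflow_outcome(ZeroDivisionError("division by zero")))
```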

## How errors propagate {#error-propagation}

When an Activity fails, Temporal wraps the error in an Activity Failure before delivering it to the Workflow.
The Activity Failure provides context about the failure, including the Activity Type, the number of retry attempts, and the original cause.

The original error is in the `cause` field.
For example, if an Activity throws an Application Failure with `type: "InvalidInput"`, the Workflow receives an Activity Failure whose `cause` is that Application Failure.
If an Activity times out instead, the `cause` is a Timeout Failure.

This wrapping pattern applies to other execution types as well.
A failed Child Workflow delivers a Child Workflow Failure to the parent Workflow, with the original error in the `cause` field.
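
The wrapping described above resembles Python's own exception chaining, which the following sketch uses to model it. The `ActivityFailure` and `ApplicationFailure` classes here are hypothetical stand-ins defined for the example, not the SDK's classes, and `root_cause` is an illustrative helper.

```python
class ApplicationFailure(Exception):
    """Hypothetical stand-in for an Application Failure."""

    def __init__(self, message: str, type: str):
        super().__init__(message)
        self.type = type


class ActivityFailure(Exception):
    """Hypothetical stand-in for the wrapper delivered to the Workflow."""


def run_activity() -> None:
    try:
        raise ApplicationFailure("malformed email", type="InvalidInput")
    except ApplicationFailure as original:
        # Wrap the original error; it stays reachable through __cause__.
        raise ActivityFailure("activity failed") from original


def root_cause(failure: BaseException) -> BaseException:
    # Follow the chain of causes down to the original error.
    while failure.__cause__ is not None:
        failure = failure.__cause__
    return failure


try:
    run_activity()
except ActivityFailure as wrapped:
    cause = root_cause(wrapped)
    print(type(cause).__name__, cause.type)  # prints ApplicationFailure InvalidInput
```

Workflow code that needs to branch on the original error inspects the cause rather than the wrapper.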

If a Temporal failure propagates unhandled through Workflow code, it fails the Workflow Execution.
The exception is Cancelled Failure, which puts the Workflow in "Cancelled" state instead of "Failed".

## Failures in Event History {#event-history}

Failures are recorded in Event History, which provides a detailed record for debugging.

### Activity failures

An Activity Execution that completes successfully produces three Events: `ActivityTaskScheduled`, `ActivityTaskStarted`, and `ActivityTaskCompleted`.

If an Activity fails and the Retry Policy does not cause it to retry, the Temporal Service adds an `ActivityTaskFailed` Event that contains the error details.
If an Activity times out, an `ActivityTaskTimedOut` Event is added instead.

While an Activity is running, `ActivityTaskScheduled` is the most recent Event visible for that Activity.
The `ActivityTaskStarted` Event is not written until the Activity Task closes, because the final retry attempt number (an attribute of `ActivityTaskStarted`) is not known until then.

You can view pending Activity Executions in the Web UI's Pending Activities section, which shows the Activity Type, current retry attempt, remaining attempts, and heartbeat information.

### Workflow Execution failures

An Activity failure does not automatically cause a Workflow Execution failure.
The Workflow Execution fails only if the Workflow code does not handle the error, or handles it and intentionally re-raises it.

When a Workflow Execution fails, the Temporal Service adds a `WorkflowExecutionFailed` Event.
If the failure was caused by an unhandled Activity error, the `activityFailureInfo` is attached to that Event.