temporalio · lennessyy · May 21, 2026
@@ -0,0 +1,180 @@
+---
+id: error-handling-strategy
+title: Error handling strategy
+sidebar_label: Error handling strategy
+description: Learn how to categorize failures, decide when to use non-retryable errors, and implement compensation patterns like the Saga pattern in Temporal applications.
+toc_max_heading_level: 4
+keywords:
+  - error handling
+  - non-retryable errors
+  - saga pattern
+  - compensation
+  - rollback
+  - idempotence
+  - failure types
+tags:
+  - Best Practices
+  - Failures
+  - Error Handling
+---
+
+Temporal automatically retries failed Activities and recovers from infrastructure failures through Durable Execution.
+But not all failures should be retried.
+This page covers how to categorize failures, when to mark errors as non-retryable, and how to implement compensation when retries are not enough.
+
+For background on how Temporal represents and propagates failures, see [Application failures](/encyclopedia/application-failures).
+
+## Categorize failures {#categorize-failures}
+
+When an operation fails, the appropriate response depends on the nature of the failure.
+Failures fall into three categories based on whether retrying can resolve them.
+
+### Transient failures
+
+A transient failure is a one-off event that resolves on its own without intervention.
+For example, a Worker happens to make a network request at the exact moment an administrator replaces a network cable.
+The cause is unlikely to affect future requests.
+
+Transient failures are resolved by retrying the operation shortly after the failure.
+Temporal's default Retry Policy handles transient failures automatically.
+
+### Intermittent failures
+
+An intermittent failure is one that recurs but resolves over time.
+For example, a service that uses rate limiting will reject requests once the threshold is reached, but will accept requests again after the rate limiter resets.
+
+Intermittent failures require retries spaced out over a longer period.
+Configure your [Retry Policy](/encyclopedia/retry-policies) with an appropriate `backoffCoefficient` and `maximumInterval` to avoid overwhelming the failing service.
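+
+To see how these two settings shape the retry schedule, here is a minimal sketch in plain Python. It assumes the usual exponential-backoff formula (initial interval multiplied by the coefficient for each retry, capped at the maximum interval). The function name `retry_intervals` and the numbers are illustrative, not SDK calls or recommended values.

```python
def retry_intervals(initial_interval, backoff_coefficient, maximum_interval, attempts):
    """Return the wait (in seconds) before each of the first `attempts` retries."""
    intervals = []
    for retry in range(attempts):
        # Each wait grows geometrically, then is capped at maximum_interval.
        wait = initial_interval * backoff_coefficient ** retry
        intervals.append(min(wait, maximum_interval))
    return intervals


# Illustrative values only: 1 s initial interval, doubling, capped at 30 s.
schedule = retry_intervals(1.0, 2.0, 30.0, 7)
print(schedule)
```
+
+A larger coefficient spreads retries out faster, while the cap keeps the wait between attempts from growing without bound.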
+
+### Permanent failures
+
+A permanent failure is one that will recur indefinitely until the cause is fixed.
+For example, a request that fails due to an invalid email address will continue to fail no matter how many times the operation retries.
+The only resolution is to correct the email address.
+
+Permanent failures cannot be resolved through retries.
+They require different input data, a code fix, or some external intervention.
+Mark these errors as non-retryable to fail fast instead of consuming resources on retries that will not succeed.
+
+## Mark errors as non-retryable {#non-retryable-errors}
+
+When your code detects a permanent failure, mark the error as non-retryable to prevent unnecessary retry attempts.
+
+Use non-retryable errors for situations like:
+
+- **Invalid input data**: A malformed email address, a negative payment amount, or a missing required field.
+- **Business rule violations**: A customer outside the service area, an order exceeding credit limits, or an expired promotion code.
+- **Authorization failures**: The caller does not have permission to perform the operation.
+- **Data validation errors**: A referenced record does not exist, or data fails integrity checks.
+
+There are two ways to mark errors as non-retryable:
+
+**In the Activity (implementer decides):** Set the `non_retryable` flag when throwing an Application Failure.
+This enforces the constraint for all callers.
+Use this when the Activity implementer knows that the error can never be resolved through retries.
+
+**In the Retry Policy (caller decides):** Add the error type to the Retry Policy's list of non-retryable error types.
+This lets different Workflows make different decisions about the same Activity.
+Use this when the decision depends on the caller's business logic.
+
+Use non-retryable errors sparingly.
+In most cases, let the Retry Policy handle retry limits through timeouts and maximum attempts.
+Reserve `non_retryable` for cases where retrying is guaranteed to be futile.
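+
+The two approaches can be modeled in a few lines of plain Python. This is a sketch of the decision logic only: `AppFailure` and `will_retry` are hypothetical stand-ins, not SDK classes, and the error type names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AppFailure:
    """Hypothetical stand-in for an Application Failure; not an SDK class."""
    message: str
    type: str
    non_retryable: bool = False


def will_retry(failure, non_retryable_error_types=()):
    """Model the retry decision: the implementer's flag wins, then the caller's list."""
    if failure.non_retryable:
        return False
    return failure.type not in non_retryable_error_types


# Implementer decides: the Activity marks the failure itself.
invalid_email = AppFailure("malformed email", type="InvalidInput", non_retryable=True)
# Caller decides: the failure is retryable unless the caller's policy lists its type.
outside_area = AppFailure("customer outside service area", type="OutsideServiceArea")

print(will_retry(invalid_email))
print(will_retry(outside_area))
print(will_retry(outside_area, non_retryable_error_types=["OutsideServiceArea"]))
```
+
+The same `OutsideServiceArea` failure is retried for one caller and not for another, which is the flexibility the Retry Policy approach provides.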
+
+For SDK-specific syntax and code examples, see the error handling guide for your language:
+- [Python](/develop/python/best-practices/error-handling)
+- [Go](/develop/go/best-practices/error-handling)
+- [.NET](/develop/dotnet/best-practices/error-handling)
+- [Ruby](/develop/ruby/best-practices/error-handling)
+
+## Design Activities for idempotence {#idempotence}
+
+Activities may execute more than once due to retries, so design them to be idempotent: an idempotent Activity produces the same result whether it executes once or multiple times.
+
+This is especially important because of an edge case in distributed systems.
+A Worker can execute an Activity, complete it, and then crash before reporting the result to the Temporal Service.
+The Activity is retried even though it completed, because the Service has no record of the completion.
+
+Use idempotency keys to prevent duplicate operations.
+Combine the Workflow Run ID and Activity ID for a value that is consistent across retries but unique across Workflow Executions.
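+
+A minimal sketch of this idea in plain Python follows. The `PaymentService` class is a hypothetical external system that deduplicates on the key, and the ID values are placeholders; in a real Activity the two IDs would come from the SDK's Activity context.

```python
def idempotency_key(workflow_run_id: str, activity_id: str) -> str:
    """Build a key that stays the same across retries of one Activity Execution."""
    return f"{workflow_run_id}-{activity_id}"


class PaymentService:
    """Hypothetical external service that deduplicates on an idempotency key."""

    def __init__(self):
        self._charges = {}

    def charge(self, key, amount):
        # A repeated key returns the recorded charge instead of charging again.
        if key not in self._charges:
            self._charges[key] = {"charge_id": len(self._charges) + 1, "amount": amount}
        return self._charges[key]

    def charge_count(self):
        return len(self._charges)


service = PaymentService()
key = idempotency_key("run-123", "5")
first = service.charge(key, 100)
retry = service.charge(key, 100)  # simulated retry of the same Activity
other = service.charge(idempotency_key("run-456", "5"), 100)  # different Workflow Execution
print(service.charge_count())
```
+
+The retried call reuses the first charge, while the call from a different Workflow Execution creates a new one.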
+
+For a detailed explanation, see [Activity idempotence](/activity-definition#idempotency).
+
+## Implement compensation with the Saga pattern {#saga-pattern}
+
+Some failures cannot be "retried away."
+When a multi-step process fails partway through, previous steps may need to be undone.
+The Saga pattern provides a structured way to handle this.
+
+### What is the Saga pattern
+
+A saga coordinates a sequence of operations where each operation has a corresponding compensating action that reverses its effects.
+If any operation in the sequence fails, the compensating actions for previously completed operations execute in reverse order.
+
+For example, an order fulfillment process might involve three steps:
+
+1. **Reserve inventory** (compensating action: release inventory)
+2. **Charge payment** (compensating action: refund payment)
+3. **Create shipment** (compensating action: cancel shipment)
+
+If the payment charge fails, the saga runs the compensation for step 1 (release inventory).
+If the shipment fails, the saga runs compensations for steps 2 and 1 (refund payment, then release inventory).
+
+### When to use it
+
+Use the Saga pattern when:
+
+- A Workflow involves multiple steps that produce side effects in external systems.
+- Each step can be reversed with a compensating action.
+- Retrying the failed step is not sufficient because earlier steps have already committed changes.
+
+The Saga pattern is not needed when Temporal's built-in retries can resolve the failure, or when operations are naturally idempotent and do not produce side effects that need to be reversed.
+
+### Designing compensating actions
+
+Each forward action needs a corresponding compensating action.
+Keep these guidelines in mind:
+
+- **Make compensating actions idempotent.** Compensations may also be retried, so they must be safe to execute more than once.
+- **Add compensations before executing the step.** Register each compensating action before running the corresponding forward action, so the compensation is available if the forward action partially completes and then fails. As a consequence, a compensation can run for a step that never took effect, so it must handle that case safely.
+- **Run compensations in reverse order.** Undo operations in the opposite order from which they were performed to maintain data consistency.
+- **Handle compensation failures.** A compensating action can itself fail. Log the failure and continue executing the remaining compensations rather than stopping. This keeps a single compensation failure from blocking the rest of the rollback, and the log records which step still needs manual attention.
+
+### Example: order fulfillment
+
+The following pseudocode shows the structure of a Saga implementation in a Workflow:
+
+```
+compensations = []
+
+try:
+    // Step 1: Reserve inventory
+    compensations.add(release_inventory)
+    execute reserve_inventory(order)
+
+    // Step 2: Charge payment
+    compensations.add(refund_payment)
+    execute charge_payment(order)
+
+    // Step 3: Create shipment
+    compensations.add(cancel_shipment)
+    execute create_shipment(order)
+
+    return success
+
+catch error:
+    // Run compensations in reverse order
+    for each compensation in reverse(compensations):
+        try:
+            execute compensation(order)
+        catch compensation_error:
+            log("Compensation failed", compensation_error)
+
+    raise ApplicationFailure("Order failed", cause: error)
+```
+
+In Temporal, compensating actions are implemented as Activities.
+Temporal manages the state of the compensation list and handles retries for each compensation Activity, making the Saga pattern more straightforward to implement than in systems without Durable Execution.
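+
+For readers who want to run the control flow, here is a sketch of the same structure in plain Python. It uses ordinary functions in place of Activities, and `StepFailed`, `run_saga`, and `record` are hypothetical names introduced for illustration. It is not Temporal SDK code.

```python
class StepFailed(Exception):
    """Hypothetical error raised by a forward step."""


def run_saga(order, steps, log):
    """Run (forward, compensation) pairs; on failure, compensate in reverse order."""
    compensations = []
    try:
        for forward, compensation in steps:
            # Register the compensation before running the forward step.
            compensations.append(compensation)
            forward(order)
        return "success"
    except StepFailed as error:
        for compensation in reversed(compensations):
            try:
                compensation(order)
            except Exception as compensation_error:
                # Log and continue with the remaining compensations.
                log.append(f"compensation failed: {compensation_error}")
        raise RuntimeError("Order failed") from error


calls = []


def record(name):
    """Return a step that only records that it was called."""
    def step(order):
        calls.append(name)
    return step


def failing_shipment(order):
    calls.append("create_shipment")
    raise StepFailed("carrier unavailable")


steps = [
    (record("reserve_inventory"), record("release_inventory")),
    (record("charge_payment"), record("refund_payment")),
    (failing_shipment, record("cancel_shipment")),
]
log = []
try:
    run_saga({"id": 42}, steps, log)
except RuntimeError as failure:
    outcome = str(failure)
print(calls)
```
+
+Because each compensation is registered before its forward step runs, `cancel_shipment` executes even though `create_shipment` failed. This is why compensations need to be safe to run when the forward action had no effect.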
+
+For SDK-specific implementations with working code examples, see the error handling guide for your language:
+- [Python](/develop/python/best-practices/error-handling#implement-saga-pattern)
@@ -0,0 +1,169 @@
+---
+id: application-failures
+title: Application failures
+sidebar_label: Application failures
+description: Learn what application failures are in Temporal, how they differ from platform failures, and how errors propagate between Activities and Workflows.
+toc_max_heading_level: 4
+keywords:
+  - application failures
+  - platform failures
+  - ApplicationFailure
+  - error propagation
+  - Workflow Task failure
+  - Workflow Execution failure
+  - event history
+tags:
+  - Concepts
+  - Failures
+---
+
+Temporal handles many types of failures automatically through Durable Execution.
+Worker crashes, network interruptions, and infrastructure outages are all recovered from without any intervention.
+But some failures require your application to detect and respond to them.
+Understanding which failures Temporal handles and which ones your application must handle is fundamental to building reliable Temporal applications.
+
+## Platform failures vs application failures {#platform-vs-application}
+
+Failures fall into two categories based on where they are detected and mitigated: platform failures and application failures.
+
+### Platform failures
+
+Platform failures occur due to issues with the infrastructure: server outages, network interruptions, Worker crashes, or other environmental factors outside of your application's control.
+Temporal's Durable Execution handles these failures transparently.
+When a Worker crashes mid-execution, another Worker picks up the work and continues from where it left off.
+Your application code does not need to account for these failures.
+
+Platform failures are resolved through **forward recovery**: the system retries the failed operation, and if the retry succeeds, the application continues from the point of failure without undoing any previous work.
+
+### Application failures
+
+Application failures are generated by your code.
+They indicate an issue with your application logic, such as invalid input data, a business rule violation, or a failed call to an external service.
+
+Application failures do not resolve through retries alone.
+Recovering from an application failure may require fixing a bug, passing different input data, or performing some external mitigation.
+
+Application failures often involve **backward recovery**: the system undoes some of the work that has already been performed to return to a previous state.
+For example, if a payment step fails after inventory has already been reserved, the application may need to release that inventory.
+
+For guidance on categorizing failures and deciding how to handle them, see [Error handling strategy](/best-practices/error-handling-strategy).
+
+## How Temporal represents failures {#failure-representation}
+
+All failures in Temporal are represented as a Failure in the API.
+Each SDK exposes failures using the conventions of its language: what is called a Failure in one SDK might be called an Error or Exception in another.
+
+Most SDKs have a base class that other failure types extend.
+This provides a common interface and shared behavior across different failure types:
+
+- TypeScript: [TemporalFailure](https://typescript.temporal.io/api/classes/common.TemporalFailure)
+- Java: [TemporalFailure](https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/failure/TemporalFailure.html)
+- Python: [FailureError](https://python.temporal.io/temporalio.exceptions.FailureError.html)
+- Go: Uses specific error types rather than a base class
+
+Temporal categorizes failures into several types:
+
+| Failure type | Description |
+| :--- | :--- |
+| **Application Failure** | Raised by your code to indicate application-specific errors. This is the only failure type you create directly. |
+| **Activity Failure** | Wraps an error from an Activity Execution. The `cause` field contains the underlying error. |
+| **Child Workflow Failure** | Wraps an error from a Child Workflow Execution. |
+| **Timeout Failure** | Occurs when an Activity or Workflow exceeds its configured timeout. |
+| **Cancelled Failure** | Results from cancellation of a Workflow, Activity, or Timer. |
+| **Terminated Failure** | Occurs when a Workflow Execution is forcefully terminated. |
+| **Server Failure** | Originates from the Temporal Service itself. |
+
+Do not extend the base failure class or any of its children in your code.
+The provided classes are designed to work with Temporal's serialization mechanism, which converts failures to Protocol Buffer messages for communication across process and language boundaries.
+Custom subclasses can break this serialization and lead to unexpected behavior.
+
+For a complete reference of all failure types and their SDK-specific classes, see [Failures reference](/references/failures).
+
+### Application Failure
+
+Application Failure is the failure type you use to communicate application-specific errors.
+It is the only failure type designed to be created and thrown directly by your code.
+
+When you throw an Application Failure, you can set these fields:
+
+- **message**: A human-readable description of the error.
+- **type**: A string that categorizes the failure (for example, `"InvalidInput"` or `"InsufficientFunds"`).
+- **non_retryable**: A flag that prevents the operation from being retried, regardless of the Retry Policy.
+- **details**: Additional data about the failure.
+
+Any non-Temporal error thrown from an Activity is automatically converted to an Application Failure.
+During this conversion, the error's type name, message, and call stack are preserved, and `non_retryable` is set to `false`.
+
+### Failure Converters
+
+When Temporal returns a failure, the default Failure Converter copies error messages and stack traces as plain text.
+This text is accessible in the Web UI and through the CLI.
+
+If your errors might contain sensitive information, you can encrypt the message and stack trace by configuring a custom Failure Converter with a codec.
+See [Failure Converter](/failure-converter) for details.
+
+## Workflow Task failures vs Workflow Execution failures {#task-vs-execution}
+
+When an error occurs in Workflow code, it produces one of two outcomes depending on the error type: a Workflow Task failure or a Workflow Execution failure.
+Understanding the difference is important because they have very different implications.
+
+### Workflow Task failures
+
+A Workflow Task failure occurs when the Workflow code throws an error that does not extend the Temporal base failure class.
+This includes language-level errors (null reference, division by zero, type errors) and non-determinism errors.
+
+Workflow Task failures are treated as transient problems, typically bugs that can be fixed with a code deployment.
+Temporal retries them automatically, giving you the opportunity to fix the code and redeploy without losing the state of existing Workflow Executions.
+
+When a Workflow Task failure is retried:
+
+1. The Worker removes the Workflow Execution from its cache.
+2. The Temporal Service schedules a new Workflow Task on the original Task Queue.
+3. A Worker picks up the Task and replays the Workflow Execution from Event History to restore the correct state before continuing.
+
+### Workflow Execution failures
+
+A Workflow Execution failure occurs when the Workflow code throws a Temporal failure, such as an Application Failure.
+This puts the Workflow Execution into the "Failed" state permanently.
+No more attempts are made to progress the execution.
+
+Use Workflow Execution failures for permanent business logic failures where retrying the same code with the same input will not produce a different result.
+
+## How errors propagate {#error-propagation}
+
+When an Activity fails, Temporal wraps the error in an Activity Failure before delivering it to the Workflow.
+The Activity Failure provides context about the failure, including the Activity Type, the number of retry attempts, and the original cause.
+
+The original error is in the `cause` field.
+For example, if an Activity throws an Application Failure with `type: "InvalidInput"`, the Workflow receives an Activity Failure whose `cause` is that Application Failure.
+If an Activity times out instead, the `cause` is a Timeout Failure.
+
+This wrapping pattern applies to other execution types as well.
+A failed Child Workflow delivers a Child Workflow Failure to the parent Workflow, with the original error in the `cause` field.
+
+If a Temporal failure propagates unhandled through Workflow code, it fails the Workflow Execution.
+The exception is Cancelled Failure, which puts the Workflow in "Cancelled" state instead of "Failed".
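+
+The wrapping described above can be pictured as a chain of `cause` links. The following plain-Python sketch uses hypothetical stand-in classes (they are not the SDK's failure classes, and `SendEmail` is a made-up Activity Type) to show how Workflow code might unwrap an Activity Failure to reach the original error.

```python
class ApplicationFailure(Exception):
    """Hypothetical stand-in for an Application Failure; not an SDK class."""

    def __init__(self, message, type):
        super().__init__(message)
        self.type = type


class ActivityFailure(Exception):
    """Hypothetical stand-in for the wrapper delivered to the Workflow."""

    def __init__(self, activity_type, cause):
        super().__init__(f"Activity {activity_type} failed")
        self.activity_type = activity_type
        self.cause = cause


def root_cause(failure):
    """Follow `cause` links until reaching the innermost failure."""
    while getattr(failure, "cause", None) is not None:
        failure = failure.cause
    return failure


original = ApplicationFailure("email address is malformed", type="InvalidInput")
wrapped = ActivityFailure("SendEmail", cause=original)
inner = root_cause(wrapped)
print(inner.type)
```
+
+Inspecting the innermost failure's `type` lets Workflow code branch on the original error rather than on the generic wrapper.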
+
+## Failures in Event History {#event-history}
+
+Failures are recorded in Event History, which provides a detailed record for debugging.
+
+### Activity failures
+
+An Activity Execution that completes results in three Events: `ActivityTaskScheduled`, `ActivityTaskStarted`, and `ActivityTaskCompleted`.
+
+If an Activity fails and the Retry Policy does not allow another attempt, the Temporal Service adds an `ActivityTaskFailed` Event that contains the error details.
+If an Activity times out, an `ActivityTaskTimedOut` Event is added instead.
+
+While an Activity is running, `ActivityTaskScheduled` is the most recent Event visible for that Activity.
+The `ActivityTaskStarted` Event is not written until the Activity Task closes, because the final retry attempt number (an attribute of `ActivityTaskStarted`) is not known until then.
+
+You can view pending Activity Executions in the Web UI's Pending Activities section, which shows the Activity Type, current retry attempt, remaining attempts, and heartbeat information.
+
+### Workflow Execution failures
+
+An Activity failure does not automatically cause a Workflow Execution failure.
+The Workflow Execution fails only if the Workflow code leaves the Activity error unhandled or intentionally re-raises it.
+
+When a Workflow Execution fails, the Temporal Service adds a `WorkflowExecutionFailed` Event.
+If the failure was caused by an unhandled Activity error, the `activityFailureInfo` is attached to that Event.