Using JavaScript or TypeScript? Check out the AWS Durable Execution SDK for JavaScript instead.
- What is the Durable Execution SDK?
- Key features
- Installation
- Quick example
- Core concepts
- Architecture
- Use cases
- Getting help
- License
The AWS Durable Execution SDK for Python lets you build reliable, long-running workflows in AWS Lambda. Your functions can pause execution, wait for external events, retry failed operations, and resume exactly where they left off—even if Lambda recycles your execution environment.
The SDK provides a DurableContext that gives you operations like steps, waits, callbacks, and parallel execution. Each operation is checkpointed automatically, so your workflow state is preserved across interruptions.
- Automatic checkpointing - Your workflow state is saved automatically after each operation
- Durable steps - Execute code with configurable retry strategies and at-most-once or at-least-once semantics
- Wait operations - Pause execution for seconds, minutes, or hours without blocking Lambda resources
- Callbacks - Wait for external systems to respond with results or approvals
- Parallel execution - Run multiple operations concurrently with configurable completion criteria
- Map operations - Process collections in parallel with batching and failure tolerance
- Child contexts - Isolate nested workflows for better organization and error handling
- Structured logging - Integrate with your logger to track execution flow and debug issues
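To make "configurable retry strategies" more concrete, here is a minimal sketch of retrying with exponential backoff in plain Python. It does not use the SDK: `retry_with_backoff`, `max_attempts`, `base_delay`, and the injectable `sleep` parameter are hypothetical names chosen for illustration, and the SDK's own retry configuration may look different.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is used up (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles after each failure: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky() -> str:
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# Record the requested delays instead of actually sleeping.
delays = []
result = retry_with_backoff(flaky, max_attempts=5, base_delay=1.0, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast and lets a test inspect the backoff schedule without waiting.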
Install the SDK using pip:

    pip install aws-durable-execution-sdk-python

Here's a simple durable function that processes an order:

    from aws_durable_execution_sdk_python import (
        DurableContext,
        durable_execution,
        durable_step,
    )
    from aws_durable_execution_sdk_python.config import Duration

    @durable_step
    def validate_order(order_id: str) -> dict:
        # Validation logic here
        return {"order_id": order_id, "valid": True}

    @durable_step
    def charge_payment(order_id: str, amount: float) -> dict:
        # Payment processing logic here
        return {"transaction_id": "txn_123", "status": "completed"}

    @durable_step
    def fulfill_order(order_id: str) -> dict:
        # Fulfillment logic here
        return {"tracking_number": "TRK123456"}

    @durable_execution
    def process_order(event: dict, context: DurableContext) -> dict:
        order_id = event["order_id"]
        amount = event["amount"]

        # Step 1: Validate the order
        validation = context.step(validate_order(order_id))
        if not validation["valid"]:
            return {"status": "failed", "reason": "invalid_order"}

        # Step 2: Charge payment
        payment = context.step(charge_payment(order_id, amount))

        # Step 3: Wait for payment confirmation (simulated)
        context.wait(Duration.from_seconds(5))

        # Step 4: Fulfill the order
        fulfillment = context.step(fulfill_order(order_id))

        return {
            "status": "completed",
            "order_id": order_id,
            "transaction_id": payment["transaction_id"],
            "tracking_number": fulfillment["tracking_number"]
        }

Each context.step() call is checkpointed automatically. If Lambda recycles your execution environment, the function resumes from the last completed step.
A durable function is a Lambda function decorated with @durable_execution that can be checkpointed and resumed. The function receives a DurableContext that provides methods for durable operations.
Operations are units of work in a durable execution. Each operation type serves a specific purpose:
- Steps - Execute code and checkpoint the result with retry support
- Waits - Pause execution for a specified duration without blocking Lambda
- Callbacks - Wait for external systems to respond with results
- Invoke - Call other durable functions to compose complex workflows
- Child contexts - Isolate nested workflows for better organization
- Parallel - Execute multiple operations concurrently with completion criteria
- Map - Process collections in parallel with batching and failure tolerance
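The idea of "completion criteria" for parallel operations can be illustrated with the standard library alone. The sketch below is not the SDK's parallel operation: `run_parallel` and `min_successful` are hypothetical names, and the criterion is checked only after every task has finished, which is the simplest possible policy.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List


def run_parallel(tasks: List[Callable[[], Any]], min_successful: int) -> Dict[str, Any]:
    """Run all tasks concurrently, then check whether enough of them succeeded."""
    successes: List[Any] = []
    failures: List[str] = []
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in as_completed(futures):
            try:
                successes.append(future.result())
            except Exception as exc:
                failures.append(str(exc))  # tolerate the failure and keep going
    return {
        "completed": len(successes) >= min_successful,
        # Sorted because completion order is not deterministic
        # (assumes the results are comparable, as they are here).
        "successes": sorted(successes),
        "failures": failures,
    }


def fail() -> int:
    raise ValueError("boom")


# Two of three tasks succeed, which satisfies min_successful=2.
outcome = run_parallel([lambda: 1, lambda: 2, fail], min_successful=2)
```

A stricter criterion (for example, requiring every task to succeed) is the same check with `min_successful=len(tasks)`.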
Checkpoints are saved states of execution that allow resumption. When your function calls context.step() or other operations, the SDK creates a checkpoint and sends it to AWS. If Lambda recycles your environment or your function waits for an external event, execution can resume from the last checkpoint.
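When your function resumes, completed operations don't re-execute. Instead, they return their checkpointed results immediately. Your function code can therefore run multiple times, but the side effects of a completed operation are not repeated. Code that sits outside an operation does run again on every resume, so keep side effects inside steps.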
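A toy model can make the replay behavior easier to see. The sketch below is an illustration only, not the SDK's implementation: `ReplayContext` is a hypothetical class that identifies operations by their call order and keeps results in a plain dict standing in for the persisted checkpoints.

```python
from typing import Any, Callable, Dict


class ReplayContext:
    """Toy stand-in for a durable context. Hypothetical, not the SDK's class."""

    def __init__(self, checkpoints: Dict[int, Any]) -> None:
        self.checkpoints = checkpoints  # persisted across invocations
        self._next_id = 0               # operation counter for this invocation

    def step(self, func: Callable[[], Any]) -> Any:
        op_id = self._next_id
        self._next_id += 1
        if op_id in self.checkpoints:
            # Replay: return the saved result without calling func again.
            return self.checkpoints[op_id]
        result = func()
        self.checkpoints[op_id] = result  # checkpoint the completed step
        return result


side_effects = []


def charge_payment() -> str:
    side_effects.append("charged card")  # the side effect we must not repeat
    return "txn_123"


def workflow(ctx: ReplayContext) -> str:
    return ctx.step(charge_payment)


store = {}  # stands in for the persisted checkpoint log
first = workflow(ReplayContext(store))   # executes charge_payment and checkpoints it
second = workflow(ReplayContext(store))  # replays the saved result instead
```

Both invocations return the same transaction ID, but `side_effects` contains a single entry because the second invocation reads the result from `store` rather than calling `charge_payment` again.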
The SDK provides decorators to mark functions as durable:
- @durable_execution - Marks your Lambda handler as a durable function
- @durable_step - Marks a function that can be used with context.step()
- @durable_with_child_context - Marks a function that receives a child context
The SDK integrates with AWS Lambda's durable execution service to provide reliable, long-running workflows. Here's how it works:
- Execution starts - Lambda invokes your function with a DurableContext
- Operations checkpoint - Each context.step(), context.wait(), or other operation creates a checkpoint
- State is saved - Checkpoints are sent to the durable execution service and persisted
- Execution may pause - Lambda can recycle your environment or wait for external events
- Execution resumes - When ready, Lambda invokes your function again with the saved state
- Operations replay - Completed operations return their saved results instantly
- New operations execute - Your function continues from where it left off
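The lifecycle above can be sketched as a small simulation. Everything here is hypothetical and greatly simplified: `SuspendingContext`, the `Suspend` exception, and the `run_to_completion` driver stand in for the SDK and the durable execution service, and the "timer" fires as soon as the driver sees the suspension.

```python
from typing import Any, Callable, Dict, Tuple


class Suspend(Exception):
    """Raised by the toy context to end the current invocation (hypothetical)."""


class SuspendingContext:
    def __init__(self, state: Dict[int, Any]) -> None:
        self.state = state  # persisted between invocations
        self._op = 0

    def _next_op(self) -> int:
        op, self._op = self._op, self._op + 1
        return op

    def step(self, func: Callable[[], Any]) -> Any:
        op = self._next_op()
        if op not in self.state:
            self.state[op] = func()  # execute and checkpoint
        return self.state[op]        # fresh or replayed result

    def wait(self) -> None:
        op = self._next_op()
        if self.state.get(op) != "elapsed":
            self.state[op] = "pending"
            raise Suspend(op)        # release the environment while waiting


def run_to_completion(handler: Callable, state: Dict[int, Any]) -> Tuple[Any, int]:
    """Toy driver: re-invoke the handler until it returns a result."""
    invocations = 0
    while True:
        invocations += 1
        try:
            return handler(SuspendingContext(state)), invocations
        except Suspend as pause:
            state[pause.args[0]] = "elapsed"  # pretend the timer fired


log = []


def charge() -> str:
    log.append("charge")
    return "txn_123"


def fulfill() -> str:
    log.append("fulfill")
    return "TRK123456"


def handler(ctx: SuspendingContext) -> dict:
    transaction_id = ctx.step(charge)
    ctx.wait()
    tracking_number = ctx.step(fulfill)
    return {"transaction_id": transaction_id, "tracking_number": tracking_number}


result, invocations = run_to_completion(handler, {})
```

The handler is invoked twice: the first invocation runs `charge` and suspends at the wait, and the second replays `charge` from saved state, passes the wait, and runs `fulfill`. Each function body executes once.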
- DurableContext - Main interface for durable operations, provided by Lambda
- ExecutionState - Manages checkpoints and tracks operation results
- Operation handlers - Execute steps, waits, callbacks, and other operations
- Checkpoint batching - Groups multiple checkpoints into efficient API calls
- SerDes system - Serializes and deserializes operation inputs and results
The SDK uses a background thread to batch checkpoints for efficiency. Critical operations (like step starts with at-most-once semantics) block until the checkpoint is confirmed. Non-critical operations (like observability checkpoints) are asynchronous for better performance.
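The difference between blocking and asynchronous checkpoints can be sketched with a queue and a worker thread. This is a minimal illustration under stated assumptions, not the SDK's batching code: `CheckpointBatcher`, `submit`, `critical`, and `max_batch` are hypothetical names, `send_batch` stands in for the API call, and error handling is omitted.

```python
import queue
import threading
from typing import Callable, List


class CheckpointBatcher:
    """Toy batcher (hypothetical): a worker thread drains a queue in batches."""

    def __init__(self, send_batch: Callable[[List[str]], None], max_batch: int = 10) -> None:
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, checkpoint: str, critical: bool = False) -> None:
        done = threading.Event()
        self._queue.put((checkpoint, done))
        if critical:
            done.wait()  # block the caller until the checkpoint has been sent

    def close(self) -> None:
        self._queue.put(None)  # sentinel: stop once earlier items are sent
        self._worker.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()  # block until something arrives
            if item is None:
                return
            batch = [item]
            stop = False
            # Opportunistically add whatever else is already queued.
            while len(batch) < self._max_batch:
                try:
                    item = self._queue.get_nowait()
                except queue.Empty:
                    break
                if item is None:
                    stop = True
                    break
                batch.append(item)
            self._send_batch([name for name, _ in batch])  # one simulated API call
            for _, done in batch:
                done.set()  # release any caller waiting on this checkpoint
            if stop:
                return


sent_batches = []
batcher = CheckpointBatcher(sent_batches.append)
batcher.submit("log-1")                      # non-critical: returns immediately
batcher.submit("step-start", critical=True)  # critical: returns only after sending
confirmed = any("step-start" in batch for batch in sent_batches)
batcher.close()
sent_in_order = [name for batch in sent_batches for name in batch]
```

How the two checkpoints are grouped into batches depends on thread timing, but their order is preserved, and `confirmed` is true by the time the critical `submit` call returns.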
See architecture diagrams for class diagrams and concurrency flows.
The SDK helps you build:
Order processing workflows - Validate orders, charge payments, and fulfill shipments with automatic retry on failures.
Approval workflows - Wait for human approvals or external system responses using callbacks.
Data processing pipelines - Process large datasets in parallel with map operations and failure tolerance.
Multi-step integrations - Coordinate calls to multiple services with proper error handling and state management.
Long-running tasks - Execute workflows that take minutes or hours without blocking Lambda resources.
Saga patterns - Implement distributed transactions with compensation logic for failures.
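For the saga use case, the compensation logic can be sketched in plain Python. The helper below is hypothetical and does not use the SDK; in a durable workflow, each action and each compensation would presumably run as its own step so that progress survives interruptions.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]


def run_saga(steps: List[Tuple[Action, Action]]) -> bool:
    """Run (action, compensation) pairs; on failure, undo completed actions in reverse."""
    completed: List[Action] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back everything that already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


events = []


def reserve_stock() -> None:
    events.append("reserve stock")


def release_stock() -> None:
    events.append("release stock")


def charge_card() -> None:
    events.append("charge card")


def refund_card() -> None:
    events.append("refund card")


def ship_order() -> None:
    raise RuntimeError("carrier unavailable")


def cancel_shipment() -> None:
    events.append("cancel shipment")


succeeded = run_saga([
    (reserve_stock, release_stock),
    (charge_card, refund_card),
    (ship_order, cancel_shipment),
])
```

Because shipping fails, the saga refunds the card and then releases the stock. `cancel_shipment` never runs, since its action did not complete.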
Documentation - You're reading it! Use the navigation above to find specific topics.
Examples - Check the examples/ directory in the repository for working code samples.
Issues - Report bugs or request features on the GitHub repository.
Contributing - See CONTRIBUTING.md for guidelines on contributing to the project.
See the LICENSE file for our project's licensing.