133 changes: 39 additions & 94 deletions references/core/ai-integration.md
@@ -4,7 +4,9 @@

Temporal provides durable execution for AI/LLM applications, handling retries, rate limits, and long-running operations automatically. These patterns apply across languages, with Python being the most mature for AI integration.

For Python-specific implementation details and code examples, see `references/python/ai-patterns.md`.
For Python-specific implementation details and code examples, see `references/python/ai-patterns.md`. Temporal's Python SDK also provides pre-built integrations with several LLM and agent SDKs, which can be leveraged to create agentic workflows with minimal effort (when working in Python).

The remainder of this document describes general principles to follow when building AI/LLM applications in Temporal, particularly when building from scratch rather than with a pre-built integration.

## Why Temporal for AI?

@@ -19,28 +21,24 @@ For Python-specific implementation details and code examples, see `references/py

## Core Patterns

### Pattern 1: Generic LLM Activity

Create flexible, reusable activities for LLM calls:
### Pattern 1: Activities Should Wrap LLM Calls

```
Activity: call_llm_generic(
model: string,
system_instructions: string,
user_input: string,
tools?: list,
response_format?: schema
) -> response
```
- Activity: `call_llm`
  - Inputs:
    - `model_id`: the activity can route internally to different models, so you don't need one activity per model.
    - prompt / chat history
    - tools
    - etc.
  - Returns: the model response, as typed structured output

**Benefits**:
- Single activity handles multiple use cases
- Consistent retry handling
- Centralized configuration

### Pattern 2: Activity-Based Separation
### Pattern 2: Non-deterministic / heavy tools in Activities

Isolate each operation in its own activity:
Tools that are non-deterministic and/or perform heavy operations (file system access, API calls, etc.) should run in activities:

```
Workflow:
@@ -55,55 +53,32 @@ Workflow:
- Easier testing and mocking
- Failure isolation

### Pattern 3: Centralized Retry Management
### Pattern 3: Tools that Mutate Agent State can be in the Workflow directly

**Critical**: Disable retries in LLM client libraries and let Temporal handle them.
Generally, agent state maps one-to-one onto workflow state. Tools that mutate agent state and are deterministic (like TODO tools, which just update a hash map) therefore typically belong in workflow code rather than in an activity.

```
LLM Client Config:
max_retries = 0 ← Disable client retries

Activity Retry Policy:
initial_interval = 1s
backoff_coefficient = 2.0
maximum_attempts = 5
maximum_interval = 60s
Workflow:
├── Activity: call_llm (tool selection: todos_write tool)
├── Write new TODOs to workflow state (not in activity)
└── Activity: call_llm (continuing agent flow...)
```
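
As a rough sketch of the distinction, in plain Python rather than SDK code: a deterministic tool only touches in-memory state, so it can run inline. The names `AgentState`, `todos_write`, `WORKFLOW_TOOLS`, and `runs_in_workflow` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Hypothetical agent state, held as plain workflow state."""

    todos: dict[str, str] = field(default_factory=dict)  # todo id -> status


def todos_write(state: AgentState, updates: dict[str, str]) -> str:
    """Deterministic tool: mutates in-memory state only (no I/O, clock, or randomness)."""
    state.todos.update(updates)
    return f"{len(state.todos)} todos tracked"


# Tools assumed safe to run inline in workflow code; anything else goes to an activity.
WORKFLOW_TOOLS = {"todos_write": todos_write}


def runs_in_workflow(tool_name: str) -> bool:
    """Decide where a tool selected by the LLM should execute."""
    return tool_name in WORKFLOW_TOOLS


state = AgentState()
message = todos_write(state, {"1": "pending", "2": "done"})
```

Because the tool's result depends only on its inputs and the current state, replaying the workflow would rebuild the same `todos` map without recording an extra activity in history.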

### Pattern 4: Centralized Retry Management

Disable retries in LLM client libraries and let Temporal handle them.

- LLM Client Config:
  - `max_retries = 0` ← disable retries at the LLM client level

Use either the default activity retry policy, or customize it as needed for the situation.

**Why**:
- Temporal retries are durable (survive crashes)
- Single retry configuration point
- Better visibility into retry attempts
- Consistent backoff behavior
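
To make the backoff behavior concrete, here is a sketch of how exponential-backoff delays grow. It is plain Python arithmetic, not SDK code: the `backoff_delays` helper is hypothetical, and the parameter values (1s initial interval, coefficient 2.0, 60s cap, 5 attempts) are illustrative only.

```python
def backoff_delays(
    initial: float, coefficient: float, maximum: float, attempts: int
) -> list[float]:
    """Return the wait (in seconds) before each retry.

    `attempts` counts the first try as well, so N attempts means N - 1 waits.
    """
    delays = []
    delay = initial
    for _ in range(attempts - 1):
        delays.append(min(delay, maximum))  # cap each wait at the maximum interval
        delay *= coefficient
    return delays


delays = backoff_delays(initial=1.0, coefficient=2.0, maximum=60.0, attempts=5)
# delays == [1.0, 2.0, 4.0, 8.0]
```

If the LLM client also retried internally, each of these attempts could hide several more requests, which is one reason to keep a single retry layer.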

### Pattern 4: Tool-Calling Agent

Three-phase workflow for LLM agents with tools:

```
┌─────────────────────────────────────────────┐
│ Phase 1: Tool Selection │
│ Activity: Present tools to LLM │
│ LLM returns: tool_name, arguments │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Phase 2: Tool Execution │
│ Activity: Execute selected tool │
│ (Separate activity per tool type) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Phase 3: Result Interpretation │
│ Activity: Send results back to LLM │
│ LLM returns: final response or next tool │
└─────────────────────────────────────────────┘
Loop until LLM returns final answer
```

### Pattern 5: Multi-Agent Orchestration

@@ -127,21 +102,6 @@ Deep Research Example:

**Key Pattern**: Use parallel execution with `return_exceptions=True` to continue with partial results when some searches fail.
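
A minimal sketch of this pattern using plain `asyncio` (no Temporal SDK involved). `search` and `gather_partial` are hypothetical names, and the simulated failure is only for illustration:

```python
import asyncio


async def search(query: str) -> str:
    """Hypothetical stand-in for a search activity; fails for one query."""
    await asyncio.sleep(0)  # yield control, simulating I/O
    if query == "bad":
        raise RuntimeError(f"search failed for {query!r}")
    return f"results for {query}"


async def gather_partial(queries: list[str]) -> tuple[list[str], list[BaseException]]:
    """Run all searches concurrently and separate successes from failures."""
    outcomes = await asyncio.gather(
        *(search(q) for q in queries),
        return_exceptions=True,  # exceptions are returned as values, not raised
    )
    results = [o for o in outcomes if not isinstance(o, BaseException)]
    errors = [o for o in outcomes if isinstance(o, BaseException)]
    return results, errors


results, errors = asyncio.run(gather_partial(["temporal", "bad", "llm"]))
```

In a workflow the same shape would apply with activity calls in place of `search`. Whether to proceed with partial results, and how many failures are tolerable, remains an application decision.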

### Pattern 6: Structured Outputs

Define schemas for LLM responses:

```
Input: Raw LLM prompt
Schema: { action: string, confidence: float, reasoning: string }
Output: Validated, typed response
```

**Benefits**:
- Type safety
- Automatic validation
- Easier downstream processing

## Timeout Recommendations

| Operation Type | Recommended Timeout |
@@ -165,27 +125,14 @@ Output: Validated, typed response

Parse rate limit info from API responses:

```
Response Headers:
Retry-After: 30
X-RateLimit-Remaining: 0

Activity:
If rate limited:
Raise retryable error with retry_after hint
Temporal handles the delay
```

### Retry Policy Configuration
- Response Headers:
  - `Retry-After: 30`
  - `X-RateLimit-Remaining: 0`

```
Retry Policy:
initial_interval: 1s (or from Retry-After header)
backoff_coefficient: 2.0
maximum_interval: 60s
maximum_attempts: 10
non_retryable_errors: [InvalidAPIKey, InvalidInput]
```
- Activity:
  - If rate limited:
    - Raise a retryable error with a `retry_after` hint
    - Temporal handles the delay
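
The header handling above could be sketched as follows. This is plain Python under stated assumptions: `RateLimitedError`, `parse_retry_after`, and `check_rate_limit` are hypothetical names, rate limiting is assumed to be signalled by HTTP 429, and `Retry-After` is assumed to be given in seconds.

```python
from typing import Mapping


class RateLimitedError(Exception):
    """Hypothetical retryable error carrying the server's suggested delay."""

    def __init__(self, retry_after_seconds: float) -> None:
        super().__init__(f"rate limited; retry after {retry_after_seconds}s")
        self.retry_after_seconds = retry_after_seconds


def parse_retry_after(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds, falling back to `default`."""
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return float(lowered["retry-after"])
    except (KeyError, ValueError):
        # Header missing, or given as an HTTP date (not handled in this sketch).
        return default


def check_rate_limit(status: int, headers: Mapping[str, str]) -> None:
    """Raise RateLimitedError when the response is an HTTP 429."""
    if status == 429:
        raise RateLimitedError(parse_retry_after(headers))
```

Inside an activity, an error like this would be translated into the SDK's own retryable failure type. How a retry-delay hint is passed along differs per SDK, so check the documentation for the language you are working in.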

## Error Handling

@@ -209,15 +156,13 @@ Retry Policy:
4. **Use structured outputs** - For type safety and validation
5. **Handle partial failures** - Continue with available results
6. **Monitor costs** - Track LLM calls at activity level
7. **Version prompts** - Track prompt changes in code
8. **Test with mocks** - Mock LLM responses in tests
7. **Test with mocks** - Mock LLM responses in tests

## Observability

- **Activity duration**: Track LLM latency
- **Retry counts**: Monitor rate limiting
- **Token usage**: Log in activity output
- **Cost attribution**: Tag workflows with cost centers
See `references/python/observability.md` (or the equivalent for the language you are working in) for documentation on observability in Temporal. It is generally recommended to add observability for:
- Token usage, via activity logging
- Anything else that helps track LLM usage and debug agentic flows, in moderation.

## Language-Specific Resources

91 changes: 28 additions & 63 deletions references/core/common-gotchas.md
@@ -2,9 +2,7 @@

Common mistakes and anti-patterns in Temporal development. Learning from these saves significant debugging time.

## Idempotency Issues

### Non-Idempotent Activities
## Non-Idempotent Activities

**The Problem**: Activities may execute more than once due to retries or Worker failures. If an activity calls an external service without an idempotency key, you may charge a customer twice, send duplicate emails, or create duplicate records.

@@ -14,38 +12,30 @@ Common mistakes and anti-patterns in Temporal development. Learning from these s

**The Fix**: Always use idempotency keys when calling external services. Use the workflow ID, activity ID, or a domain-specific identifier (like order ID) as the key.
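
A small sketch of the idea, in plain Python. `idempotency_key` and `FakePaymentService` are hypothetical, and the sketch assumes the external service deduplicates on the key it receives:

```python
import hashlib


def idempotency_key(workflow_id: str, activity_id: str) -> str:
    """Derive a stable key: the same workflow/activity pair always yields the same key."""
    raw = f"{workflow_id}:{activity_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakePaymentService:
    """Hypothetical external service that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self._charges: dict[str, int] = {}

    def charge(self, key: str, amount_cents: int) -> int:
        # A repeated key returns the original charge instead of charging again.
        return self._charges.setdefault(key, amount_cents)

    @property
    def charge_count(self) -> int:
        return len(self._charges)


service = FakePaymentService()
key = idempotency_key("order-123", "charge-card-1")
service.charge(key, 5000)  # first attempt
service.charge(key, 5000)  # retry of the same activity: no second charge
```

The key must be derived from values that stay the same across retries. A freshly generated random value inside the activity would defeat the purpose.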

### Local Activities

Local Activities skip the task queue for lower latency, but they're still subject to retries. The same idempotency rules apply.

## Replay Safety Violations
**Note:** Local Activities skip the task queue for lower latency, but they're still subject to retries. The same idempotency rules apply.

### Side Effects in Workflow Code
## Side Effects & Non-Determinism in Workflow Code

**The Problem**: Code in workflow functions runs on first execution AND on every replay. Any side effect (logging, notifications, metrics) will happen multiple times.
**The Problem**: Code in workflow functions runs on first execution AND on every replay. Any side effect (logging, notifications, metrics, etc.) will happen multiple times, and non-deterministic code (IO, current time, random numbers, threading, etc.) will not replay correctly.

**Symptoms**:
- Non-determinism errors
- Sandbox violations, depending on SDK language
- Duplicate log entries
- Multiple notifications for the same event
- Inflated metrics

**The Fix**:
- Use the SDK's replay-aware logger (only logs on first execution)
- Put all side effects in Activities

### Non-Deterministic Time
- Use Temporal's replay-aware, managed equivalents for common, non-business-logic cases:
  - Temporal workflow logging
  - Temporal date time (`workflow.now()` in Python, `Date.now()` is auto-replaced in TypeScript)
  - Temporal UUID generation
  - Temporal random number generation
- Put all other side effects in Activities

**The Problem**: Using system time (`datetime.now()`, `Date.now()`) in workflow code returns different values on replay, causing non-determinism errors.
See `references/core/determinism.md` for more info.
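
To illustrate why a replay-aware random source is safe where the global one is not, the sketch below uses plain Python. It is a conceptual illustration of a seeded PRNG, not a description of how any Temporal SDK implements it; `run_workflow_logic` and the seed value are hypothetical.

```python
import random


def run_workflow_logic(rng: random.Random) -> list[int]:
    """Hypothetical workflow logic that makes three random choices."""
    return [rng.randint(1, 100) for _ in range(3)]


seed = 42  # illustrative; assumed to be fixed for the lifetime of one workflow
first_run = run_workflow_logic(random.Random(seed))
replay = run_workflow_logic(random.Random(seed))  # same seed, same sequence
```

Because both runs start from the same seed, `replay` reproduces exactly the choices made in `first_run`. An unseeded call to the global `random` module would typically give different values on each run and so diverge from the recorded history.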

**Symptoms**:
- Non-determinism errors mentioning time-based decisions
- Workflows that worked once but fail on replay

**The Fix**: Use the SDK's deterministic time function (`workflow.now()` in Python, `Date.now()` is auto-replaced in TypeScript).

## Worker Management Issues

### Multiple Workers with Different Code
## Multiple Workers with Different Code

**The Problem**: If Worker A runs part of a workflow with code v1, then Worker B (with code v2) picks it up, replay may produce different Commands.

@@ -55,70 +45,45 @@ Local Activities skip the task queue for lower latency, but they're still subjec

**The Fix**:
- Use Worker Versioning for production deployments
- Use patching APIs
- During development: kill old workers before starting new ones
- Ensure all workers run identical code

### Stale Workflows During Development

**The Problem**: Workflows started with old code continue running after you change the code.
**Note:** Workflows started with old code continue running after you change the code, which can then induce the above issues. During development (NOT production), you may want to terminate stale workflows (`temporal workflow terminate --workflow-id <id>`), or use `find-stalled-workflows.sh` included in this skill to detect stuck workflows.

**Symptoms**:
- Workflows behave unexpectedly after code changes
- Non-determinism errors on previously-working workflows

**The Fix**:
- Terminate stale workflows: `temporal workflow terminate --workflow-id <id>`
- Use `find-stalled-workflows.sh` to detect stuck workflows
- In production, use versioning for backward compatibility

## Workflow Design Anti-Patterns

### The Mega Workflow

**The Problem**: Putting too much logic in a single workflow.

**Issues**:
- Hard to test and maintain
- Event history grows unbounded
- Single point of failure
- Difficult to reason about

**The Fix**:
- Keep workflows focused on a single responsibility
- Use Child Workflows for sub-processes
- Use Continue-as-New for long-running workflows
See `references/core/versioning.md` for more info.

### Failing Too Quickly
## Failing Activities Too Quickly

**The Problem**: Using aggressive retry policies that give up too easily.
**The Problem**: Using aggressive activity retry policies that give up too easily.

**Symptoms**:
- Workflows failing on transient errors
- Unnecessary workflow failures during brief outages

**The Fix**: Use appropriate retry policies. Let Temporal handle transient failures with exponential backoff. Reserve `maximum_attempts=1` for truly non-retryable operations.
**The Fix**: Use appropriate activity retry policies. Let Temporal handle transient failures with exponential backoff. Reserve `maximum_attempts=1` for truly non-retryable operations.

## Query Handler Mistakes
## Query Handler & Update Validator Mistakes

### Modifying State in Queries
### Modifying State in Queries & Update Validators

**The Problem**: Queries are read-only. Modifying state in a query handler causes non-determinism on replay because queries don't generate history events.
**The Problem**: Queries and update validators are read-only. Modifying state in them causes non-determinism on replay and must be strictly avoided.

**Symptoms**:
- State inconsistencies after workflow replay
- Non-determinism errors

**The Fix**: Queries must only read state. Use Updates for operations that need to modify state AND return a result.
**The Fix**: Queries and update validators must only read state. Use Updates for operations that need to modify state AND return a result.

### Blocking in Queries
### Blocking in Queries & Update Validators

**The Problem**: Queries must return immediately. They cannot await activities, child workflows, timers, or conditions.
**The Problem**: Queries and update validators must return immediately. They cannot await activities, child workflows, timers, or conditions.

**Symptoms**:
- Query timeouts
- Query / update validator timeouts
- Deadlocks

**The Fix**: Queries return current state only. Use Signals or Updates to trigger async operations.
**The Fix**: Queries and update validators must only look at current state. Use Signals or Updates to trigger async operations.

### Query vs Signal vs Update

22 changes: 20 additions & 2 deletions references/core/determinism.md
@@ -8,7 +8,7 @@ Temporal workflows must be deterministic because of **history replay** - the mec

### The Replay Mechanism

When a Worker needs to restore workflow state (after crash, cache eviction, or continuing after a long timer), it **re-executes the workflow code from the beginning**. But instead of re-running activities, it uses results stored in the Event History.
When a Worker needs to restore workflow state (after crash, cache eviction, or continuing after a long timer), it **re-executes the workflow code from the beginning**. But instead of re-running external actions, it uses results stored in the Event History.

```
Initial Execution:
@@ -22,7 +22,7 @@ Replay (Recovery):

### Commands and Events

Every workflow operation generates a Command that becomes an Event:
Every workflow operation generates a Command that becomes an Event. Here are some examples:

| Workflow Code | Command Generated | Event Stored |
|--------------|-------------------|--------------|
@@ -95,6 +95,24 @@ Math.random() // Returns seeded PRNG value
new Date() // Deterministic
```

### Go `workflowcheck` Static Analyzer

The Go SDK provides a `workflowcheck` CLI tool that:
- Statically analyzes registered Workflow Definitions and their call graph
- Flags common sources of non-determinism (e.g., `time.Now`, `time.Sleep`, goroutines, channels, map iteration, global `math/rand`, stdio)
- Helps catch invalid constructs early in development, but cannot detect all issues (e.g., global variable mutation, some uses of reflection)

```bash
# Install
go install go.temporal.io/sdk/contrib/tools/workflowcheck@latest

# Run from your module root to scan all packages
workflowcheck ./...

# Optional: configure overrides / skips in workflowcheck.config.yaml
# (e.g., mark a function as deterministic or skip files)
workflowcheck -config workflowcheck.config.yaml ./...
```

## Detecting Non-Determinism

### During Execution