Add cloud auth and observability guidance #4351
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -22,6 +22,13 @@ Temporal Cloud supports two secure authentication methods for Workers: | |||||
|
|
||||||
| Both options help secure communication between workers and Temporal Cloud. Choosing the right method and managing it properly is key to maintaining security and minimizing downtime. | ||||||
|
|
||||||
| Use this page to define your operating model for machine access to Temporal Cloud. For setup steps and product-specific | ||||||
| mechanics, see [Manage API keys](/cloud/api-keys) and [Manage service accounts](/cloud/service-accounts). | ||||||
|
|
||||||
| Related guidance: | ||||||
| - [Namespace best practices](/best-practices/managing-namespace) | ||||||
| - [Multi-tenant application patterns](/production-deployment/multi-tenant-patterns) | ||||||
|
|
||||||
| The high-level end-to-end rotation process is: | ||||||
|
|
||||||
| 1. **Generate new credentials**: Create new certificates or API keys in Temporal Cloud before the current ones expire | ||||||
|
|
@@ -45,17 +52,64 @@ In the case that you are using multiple certificates signed by the same CA, and | |||||
|
|
||||||
| One convention is to give certificates a common name that matches the namespace. If you do this when using the same CA for dev and prod, then you can leverage Certificate Filters to prevent access to production environments. This is described in detail under the [authorization section](https://docs.temporal.io/cloud/certificates#control-authorization) of the documentation. | ||||||
|
|
||||||
| ## Best practices: | ||||||
| #### 1. Establish clear guidelines on authentication methods: Teams should standardize on either [mTLS certificates](https://docs.temporal.io/cloud/certificates) or [API keys](https://docs.temporal.io/cloud/api-keys) for the following operations: | ||||||
| ## Best practices | ||||||
|
|
||||||
| ### Establish clear guidelines on authentication methods | ||||||
|
|
||||||
| Teams should standardize on either [mTLS certificates](https://docs.temporal.io/cloud/certificates) or | ||||||
| [API keys](https://docs.temporal.io/cloud/api-keys) for the following operations: | ||||||
| - Connect Temporal clients to Temporal Cloud (e.g. Worker processes) | ||||||
| - Automation (e.g. Temporal Cloud [Operations API](https://docs.temporal.io/ops), [Terraform provider](https://docs.temporal.io/cloud/terraform-provider), [Temporal CLI](https://docs.temporal.io/cli/setup-cli)) | ||||||
|
|
||||||
| By default, it is recommended for teams to use API keys and [service accounts](https://docs.temporal.io/cloud/service-accounts) for both operations because API keys are easier to manage and rotate for most teams. In addition, you can control account-level and namespace-level roles for service accounts. | ||||||
| By default, teams should use API keys with [service accounts](/cloud/service-accounts) for both operations. API keys | ||||||
| are generally easier to set up and rotate than mTLS certificates, and service accounts let you assign account-level and | ||||||
| namespace-level roles. | ||||||
|
|
||||||
| If your organization requires mutual authentication and stronger cryptographic guarantees, use | ||||||
| [mTLS certificates](/cloud/certificates) to authenticate Temporal clients to Temporal Cloud and use API keys for | ||||||
| automation, because the Temporal Cloud [Operations API](/ops) and [Terraform provider](/cloud/terraform-provider) only | ||||||
| support API key authentication. Unlike API keys tied to users or service accounts, mTLS certificate authentication is | ||||||
| not tied to Temporal Cloud RBAC identities. Namespace access is based on CA trust, with optional | ||||||
| [Certificate Filters](/cloud/certificates#manage-certificate-filters) to narrow access by Common Name. | ||||||
|
|
||||||
| ### Default operating model for service accounts and API keys | ||||||
|
|
||||||
| For most organizations, use the following defaults: | ||||||
|
|
||||||
| - Create one Service Account per service or worker deployment, not one shared Service Account for an entire team | ||||||
| - Use account-level Service Accounts only when a service genuinely needs cross-Namespace or account-wide access | ||||||
| - Prefer Namespace-scoped Service Accounts when a service should only access one Namespace | ||||||
| - Grant Service Accounts namespace-level access only to the specific Namespaces they need | ||||||
|
|
||||||
| This approach gives you cleaner ownership, easier rotation, and better auditability than sharing a single machine | ||||||
| identity across multiple services. | ||||||
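The defaults above can be checked mechanically against an inventory of machine identities. The following is a minimal sketch, not a Temporal API: the `Binding` record, the `find_violations` helper, and the inventory format are all hypothetical, and the account-level rule is only a heuristic that flags entries for human review.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Binding(NamedTuple):
    """One row of a hypothetical inventory: which service uses which Service Account."""
    service: str
    service_account: str
    namespaces: Tuple[str, ...]   # Namespaces the service actually needs
    account_level: bool           # True if the Service Account is account-level


def find_violations(bindings: List[Binding]) -> List[str]:
    """Flag Service Accounts that deviate from the defaults listed above."""
    findings: List[str] = []
    services_by_account: Dict[str, Set[str]] = defaultdict(set)

    for b in bindings:
        services_by_account[b.service_account].add(b.service)
        # Heuristic: account-level access is suspicious when at most one Namespace is needed.
        if b.account_level and len(b.namespaces) <= 1:
            findings.append(
                f"{b.service_account}: account-level access but uses "
                f"{len(b.namespaces)} Namespace(s)"
            )

    # One Service Account per service: flag accounts used by more than one service.
    for account, services in sorted(services_by_account.items()):
        if len(services) > 1:
            findings.append(f"{account}: shared by {', '.join(sorted(services))}")

    return findings
```

A check like this could run in CI against whatever source of truth you keep for Service Accounts; the findings are prompts for review, not hard failures.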
|
|
||||||
| ### Use access boundaries that match your Namespace boundaries | ||||||
|
|
||||||
| The way you partition Namespaces should usually match the way you partition machine identities. | ||||||
|
|
||||||
| - If multiple services share a Namespace, you may still want one Service Account per service so that each deployment can | ||||||
| rotate credentials independently. | ||||||
|
|
||||||
| - If you split workloads into separate Namespaces for security, capacity, or team ownership reasons, those Namespaces | ||||||
|
|
||||||
| should usually have separate Service Accounts and API keys as well. | ||||||
|
|
||||||
| - If you use Namespace-per-tenant isolation, expect your credential model and RBAC model to become correspondingly more | ||||||
|
|
||||||
| granular. | ||||||
|
|
||||||
|
|
||||||
| For more on topology tradeoffs, see [Namespace best practices](/best-practices/managing-namespace) and | ||||||
| [Multi-tenant application patterns](/production-deployment/multi-tenant-patterns). | ||||||
|
|
||||||
| ### Rotate credentials without downtime | ||||||
|
|
||||||
| Use the following sequence when rotating credentials: | ||||||
|
|
||||||
| If your organization requires mutual authentication and stronger cryptographic guarantees, then it is encouraged for your teams to use mTLS certificates to authenticate Temporal clients to Temporal Cloud and use API keys for automation (because Temporal Cloud [Operations API](https://docs.temporal.io/ops) and [Terraform provider](https://docs.temporal.io/cloud/terraform-provider) only supports API key for authentication) | ||||||
| 1. Create the replacement credential before the existing one expires. | ||||||
| 2. For API keys, create the new key while the old key is still valid, then roll your Workers and clients over to the new key. | ||||||
| 3. For client certificates, stage the new certificate before removing the old one when your deployment process supports that transition. | ||||||
| 4. Validate connectivity and normal Workflow execution using the new credential. | ||||||
| 5. Remove the old credential only after all clients and Workers have switched. | ||||||
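The sequence above can be sketched as a small orchestration function. This is an illustration only: `RotationPlan`, `rotate`, and the four hooks are hypothetical names, and each hook is assumed to wrap your own tooling (for example, a secrets manager update and a rolling deployment). None of them are Temporal APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RotationPlan:
    """Hypothetical hooks; each callable wraps your own tooling."""
    create_new: Callable[[], str]         # steps 1-3: create or stage the replacement, return its ID
    roll_clients: Callable[[str], None]   # steps 2-3: deploy the replacement to Workers and clients
    validate: Callable[[str], bool]       # step 4: check connectivity and Workflow execution
    remove_old: Callable[[], None]        # step 5: delete the old credential
    log: List[str] = field(default_factory=list)


def rotate(plan: RotationPlan) -> bool:
    """Run the rotation in order; keep the old credential if validation fails."""
    new_id = plan.create_new()
    plan.log.append(f"created {new_id}")

    plan.roll_clients(new_id)
    plan.log.append("rolled clients")

    if not plan.validate(new_id):
        # The old credential is still valid, so clients can be rolled back.
        plan.log.append("validation failed; old credential kept")
        return False

    plan.remove_old()
    plan.log.append("removed old credential")
    return True
```

The key design choice is that `remove_old` is only reachable after `validate` succeeds, so a failed rollout never leaves Workers without a working credential.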
|
|
||||||
| #### 2. Use Certificate Filters to restrict access when using shared CAs (e.g., `dev` vs `prod`): | ||||||
| ### Use Certificate Filters to restrict access when using shared CAs (e.g., `dev` vs `prod`) | ||||||
|
|
||||||
| Certificate Filters are an additional way of validating using the client certificate presented during client authentication. Give certificates a common name that matches the namespace. This is not a requirement. | ||||||
| Certificate Filters are an additional way of validating using the client certificate presented during client authentication. Give certificates a common name that matches the namespace. This is not a requirement. | ||||||
|
|
||||||
| If you do this when using the same CA for dev and prod environments, then you can leverage Certificate Filters to prevent access to production. | ||||||
| If you do this when using the same CA for dev and prod environments, then you can leverage Certificate Filters to prevent access to production. | ||||||
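To illustrate the convention, the sketch below models how a Common Name filter separates `dev` from `prod` when both share a CA. It is a simplified model under stated assumptions, not Temporal Cloud's implementation: `is_allowed` and the `filters` mapping are hypothetical, CA trust is assumed to have been verified already, and matching is assumed to be an exact string comparison.

```python
from typing import Dict, List


def is_allowed(common_name: str, namespace: str, filters: Dict[str, List[str]]) -> bool:
    """Return True if the certificate's Common Name passes the Namespace's filters.

    Assumes the certificate was already validated against the Namespace's trusted CA.
    A Namespace with no filters accepts any certificate signed by that CA.
    """
    allowed = filters.get(namespace, [])
    if not allowed:
        return True
    return common_name in allowed


# Hypothetical setup: dev and prod share one CA, and only prod has a filter.
filters = {"payments-prod": ["payments-prod"]}

print(is_allowed("payments-prod", "payments-prod", filters))  # prod cert to prod Namespace
print(is_allowed("payments-dev", "payments-prod", filters))   # dev cert to prod Namespace
```

In this model the dev certificate is rejected by the prod Namespace even though the shared CA signed it, which is the outcome the naming convention is meant to achieve.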
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -33,6 +33,17 @@ When used together, Cloud and SDK metrics measure the health and performance of | |
|
|
||
| Cloud metrics for all Namespaces in your account are available from the [OpenMetrics endpoint](/cloud/metrics/openmetrics), a Prometheus-compatible scrapable endpoint at `metrics.temporal.io`. | ||
|
|
||
| Use the following rule of thumb when deciding which signal to rely on: | ||
|
Contributor
Author
@dustin-temporal please review
||
|
|
||
| | Question | Primary signal | | ||
|
Contributor
This reads like it's for an LLM to reference, but if that's what we're going for then I'm good with it.
||
| |---|---| | ||
| | Is Temporal Cloud accepting and serving work normally? | Cloud metrics | | ||
| | Are Tasks backing up in a Task Queue? | Cloud metrics plus SDK Schedule-To-Start metrics | | ||
| | Are my Workers saturated, under-provisioned, or misconfigured? | SDK metrics | | ||
| | Is my application logic, downstream dependency, or Activity behavior unhealthy? | SDK metrics and traces | | ||
|
|
||
| For a Worker-focused view of how to combine these signals, see [Monitor worker health](/cloud/worker-health). | ||
|
|
||
| - [OpenMetrics overview](/cloud/metrics/openmetrics) - Getting started and key concepts | ||
| - [Metrics integrations](/cloud/metrics/openmetrics/metrics-integrations) - Datadog, Grafana Cloud, New Relic, ClickStack, and more | ||
| - [API reference](/cloud/metrics/openmetrics/api-reference) - Endpoint specification and advanced configuration | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,6 +34,14 @@ This document is for basic configuration only. For advanced concepts such as lab | |
|
|
||
| Datadog provides a serverless integration with the OpenMetrics endpoint. This integration scrapes metrics, stores them in Datadog, and provides a default dashboard with some built-in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details. | ||
|
|
||
| For Datadog users, treat this integration as the Cloud-side half of your observability setup: | ||
|
Contributor
Author
@dustin-temporal please review
||
|
|
||
| - Use OpenMetrics in Datadog to monitor Temporal Cloud behavior such as Task Queue backlog, poll success, and rate limiting. | ||
| - Collect [SDK metrics](/cloud/metrics/sdk-metrics-setup) from your Workers separately to monitor saturation, Schedule-To-Start latency, slot availability, and sticky cache behavior. | ||
|
|
||
| If you only ingest Cloud metrics, you will miss many worker-side bottlenecks. For recommended Worker monitors, see | ||
| [Monitor worker health](/cloud/worker-health). | ||
|
|
||
| ### Grafana Cloud | ||
|
|
||
| Grafana provides a serverless integration with the OpenMetrics endpoint for Grafana Cloud. This integration will scrape metrics, store them in Grafana Cloud, and provides a default dashboard | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -39,6 +39,10 @@ This page is a guide to monitoring a Temporal Worker fleet and covers the follow | |
| - [How to detect misconfigured Workers](#detect-misconfigured-workers) | ||
| - [How to configure Sticky cache](#configure-sticky-cache) | ||
|
|
||
| This page assumes you are monitoring both Worker-side SDK metrics and Cloud-side metrics. Use SDK metrics to understand | ||
|
Contributor
Author
@dustin-temporal please review
||
| what your Workers are doing, and Cloud metrics to understand what Temporal Cloud is seeing at the Task Queue and service | ||
| level. For an overview of how these signals fit together, see [Temporal Cloud metrics](/cloud/metrics). | ||
|
|
||
| ## Minimal Observations {#minimal-observations} | ||
|
|
||
| These alerts should be configured and understood first to gain intelligence into your application health and behaviors. | ||
|
|
||
These are just line wrapping artifacts, shouldn't affect anything