[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions #77
kathiehuang wants to merge 35 commits into main from
Conversation
Force-pushed from b987e0d to 7010e35
…file reading and cpu stats calculation functions
…n OS-agnostic, create separate crates for Windows and Linux for reading raw CPU data
Force-pushed from 7010e35 to 5953d68
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6a55dc810
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 071d2f0488
tokio::spawn(async move {
    metrics_flusher.flush().await;
});
Serialize metrics flushes to prevent overlapping upload tasks
This change spawns a new flush() task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because Flusher::flush performs network I/O with retries, this can happen in production and causes unnecessary task/socket/memory pressure; the previous logic awaited each flush and guaranteed only one in-flight flush at a time.
The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure it's needed.
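To illustrate the trade-off discussed above, here is a std-only sketch (names like `flush` and the timings are illustrative, not the actual flusher code) contrasting a spawned flush per tick, which can overlap, with the previous serialized behavior that awaits each flush before the next tick:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Simulated flush: track how many are in flight at once.
fn flush(in_flight: &AtomicUsize, max_seen: &AtomicUsize) {
    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
    max_seen.fetch_max(now, Ordering::SeqCst);
    thread::sleep(Duration::from_millis(50)); // slow "network" flush
    in_flight.fetch_sub(1, Ordering::SeqCst);
}

fn main() {
    let in_flight = Arc::new(AtomicUsize::new(0));
    let max_seen = Arc::new(AtomicUsize::new(0));

    // Spawning per tick (the reviewed code): flushes can pile up when
    // a flush outlasts the tick interval.
    let mut handles = Vec::new();
    for _ in 0..5 {
        let (i, m) = (Arc::clone(&in_flight), Arc::clone(&max_seen));
        handles.push(thread::spawn(move || flush(&i, &m)));
        thread::sleep(Duration::from_millis(1)); // tick shorter than flush
    }
    for h in handles {
        h.join().unwrap();
    }
    assert!(max_seen.load(Ordering::SeqCst) > 1); // overlap observed

    // Serialized (the previous logic): wait for each flush to finish
    // before ticking again, so at most one flush is ever in flight.
    max_seen.store(0, Ordering::SeqCst);
    for _ in 0..3 {
        flush(&in_flight, &max_seen);
    }
    assert_eq!(max_seen.load(Ordering::SeqCst), 1);
}
```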
duncanista
left a comment
I'd suggest using features for OS-specific business logic.
I'd also suggest checking how ADP is doing agent checks in Rust; this sounds like an agent check for a very specific use case.
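For context, one common way to gate OS-specific logic at compile time is `cfg` attributes (Cargo features work similarly via `#[cfg(feature = "...")]`). The module and function names below are illustrative, not taken from the PR:

```rust
// Linux implementation: would read raw CPU data from cgroup files.
#[cfg(target_os = "linux")]
mod cpu_stats {
    pub fn read_raw_cpu() -> Option<u64> {
        // e.g. parse cpuacct.usage here; stubbed for illustration
        Some(0)
    }
}

// Windows implementation: stubbed until Windows support lands.
#[cfg(target_os = "windows")]
mod cpu_stats {
    pub fn read_raw_cpu() -> Option<u64> {
        None
    }
}

fn main() {
    // The caller compiles against exactly one implementation per target,
    // with no runtime OS checks.
    let _ = cpu_stats::read_raw_cpu();
}
```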
use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
Suggested change:
- const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
+ const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;
);

if let Err(e) = self.aggregator.insert_batch(vec![usage_metric]) {
    error!("Failed to insert CPU usage metric: {}", e);
In what situations would we see this error? Would we hit this repeatedly or can the aggregator recover from errors quickly? (Also applies to line 111)
insert_batch calls tx.send on an unbounded channel with infinite capacity. An error only happens if the receive half of the channel is closed or dropped, which means the aggregator service isn't working anymore and every subsequent call will also fail. In that case metrics would stop sending, with error logs on every attempted insert; it seems the only way to recover would be for the customer to stop and start their function app to restart the agent.
Error logging but continuing is what the lambda extension does
If we're worried about log spam, I could change this to return early on the CPU usage metric insert failure - this would halve the error logs
Or maybe a better solution would be to have collect_and_submit return a Result, and main.rs could set cpu_collector=None on error?
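That last idea could look roughly like this hedged sketch, where `collect_and_submit` returns a `Result` and the caller drops the collector after an unrecoverable aggregator error (all types and fields here are illustrative stand-ins, not the real crate API):

```rust
// Illustrative error type for a failed aggregator insert.
#[derive(Debug)]
struct InsertError;

struct CpuMetricsCollector {
    // Stand-in for "the receive half of the aggregator channel is alive".
    aggregator_alive: bool,
}

impl CpuMetricsCollector {
    // Returns Err once the aggregator can no longer accept metrics.
    fn collect_and_submit(&self) -> Result<(), InsertError> {
        if self.aggregator_alive {
            Ok(())
        } else {
            Err(InsertError)
        }
    }
}

fn main() {
    let mut cpu_collector = Some(CpuMetricsCollector { aggregator_alive: false });

    // In the collection loop: on error, disable the collector instead of
    // logging the same failure on every subsequent tick.
    if let Some(c) = &cpu_collector {
        if c.collect_and_submit().is_err() {
            cpu_collector = None;
        }
    }
    assert!(cpu_collector.is_none());
}
```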
So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?
    ("name", aas_metadata.get_site_name()),
];
for (name, value) in aas_tags {
    if value != "unknown" {
Might be good to use the actual constant UNKNOWN_VALUE, so people can click through to where this is coming from?
Ooh good point, but it looks like UNKNOWN_VALUE is a private constant so it's not accessible here
duncanpharvey
left a comment
Excellent work! I added a few suggestions to consider
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"
Are these dependencies needed in datadog-trace-agent?
Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!
let (metrics_flusher, aggregator_handle) = if needs_aggregator {
    debug!("Creating metrics flusher and aggregator");

    let (flusher, handle) =
I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).
Maybe a unit test as well to assert that all of these combinations are covered?
Good point! I added a comment in 60cdecf
It seems like it would be hard to write a meaningful unit test, since the aggregator/dogstatsd startup logic has side effects that make it hard to test in isolation. Maybe I could refactor the startup decision into a struct that describes what to start, separating the decision from the execution?
struct AgentConfig {
start_aggregator: bool,
start_dogstatsd: bool,
start_enhanced_metrics: bool,
}
fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
AgentConfig {
start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
start_dogstatsd: dd_use_dogstatsd,
start_enhanced_metrics: dd_enhanced_metrics,
}
}
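With that refactor, a unit test covering all four flag combinations (the concern raised above) becomes straightforward. A possible sketch, reusing the `resolve_agent_config` function from the proposal:

```rust
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        // The aggregator is needed if either dogstatsd or enhanced
        // metrics is enabled.
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

fn main() {
    // Exercise all four combinations of the two configuration flags.
    for (dogstatsd, enhanced) in [(false, false), (false, true), (true, false), (true, true)] {
        let c = resolve_agent_config(dogstatsd, enhanced);
        assert_eq!(c.start_aggregator, dogstatsd || enhanced);
        assert_eq!(c.start_dogstatsd, dogstatsd);
        assert_eq!(c.start_enhanced_metrics, enhanced);
    }
}
```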
    Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {
Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?
I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!
What does this PR do?
Adds CPU limit and usage enhanced metrics for Linux Azure Functions.
- Adds a `datadog-metrics-collector` crate that reads CPU metrics every 3 seconds and submits them to the Datadog backend every 10 seconds when `DD_ENHANCED_METRICS_ENABLED=true` (default on)
- Adds a `CpuMetricsCollector` struct and `CpuStatsReader` trait. Currently this only collects CPU metrics in Linux; CPU metrics in Windows will be completed in a future PR
- `azure.functions.enhanced.cpu.usage` - container-level CPU consumption rate in nanocores, sourced from `cpuacct.usage`
- `azure.functions.enhanced.cpu.limit` - CPU limit in nanocores, computed as `min(cpuset.cpus, cfs_quota/cfs_period)`, falling back to host CPU count if no cgroup limit is set

Additional Notes
- Adds the following tags from `libdd-common`: `resource_id`, `resource_group`, `subscription_id`, `name`, `region`, `plan_tier`, `service`, `env`, `version`, `serverless_compat_version`
- Registers `azure.functions.*` metrics as `ServerlessEnhanced` origin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in Datadog Metrics Summary
- Splits `start_dogstatsd` into two functions:
  - `start_aggregator`, which starts the aggregator service and metrics flusher
  - `start_dogstatsd_listener`, which enables custom metrics to be received from user code and is skipped when `DD_USE_DOGSTATSD` is off

Motivation
https://datadoghq.atlassian.net/browse/SVLS-8351
Describe how to test/QA your changes
Build with serverless-compat-self-monitoring.
Added debug logs to verify calculations:
- …`cpu.cfs_quota_us`, so it falls back to the host CPU count from the `num_cpus` crate

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All hosting plans in Linux were tested to verify metrics submit correctly, and a Windows function was tested to ensure the lack of current Windows support is logged.
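For reference, the limit calculation described in this PR (minimum of `cpuset.cpus` and `cfs_quota/cfs_period`, falling back to the host CPU count) could be sketched as follows; the function name, signature, and pre-parsed inputs are illustrative, not the actual crate code:

```rust
// Compute the CPU limit in nanocores from pre-parsed cgroup v1 values.
// cfs_quota_us of -1 (or None) means "no quota"; when neither cpuset nor
// quota is set, fall back to the host CPU count (e.g. from num_cpus).
fn cpu_limit_nanocores(
    cpuset_cpus: Option<f64>,  // CPU count parsed from cpuset.cpus
    cfs_quota_us: Option<i64>, // cpu.cfs_quota_us
    cfs_period_us: i64,        // cpu.cfs_period_us
    host_cpus: f64,            // host CPU count fallback
) -> u64 {
    let quota_cores = match cfs_quota_us {
        Some(q) if q > 0 => Some(q as f64 / cfs_period_us as f64),
        _ => None, // -1 or missing: no quota-based limit
    };
    let cores = match (cpuset_cpus, quota_cores) {
        (Some(a), Some(b)) => a.min(b),
        (Some(a), None) => a,
        (None, Some(b)) => b,
        (None, None) => host_cpus,
    };
    (cores * 1_000_000_000.0) as u64 // cores -> nanocores
}

fn main() {
    // 2 CPUs in cpuset, quota of 150ms per 100ms period -> 1.5 cores
    assert_eq!(
        cpu_limit_nanocores(Some(2.0), Some(150_000), 100_000, 4.0),
        1_500_000_000
    );
    // No cgroup limits at all -> host CPU count
    assert_eq!(cpu_limit_nanocores(None, None, 100_000, 4.0), 4_000_000_000);
}
```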
Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer. Detailed calculations are explained in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.
- `DD_USE_DOGSTATSD` is off
- `DD_ENHANCED_METRICS_ENABLED` is off

References: datadog-agent cgroup collection and calculation logic