
[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions #77

Open

kathiehuang wants to merge 35 commits into main from kathie.huang/add-cpu-enhanced-metrics

Conversation

@kathiehuang (Contributor) commented Feb 13, 2026

What does this PR do?

Adds CPU limit and usage enhanced metrics for Linux Azure Functions.

  • Adds a new datadog-metrics-collector crate that reads CPU metrics every 3 seconds and submits them to the Datadog backend every 10 seconds when DD_ENHANCED_METRICS_ENABLED=true (default on)
    • This creates an OS-agnostic CpuMetricsCollector struct and CpuStatsReader trait. Currently this only collects CPU metrics in Linux. CPU metrics in Windows will be completed in a future PR
  • Emits two new metrics:
    • azure.functions.enhanced.cpu.usage - container-level CPU consumption rate in nanocores, sourced from cpuacct.usage
    • azure.functions.enhanced.cpu.limit - CPU limit in nanocores, computed as min(cpuset.cpus, cfs_quota/cfs_period), falling back to host CPU count if no cgroup limit is set
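
For illustration, the limit derivation above can be sketched as follows. Function names (`parse_cpuset_cpus`, `cpu_limit_nanocores`) are illustrative, not the crate's actual API; only the cgroup v1 semantics (`-1` quota meaning "no limit", inclusive `cpuset.cpus` ranges) are taken from the description.

```rust
/// Parse a cgroup v1 `cpuset.cpus` string such as "0-1,4" into a CPU count.
fn parse_cpuset_cpus(s: &str) -> Option<u64> {
    let mut count = 0u64;
    for part in s.trim().split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi): (u64, u64) = (lo.parse().ok()?, hi.parse().ok()?);
                count += hi.checked_sub(lo)? + 1; // inclusive range: "0-1" -> 2
            }
            None => {
                part.parse::<u64>().ok()?; // validate the single-CPU entry
                count += 1;
            }
        }
    }
    Some(count)
}

/// min(cpuset, cfs_quota/cfs_period) in nanocores, falling back to host CPUs.
fn cpu_limit_nanocores(
    cpuset_cpus: Option<u64>,
    cfs_quota_us: Option<i64>,
    cfs_period_us: u64,
    host_cpus: u64,
) -> f64 {
    const NANOCORES_PER_CPU: f64 = 1_000_000_000.0;
    let quota_cpus = match cfs_quota_us {
        Some(q) if q > 0 => Some(q as f64 / cfs_period_us as f64),
        _ => None, // -1 (or an unreadable file) means no CFS quota is set
    };
    let limit_cpus = match (cpuset_cpus.map(|c| c as f64), quota_cpus) {
        (Some(a), Some(b)) => a.min(b),
        (Some(a), None) => a,
        (None, Some(b)) => b,
        (None, None) => host_cpus as f64, // no cgroup limit at all
    };
    limit_cpus * NANOCORES_PER_CPU
}
```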

Additional Notes

  • Tags attached to all CPU metrics:
    • Azure resource metadata from libdd-common:
    • resource_id
    • resource_group
    • subscription_id
    • name
    • Metadata from other environment variables:
      • region
      • plan_tier
      • service
      • env
      • version
      • serverless_compat_version
  • Categorizes azure.functions.* metrics as ServerlessEnhanced origin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in Datadog Metrics Summary
  • Sets up:
    • CgroupStats struct for reading statistics from cgroup v1 files
      • This normalizes the stats to nanoseconds
    • CpuStats struct to store the computed CPU total and limit metrics
      • Converts u64 values to f64
      • Calculates CPU limit percentage
  • Separates start_dogstatsd into two functions
    • start_aggregator, which starts the aggregator service and metrics flusher
    • start_dogstatsd_listener, which enables custom metrics to be received from user code
      • This separation enables enhanced metrics to be submitted to the aggregator service and flushed even when DD_USE_DOGSTATSD is off
  • Metrics are submitted as distribution metrics because not all metrics have tags with a unique identifier from the instance they are sent from
  • If the collector cannot read the cgroup files successfully, it will not submit enhanced metrics for that interval and log accordingly
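
A minimal sketch of the CpuStats shape described above; field and method names are assumptions, not the crate's real definitions.

```rust
// Illustrative shapes only: field and method names are assumptions.
struct CpuStats {
    /// Cumulative CPU usage in nanoseconds, from cpuacct.usage.
    total: f64,
    /// CPU limit in nanocores.
    limit: f64,
}

impl CpuStats {
    /// Widen the raw u64 cgroup readings to f64 for metric submission.
    fn from_raw(total_ns: u64, limit_nanocores: u64) -> Self {
        Self {
            total: total_ns as f64,
            limit: limit_nanocores as f64,
        }
    }

    /// Limit as a percentage of one core: two full cores -> 200%, matching
    /// the "CPU limit: 200%" debug log in the QA section of this PR.
    fn limit_percent(&self) -> f64 {
        self.limit / 1_000_000_000.0 * 100.0
    }
}
```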

Motivation

https://datadoghq.atlassian.net/browse/SVLS-8351

Describe how to test/QA your changes

Build with serverless-compat-self-monitoring.

Added debug logs to verify calculations:

DEBUG datadog_trace_agent::metrics_collector: Contents of /sys/fs/cgroup/cpuset/cpuset.cpus: 0-1
DEBUG datadog_trace_agent::metrics_collector: Range: ["0", "1"]
DEBUG datadog_trace_agent::metrics_collector: Total CPU count: 2
DEBUG datadog_trace_agent::metrics_collector: CFS scheduler quota is -1, setting to None
DEBUG datadog_trace_agent::metrics_collector: Could not read scheduler quota from /sys/fs/cgroup/cpu/cpu.cfs_quota_us
DEBUG datadog_trace_agent::metrics_collector: No CPU limit found, defaulting to host CPU count: 2 CPUs
DEBUG datadog_trace_agent::metrics_collector: Collected cpu stats!
DEBUG datadog_trace_agent::metrics_collector: CPU usage: 9871234519
DEBUG datadog_trace_agent::metrics_collector: CPU limit: 200%, defaulted: true
DEBUG datadog_trace_agent::metrics_collector: Submitting CPU metrics!
  • Oftentimes the scheduler quota cannot be read from cpu.cfs_quota_us, so the collector falls back to the host CPU count from the num_cpus crate
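
The quota fallback shown in these logs can be sketched roughly as below; `read_cfs_quota` is an illustrative name, and only the cgroup v1 behavior (-1 means "no quota"; an unreadable file is treated the same way before falling back to the num_cpus host count) is taken from the logs.

```rust
use std::fs;

// Illustrative sketch of the fallback path in the debug logs above.
fn read_cfs_quota(path: &str) -> Option<i64> {
    let contents = fs::read_to_string(path).ok()?; // the "Could not read" branch
    let quota: i64 = contents.trim().parse().ok()?;
    if quota < 0 {
        None // "CFS scheduler quota is -1, setting to None"
    } else {
        Some(quota)
    }
}
```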

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All Linux hosting plans were tested to verify metrics submit correctly, and a Windows function was tested to verify that the current lack of Windows support is logged.

Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer. Detailed calculations are explained in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.

  • Verified CPU usage increases with a CPU-intensive function
  • Verified CPU limit increases when the hosting plan is scaled up
  • Verified enhanced metrics still send when DD_USE_DOGSTATSD is off
  • Verified custom metrics still send when DD_ENHANCED_METRICS_ENABLED is off

References: datadog-agent cgroup collection and calculation logic

@kathiehuang kathiehuang changed the title [SVLS-8351] Add CPU Enhanced Metrics [SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions Mar 6, 2026
@kathiehuang kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch from b987e0d to 7010e35 on March 6, 2026 19:39
@kathiehuang kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch from 7010e35 to 5953d68 on March 6, 2026 20:38
@DataDog DataDog deleted a comment from chatgpt-codex-connector bot Mar 7, 2026
@kathiehuang (Contributor Author):

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6a55dc810

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@kathiehuang (Contributor Author):

@codex review

@kathiehuang (Contributor Author):

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 071d2f0488


Comment on lines +240 to +242
tokio::spawn(async move {
metrics_flusher.flush().await;
});


P1: Serialize metrics flushes to prevent overlapping upload tasks

This change spawns a new flush() task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because Flusher::flush performs network I/O with retries, this can happen in production and causes unnecessary task/socket/memory pressure; the previous logic awaited each flush and guaranteed only one in-flight flush at a time.


@kathiehuang (Contributor Author):

The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure if it's needed
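
To make the concern concrete, here is a std-thread analogue (illustrative only; the real code uses tokio tasks): spawning a flush per tick lets slow flushes overlap, while joining each flush before the next tick caps in-flight work at one.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Returns the peak number of concurrently running "flushes" observed.
fn max_in_flight_flushes(serialized: bool, ticks: usize, flush_ms: u64) -> usize {
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();
    for _ in 0..ticks {
        let (inf, pk) = (Arc::clone(&in_flight), Arc::clone(&peak));
        let handle = thread::spawn(move || {
            let now = inf.fetch_add(1, Ordering::SeqCst) + 1;
            pk.fetch_max(now, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(flush_ms)); // the slow flush
            inf.fetch_sub(1, Ordering::SeqCst);
        });
        if serialized {
            handle.join().unwrap(); // await each flush before the next tick
        } else {
            handles.push(handle); // fire-and-forget: flushes can overlap
        }
        thread::sleep(Duration::from_millis(1)); // interval tick
    }
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```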

@kathiehuang kathiehuang marked this pull request as ready for review March 9, 2026 17:26
@kathiehuang kathiehuang requested review from a team as code owners March 9, 2026 17:26
@kathiehuang kathiehuang requested review from Lewis-E, duncanpharvey and lym953 and removed request for a team March 9, 2026 17:26
@duncanista (Contributor) left a comment

I'd suggest using features for OS specific business logic

Also suggest checking how ADP is doing agent checks in Rust; this sounds like an agent check for a very specific use case

use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;

Suggested change
const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;

@kathiehuang (Contributor Author):

Updated in aae174a

);

if let Err(e) = self.aggregator.insert_batch(vec![usage_metric]) {
error!("Failed to insert CPU usage metric: {}", e);
@Lewis-E (Contributor) commented Mar 9, 2026:

In what situations would we see this error? Would we hit this repeatedly or can the aggregator recover from errors quickly? (Also applies to line 111)

@kathiehuang (Contributor Author):

insert_batch calls tx.send, which is on an unbounded channel that has infinite capacity. An error will only happen if the receive half of the channel is closed or dropped, which means the aggregator service isn't working anymore and every subsequent call should also fail. This means that metrics would stop sending, with error logs on every attempted insert. It seems the only way to recover would be for the customer to stop and start their function app to restart the agent

Error logging but continuing is what the lambda extension does

If we're worried about log spam, I could change this to return early on the CPU usage metric insert failure - this would halve the error logs

Or maybe a better solution would be to have collect_and_submit return a Result, and main.rs could set cpu_collector=None on error?
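
For reference, a std unbounded channel exhibits the same failure mode in a small sketch: sends succeed while the receiver (the aggregator side) is alive, and every send after the receiver is dropped returns an error.

```rust
use std::sync::mpsc;

// Illustrative only: mirrors the insert_batch/tx.send behavior described above.
fn send_before_and_after_receiver_drop() -> (bool, bool) {
    let (tx, rx) = mpsc::channel::<u64>();
    let before = tx.send(1).is_ok(); // receiver alive: insert succeeds
    drop(rx); // aggregator service gone; channel is closed
    let after = tx.send(2).is_ok(); // every subsequent send now fails
    (before, after)
}
```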

@Lewis-E (Contributor) commented Mar 9, 2026:

So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?

("name", aas_metadata.get_site_name()),
];
for (name, value) in aas_tags {
if value != "unknown" {

might be good to use the actual constant UNKNOWN_VALUE, so people can click through to where this is coming from?

@kathiehuang (Contributor Author):

Ooh good point, but it looks like UNKNOWN_VALUE is a private constant so it's not accessible here
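
A local constant could stand in for the private upstream UNKNOWN_VALUE; a sketch of the filtering pattern under discussion (names are illustrative):

```rust
// Stand-in for the private upstream constant; illustrative only.
const UNKNOWN_VALUE: &str = "unknown";

/// Keep only tags whose value was actually resolved.
fn filter_known_tags(tags: &[(&str, &str)]) -> Vec<String> {
    tags.iter()
        .filter(|(_, value)| *value != UNKNOWN_VALUE)
        .map(|(name, value)| format!("{name}:{value}"))
        .collect()
}
```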

@duncanpharvey (Collaborator) left a comment

Excellent work! I added a few suggestions to consider

Comment on lines +36 to +37
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"

Are these dependencies needed in datadog-trace-agent?

@kathiehuang (Contributor Author):

Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!

let (metrics_flusher, aggregator_handle) = if needs_aggregator {
debug!("Creating metrics flusher and aggregator");

let (flusher, handle) =

I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).

Maybe a unit test as well to assert that all of these combinations are covered?

@kathiehuang (Contributor Author):

Good point! I added a comment in 60cdecf

It seems like it'll be a little hard to write a meaningful unit test, since the aggregator/dogstatsd startup logic has side effects that make it hard to test in isolation. Maybe I could refactor the startup decision into a struct that describes what to start, separating the decision from the execution?

struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
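
One way to write the suggested combination test, with resolve_agent_config copied from the sketch above so this compiles standalone:

```rust
// Copied from the proposal above so the sketch is self-contained.
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

/// Exercise all four flag combinations, as the review suggested.
fn all_combinations_covered() -> bool {
    [(false, false), (false, true), (true, false), (true, true)]
        .into_iter()
        .all(|(dogstatsd, enhanced)| {
            let cfg = resolve_agent_config(dogstatsd, enhanced);
            // the aggregator must run iff anything will submit metrics to it
            cfg.start_aggregator == (dogstatsd || enhanced)
                && cfg.start_dogstatsd == dogstatsd
                && cfg.start_enhanced_metrics == enhanced
        })
}
```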

Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {

Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?

@kathiehuang (Contributor Author):

That makes sense! I moved it in f867f6f

@kathiehuang (Contributor Author):

I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!

https://datadoghq.atlassian.net/browse/SVLS-8699
