[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions #77
kathiehuang wants to merge 35 commits into main from
Conversation
Force-pushed from b987e0d to 7010e35
…file reading and cpu stats calculation functions
…n OS-agnostic, create separate crates for Windows and Linux for reading raw CPU data
Force-pushed from 7010e35 to 5953d68
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6a55dc810
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 071d2f0488
tokio::spawn(async move {
    metrics_flusher.flush().await;
});
Serialize metrics flushes to prevent overlapping upload tasks
This change spawns a new flush() task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because Flusher::flush performs network I/O with retries, this can happen in production and causes unnecessary task/socket/memory pressure; the previous logic awaited each flush and guaranteed only one in-flight flush at a time.
The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure it's needed.
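To illustrate the trade-off discussed above, here is a std-only sketch (names like `flush` and the timings are illustrative, not the actual flusher code) contrasting a spawned flush per tick, which can overlap, with the previous serialized behavior that awaits each flush before the next tick:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Simulated flush: track how many are in flight at once.
fn flush(in_flight: &AtomicUsize, max_seen: &AtomicUsize) {
    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
    max_seen.fetch_max(now, Ordering::SeqCst);
    thread::sleep(Duration::from_millis(50)); // slow "network" flush
    in_flight.fetch_sub(1, Ordering::SeqCst);
}

fn main() {
    let in_flight = Arc::new(AtomicUsize::new(0));
    let max_seen = Arc::new(AtomicUsize::new(0));

    // Spawning per tick (the reviewed code): flushes can pile up when
    // a flush outlasts the tick interval.
    let mut handles = Vec::new();
    for _ in 0..5 {
        let (i, m) = (Arc::clone(&in_flight), Arc::clone(&max_seen));
        handles.push(thread::spawn(move || flush(&i, &m)));
        thread::sleep(Duration::from_millis(1)); // tick shorter than flush
    }
    for h in handles {
        h.join().unwrap();
    }
    assert!(max_seen.load(Ordering::SeqCst) > 1); // overlap observed

    // Serialized (the previous logic): wait for each flush to finish
    // before ticking again, so at most one flush is ever in flight.
    max_seen.store(0, Ordering::SeqCst);
    for _ in 0..3 {
        flush(&in_flight, &max_seen);
    }
    assert_eq!(max_seen.load(Ordering::SeqCst), 1);
}
```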
duncanista
left a comment
I'd suggest using features for OS-specific business logic.
I'd also suggest checking how ADP is doing agent checks in Rust; this sounds like an agent check for a very specific use case.
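For context, one common way to gate OS-specific logic at compile time is `cfg` attributes (Cargo features work similarly via `#[cfg(feature = "...")]`). The module and function names below are illustrative, not taken from the PR:

```rust
// Linux implementation: would read raw CPU data from cgroup files.
#[cfg(target_os = "linux")]
mod cpu_stats {
    pub fn read_raw_cpu() -> Option<u64> {
        // e.g. parse cpuacct.usage here; stubbed for illustration
        Some(0)
    }
}

// Windows implementation: stubbed until Windows support lands.
#[cfg(target_os = "windows")]
mod cpu_stats {
    pub fn read_raw_cpu() -> Option<u64> {
        None
    }
}

fn main() {
    // The caller compiles against exactly one implementation per target,
    // with no runtime OS checks.
    let _ = cpu_stats::read_raw_cpu();
}
```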
use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
Suggested change:
- const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
+ const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;
);

if let Err(e) = self.aggregator.insert_batch(vec![usage_metric]) {
    error!("Failed to insert CPU usage metric: {}", e);
In what situations would we see this error? Would we hit this repeatedly or can the aggregator recover from errors quickly? (Also applies to line 111)
insert_batch calls tx.send on an unbounded channel with infinite capacity. An error only happens if the receive half of the channel is closed or dropped, which means the aggregator service isn't working anymore and every subsequent call will also fail. In that case metrics would stop sending, with error logs on every attempted insert; it seems the only way to recover would be for the customer to stop and start their function app to restart the agent.
Error logging but continuing is what the lambda extension does
If we're worried about log spam, I could change this to return early on the CPU usage metric insert failure - this would halve the error logs
Or maybe a better solution would be to have collect_and_submit return a Result, and main.rs could set cpu_collector=None on error?
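That last idea could look roughly like this hedged sketch, where `collect_and_submit` returns a `Result` and the caller drops the collector after an unrecoverable aggregator error (all types and fields here are illustrative stand-ins, not the real crate API):

```rust
// Illustrative error type for a failed aggregator insert.
#[derive(Debug)]
struct InsertError;

struct CpuMetricsCollector {
    // Stand-in for "the receive half of the aggregator channel is alive".
    aggregator_alive: bool,
}

impl CpuMetricsCollector {
    // Returns Err once the aggregator can no longer accept metrics.
    fn collect_and_submit(&self) -> Result<(), InsertError> {
        if self.aggregator_alive {
            Ok(())
        } else {
            Err(InsertError)
        }
    }
}

fn main() {
    let mut cpu_collector = Some(CpuMetricsCollector { aggregator_alive: false });

    // In the collection loop: on error, disable the collector instead of
    // logging the same failure on every subsequent tick.
    if let Some(c) = &cpu_collector {
        if c.collect_and_submit().is_err() {
            cpu_collector = None;
        }
    }
    assert!(cpu_collector.is_none());
}
```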
So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?
    ("name", aas_metadata.get_site_name()),
];
for (name, value) in aas_tags {
    if value != "unknown" {
Might be good to use the actual constant UNKNOWN_VALUE, so people can click through to where this is coming from?
Ooh good point, but it looks like UNKNOWN_VALUE is a private constant so it's not accessible here
duncanpharvey
left a comment
Excellent work! I added a few suggestions to consider
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"
Are these dependencies needed in datadog-trace-agent?
Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!
let (metrics_flusher, aggregator_handle) = if needs_aggregator {
    debug!("Creating metrics flusher and aggregator");

    let (flusher, handle) =
I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).
Maybe a unit test as well to assert that all of these combinations are covered?
Good point! I added a comment in 60cdecf
It seems like it would be hard to write a meaningful unit test, since the aggregator/dogstatsd startup logic has side effects that make it hard to test in isolation. Maybe I could refactor the startup decision into a struct that describes what to start, separating the decision from the execution?
struct AgentConfig {
start_aggregator: bool,
start_dogstatsd: bool,
start_enhanced_metrics: bool,
}
fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
AgentConfig {
start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
start_dogstatsd: dd_use_dogstatsd,
start_enhanced_metrics: dd_enhanced_metrics,
}
}
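With that refactor, a unit test covering all four flag combinations (the concern raised above) becomes straightforward. A possible sketch, reusing the `resolve_agent_config` function from the proposal:

```rust
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        // The aggregator is needed if either dogstatsd or enhanced
        // metrics is enabled.
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

fn main() {
    // Exercise all four combinations of the two configuration flags.
    for (dogstatsd, enhanced) in [(false, false), (false, true), (true, false), (true, true)] {
        let c = resolve_agent_config(dogstatsd, enhanced);
        assert_eq!(c.start_aggregator, dogstatsd || enhanced);
        assert_eq!(c.start_dogstatsd, dogstatsd);
        assert_eq!(c.start_enhanced_metrics, enhanced);
    }
}
```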
    Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {
Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?
I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!
What does this PR do?
Adds CPU limit and usage enhanced metrics for Linux Azure Functions.
- Adds a `datadog-metrics-collector` crate that reads CPU metrics every 3 seconds and submits them to the Datadog backend every 10 seconds when `DD_ENHANCED_METRICS_ENABLED=true` (default on)
- Adds a `CpuMetricsCollector` struct and `CpuStatsReader` trait. Currently this only collects CPU metrics in Linux; CPU metrics in Windows will be completed in a future PR
- `azure.functions.enhanced.cpu.usage` - container-level CPU consumption rate in nanocores, sourced from `cpuacct.usage`
- `azure.functions.enhanced.cpu.limit` - CPU limit in nanocores, computed as `min(cpuset.cpus, cfs_quota/cfs_period)`, falling back to host CPU count if no cgroup limit is set

Additional Notes
- Adds the following tags from `libdd-common`: `resource_id`, `resource_group`, `subscription_id`, `name`, `region`, `plan_tier`, `service`, `env`, `version`, `serverless_compat_version`
- Registers `azure.functions.*` metrics as `ServerlessEnhanced` origin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in Datadog Metrics Summary
- Splits `start_dogstatsd` into two functions:
  - `start_aggregator`, which starts the aggregator service and metrics flusher
  - `start_dogstatsd_listener`, which enables custom metrics to be received from user code and is skipped when `DD_USE_DOGSTATSD` is off

Motivation
https://datadoghq.atlassian.net/browse/SVLS-8351
Describe how to test/QA your changes
Build with serverless-compat-self-monitoring.
Added debug logs to verify calculations:
- …`cpu.cfs_quota_us`, so it falls back to the host CPU count from the `num_cpus` crate

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All hosting plans in Linux were tested to verify metrics submit correctly, and a Windows function was tested to ensure the lack of current Windows support is logged.
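For reference, the limit calculation described in this PR (minimum of `cpuset.cpus` and `cfs_quota/cfs_period`, falling back to the host CPU count) could be sketched as follows; the function name, signature, and pre-parsed inputs are illustrative, not the actual crate code:

```rust
// Compute the CPU limit in nanocores from pre-parsed cgroup v1 values.
// cfs_quota_us of -1 (or None) means "no quota"; when neither cpuset nor
// quota is set, fall back to the host CPU count (e.g. from num_cpus).
fn cpu_limit_nanocores(
    cpuset_cpus: Option<f64>,  // CPU count parsed from cpuset.cpus
    cfs_quota_us: Option<i64>, // cpu.cfs_quota_us
    cfs_period_us: i64,        // cpu.cfs_period_us
    host_cpus: f64,            // host CPU count fallback
) -> u64 {
    let quota_cores = match cfs_quota_us {
        Some(q) if q > 0 => Some(q as f64 / cfs_period_us as f64),
        _ => None, // -1 or missing: no quota-based limit
    };
    let cores = match (cpuset_cpus, quota_cores) {
        (Some(a), Some(b)) => a.min(b),
        (Some(a), None) => a,
        (None, Some(b)) => b,
        (None, None) => host_cpus,
    };
    (cores * 1_000_000_000.0) as u64 // cores -> nanocores
}

fn main() {
    // 2 CPUs in cpuset, quota of 150ms per 100ms period -> 1.5 cores
    assert_eq!(
        cpu_limit_nanocores(Some(2.0), Some(150_000), 100_000, 4.0),
        1_500_000_000
    );
    // No cgroup limits at all -> host CPU count
    assert_eq!(cpu_limit_nanocores(None, None, 100_000, 4.0), 4_000_000_000);
}
```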
Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer. Detailed calculations are explained in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.
- `DD_USE_DOGSTATSD` is off
- `DD_ENHANCED_METRICS_ENABLED` is off

References: datadog-agent cgroup collection and calculation logic