[AI-4964] Add support for timeout in process isolation#22975
[AI-4964] Add support for timeout in process isolation#22975sarah-witt wants to merge 7 commits intomasterfrom
Conversation
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🔗 Commit SHA: d1675a4 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
Codecov Report❌ Patch coverage is Additional details and impacted files🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 568125556b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| def _kill_on_timeout(): | ||
| nonlocal timed_out | ||
| timed_out = True | ||
| process.kill() |
There was a problem hiding this comment.
Terminate the entire isolated process tree on timeout
process.kill() only stops the wrapper Python process here. Checks can legitimately spawn child commands (for example process/datadog_checks/process/process.py:244-246 and tibco_ems/datadog_checks/tibco_ems/tibco_ems.py:81-88), and with process_isolation_timeout enabled those descendants will keep running after the wrapper is killed because this subprocess was not started in its own session/process group. In those integrations, repeated timeouts would leak orphaned commands and the check is not actually "stopped" as advertised.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is a good point however the process_isolation feature is still experimental and the timeout in particular only will be used on vSphere, which doesn't spawn processes. I added a TODO, but reviewers let me know if I should address!
What does this PR do?
Adds a parameter
process_isolation_timeoutto be used withprocess_isolationthat sets a timeout for a check running in a new process. If the check exceeds the timeout, it is stopped and will be rescheduled.Motivation
We have a case where a third party library causes a check to hang sporadically. This is due to intermittent network errors, and a rerun of the check is sufficient to fix the issue. Currently, the only way to stop the check is to restart the agent. If we run the check using process isolation and have a timeout for the process, the check will timeout after hanging, and get rescheduled.
Review checklist (to be filled by reviewers)
qa/skip-qalabel if the PR doesn't need to be tested during QA.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged