Add robust trial status polling #4750

bernardbeckerman · 2026-01-09T19:49:18Z

Summary:
Adds a robust wrapper method for poll_trial_status that:

1. Falls back to polling trials one at a time if the batched poll_trial_status call fails.
2. Marks as abandoned each trial whose individual status poll also fails.

This is important for runners whose status-polling logic can raise sporadic exceptions: a single failed poll should not bring down the entire orchestrator process.
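The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this diff: `Status`, `poll_status_robustly`, and `flaky_poll` are hypothetical stand-ins, trials are represented by plain integer indices, and the sketch does not import Ax. It assumes the poll function returns a mapping from status to a set of trial indices.

```python
from collections import defaultdict
from enum import Enum
from typing import Callable, Dict, Iterable, Set


class Status(Enum):
    """Hypothetical stand-in for a trial status enum."""

    RUNNING = "running"
    COMPLETED = "completed"
    ABANDONED = "abandoned"


# Assumed shape of a poll function: trial indices in, status -> indices out.
PollFn = Callable[[Iterable[int]], Dict[Status, Set[int]]]


def poll_status_robustly(
    poll: PollFn, trial_indices: Iterable[int]
) -> Dict[Status, Set[int]]:
    """Poll all trials at once; on failure, poll them one at a time."""
    indices = list(trial_indices)
    try:
        # Happy path: a single batched call.
        return poll(indices)
    except Exception:
        # Fall through to per-trial polling below.
        pass

    result: Dict[Status, Set[int]] = defaultdict(set)
    for idx in indices:
        try:
            single = poll([idx])
        except Exception:
            # Individual poll failed too: treat this trial as abandoned.
            result[Status.ABANDONED].add(idx)
            continue
        for status, found in single.items():
            result[status].update(found)
    return dict(result)


def flaky_poll(indices: Iterable[int]) -> Dict[Status, Set[int]]:
    """Hypothetical runner poll that raises whenever trial 2 is included."""
    indices = list(indices)
    if 2 in indices:
        raise RuntimeError("backend error for trial 2")
    return {Status.RUNNING: set(indices)} if indices else {}


statuses = poll_status_robustly(flaky_poll, [0, 1, 2])
```

Here the batched call fails because trial 2 is in the batch, so trials 0 and 1 are recovered individually as running and only trial 2 ends up abandoned. A real implementation would likely also log the swallowed exceptions.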

Alternatives considered:

* **Institute a PollStatusResult analogous to MetricFetchResult.** Decided against because I don't think this logic belongs in runner.poll_trial_status, since it's good to have this centralized instead of replicated across all runners' poll_trial_status methods. Also this would require migration across all runner.poll_trial_status methods.
* **Add this to Orchestrator.poll_trial_status (V1 of this diff).** This could make sense because this logic is mainly for use in the Orchestrator anyway, but I prefer keeping retry logic closer to where polling happens.

Differential Revision: D90390842


meta-codesync · 2026-01-09T19:49:26Z

@bernardbeckerman has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90390842.

codecov-commenter · 2026-01-09T20:20:50Z

Codecov Report

❌ Patch coverage is 96.90722% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.72%. Comparing base (5dccf83) to head (ad5a8f5).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ax/core/tests/test_runner.py | 92.68% | 3 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4750   +/-   ##
=======================================
  Coverage   96.72%   96.72%           
=======================================
  Files         582      582           
  Lines       60718    60813   +95     
=======================================
+ Hits        58732    58824   +92     
- Misses       1986     1989    +3


Summary:
Adds a robust wrapper method for poll_trial_status that:

1. Falls back to polling trials one at a time if the batched poll_trial_status call fails.
2. Marks as abandoned each trial whose individual status poll also fails.

This is important for runners whose status-polling logic can raise sporadic exceptions: a single failed poll should not bring down the entire orchestrator process.

Alternatives considered:

* **Institute a PollStatusResult analogous to [MetricFetchResult](https://fburl.com/code/yyp9j66m).** Decided against because I don't think this logic belongs in runner.poll_trial_status, since it's good to have this centralized instead of replicated across all runners' poll_trial_status methods. Also this would require migration across all runner.poll_trial_status methods.
* **Add this to Orchestrator.poll_trial_status (V1 of this diff).** This could make sense because this logic is mainly for use in the Orchestrator anyway, but I prefer keeping retry logic closer to where polling happens.

Also considered making this configurable, but opted for an opinionated solution that avoids config bloat.

Differential Revision: D90532278

meta-cla bot added the CLA Signed label Jan 9, 2026

meta-codesync bot added the fb-exported and meta-exported labels Jan 9, 2026
