Nina autotune timeout by NinaCai · Pull Request #38 · AI-Hypercomputer/accelerator-agents

NinaCai · 2026-05-11T21:46:44Z

Change autotune timeout to 1.5 hours.
Change per-kernel run timeout to 300s.

google-cla · 2026-05-11T21:46:54Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

shangkunwang01 · 2026-05-11T22:11:08Z

  code_template: str
  search_space: dict[str, list]
-  timeout: Optional[int] = 30
+  timeout: Optional[int] = 300


When there are many combinations in your grid search, the eval server may time out while the tpu server is still running. Should we make this timeout dynamic (= autotune_time_out / #combinations)?

NinaCai · 2026-05-11

I think the logic should be: when eval_server times out, kill all processes in the grid search. This timeout should be roughly how long one kernel run takes, and it doesn't need to be tied to the total timeout in eval_server.

shangkunwang01 · 2026-05-12 (edited)

That would be best, but I doubt the current implementation can do this. When eval_server times out, the tpu_server does not automatically shut down the grid search. That's why I want each grid_search timeout to be at most autotune_time_out / #combinations.
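
For illustration, the per-combination timeout proposed in this thread is the total autotune budget divided by the number of grid points. The sketch below is not code from the PR: `AUTOTUNE_TIMEOUT_S`, `count_combinations`, `dynamic_timeout`, and the `block_m`/`block_n` keys are hypothetical names, the 1.5-hour budget is taken from the PR description, and the grid is assumed to be a full Cartesian product over `search_space`.

```python
import math

# Assumption: total autotune budget of 1.5 hours, per the PR description.
AUTOTUNE_TIMEOUT_S = 5400


def count_combinations(search_space: dict[str, list]) -> int:
    """Number of points in a full Cartesian grid over the search space."""
    return math.prod(len(values) for values in search_space.values())


def dynamic_timeout(search_space: dict[str, list],
                    total_timeout_s: float = AUTOTUNE_TIMEOUT_S) -> float:
    """Per-combination timeout: total budget split evenly across the grid."""
    n = count_combinations(search_space)
    if n == 0:
        raise ValueError("search space has no combinations")
    return total_timeout_s / n


# Hypothetical search spaces, used only to show how the budget scales.
small = {"block_m": [64, 128, 256], "block_n": [64, 128, 256]}
large = {"block_m": list(range(30)), "block_n": list(range(30))}

print(dynamic_timeout(small))  # 9 combinations   -> 600.0 seconds each
print(dynamic_timeout(large))  # 900 combinations -> 6.0 seconds each
```

The per-run budget shrinks linearly with the size of the grid, so a large search space leaves each kernel run only a few seconds.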

NinaCai · 2026-05-12

Right, this will be added in the next PR. Hanging subprocesses are a bug for all eval types, not just autotune. A dynamic timeout would be worse when there are too many combinations and each process gets a very short timeout: then none of these processes can actually produce a result.
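
The cleanup behavior discussed in this thread (a fixed per-run timeout, plus stopping the grid search once the overall budget is spent) could look roughly like the sketch below. It is not the implementation from this PR or the planned follow-up: `run_grid` and its parameters are hypothetical, the runs are assumed to be launched sequentially as child processes, and only the direct child is killed (a real fix may also need to handle grandchildren, for example through process groups).

```python
import subprocess
import sys
import time


def run_grid(commands, per_run_timeout_s, total_timeout_s):
    """Run commands one by one under a per-run timeout and a total deadline."""
    deadline = time.monotonic() + total_timeout_s
    statuses = []
    for cmd in commands:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Overall budget is spent: do not launch any further runs.
            statuses.append("skipped")
            continue
        proc = subprocess.Popen(cmd)
        try:
            # Each run gets the fixed per-run timeout, capped by what is
            # left of the overall budget.
            proc.wait(timeout=min(per_run_timeout_s, remaining))
            statuses.append("ok" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            proc.kill()  # do not leave a hanging subprocess behind
            proc.wait()  # reap the killed child
            statuses.append("timeout")
    return statuses


# Stand-ins for kernel runs: one that hangs and one that finishes at once.
hang = [sys.executable, "-c", "import time; time.sleep(60)"]
quick = [sys.executable, "-c", "pass"]

# With a 1-second total budget, the hanging run is killed at the deadline
# and the remaining run is never started.
print(run_grid([hang, quick], per_run_timeout_s=300, total_timeout_s=1))
```

Keeping the per-run timeout independent of the grid size means each run still gets enough time to finish, while the total deadline bounds the whole search.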

NinaCai and others added 2 commits May 11, 2026 21:26

add timeout

5dfb68b

make autotune timeout a constant

52c922f

NinaCai requested a review from shangkunwang01 May 11, 2026 21:46

shangkunwang01 reviewed May 11, 2026

Comment thread MaxKernel/hitl_agent/server_utils/eval_server.py Outdated

shangkunwang01 reviewed May 11, 2026

shangkunwang01 requested changes May 11, 2026

use timeout based on the eval type

35541b2

NinaCai requested a review from shangkunwang01 May 11, 2026 23:09

shangkunwang01 requested changes May 12, 2026

NinaCai wants to merge 3 commits into main from nina-autotune-timeout
