Nina autotune timeout #38
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
  code_template: str
  search_space: dict[str, list]
- timeout: Optional[int] = 30
+ timeout: Optional[int] = 300
It is possible that the TPU server is still running when the eval server times out, if there are many combinations in your grid search. Should we make this timeout dynamic (= autotune_time_out / #combinations)?
I think the logic should be: when eval_server times out, just kill all processes in the grid search. This timeout should be roughly how long a single kernel runs, and it doesn't need to be tied to the total timeout in eval_server.
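To make the proposal above concrete, here is a minimal sketch of a fixed per-kernel timeout combined with an overall deadline that stops the whole grid search. It is not the code in this PR: `run_grid`, its parameters, and the sequential execution model are all assumptions made for illustration, and the real eval_server/tpu_server split may need a different mechanism.

```python
import subprocess
import sys
import time


def run_grid(commands, per_run_timeout_s, total_timeout_s):
    """Run each command in turn (hypothetical helper, not from this PR).

    Each run gets a fixed timeout; once the overall deadline has passed,
    the running process is killed and the remaining runs are skipped.
    """
    deadline = time.monotonic() + total_timeout_s
    results = []
    for cmd in commands:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Overall budget exhausted: do not start any more runs.
            results.append("skipped")
            continue
        proc = subprocess.Popen(cmd)
        try:
            # Wait at most the per-run timeout, capped by what is left overall.
            proc.wait(timeout=min(per_run_timeout_s, remaining))
            results.append("ok" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            proc.kill()   # make sure no hung subprocess is left behind
            proc.wait()   # reap the killed process
            results.append("timeout")
    return results
```

With this shape the per-run timeout stays independent of the number of combinations; only the overall deadline decides when the search as a whole stops.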
That would be best, but I doubt the current implementation can achieve this.
When eval_server times out, tpu_server will not automatically shut down the grid search. That's why I want each grid_search timeout to be at most autotune_time_out / #combinations.
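For reference, the dynamic timeout suggested here could be computed roughly as below. This is only a sketch: the helper names are hypothetical, and it assumes the grid is the full Cartesian product of the lists in `search_space`.

```python
import math


def count_combinations(search_space: dict[str, list]) -> int:
    """Number of points in the grid, assuming a full Cartesian product."""
    return math.prod(len(values) for values in search_space.values())


def per_combination_timeout(autotune_timeout_s: float,
                            search_space: dict[str, list]) -> float:
    """Split the total autotune budget evenly across all combinations."""
    n = count_combinations(search_space)
    if n == 0:
        raise ValueError("search space has a parameter with no candidate values")
    return autotune_timeout_s / n
```

For example, a 5400 s (1.5 hour) budget over a grid of 3 x 2 = 6 combinations gives 900 s per run, while the same budget over 1000 combinations leaves only 5.4 s per run.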
Right, this will be added in the next PR. Subprocess hanging is a bug that affects all eval types, not just autotune. A dynamic timeout would be worse when there are too many combinations, because each process would get a very short timeout and none of them could actually produce a result.
Change autotune timeout to 1.5 hours.
Change per kernel run timeout to 300s.