Add template and scaling logic #6414
vofish wants to merge 4 commits into GoogleCloudPlatform:master from
Conversation
vofish commented on Jan 30, 2026
- Add deployment template
- Add basic scale up and down logic
```python
kubernetes_scale_benchmark.Cleanup(bm_spec)
container_service.RunRetryableKubectlCommand(
    ['delete', 'deployment', 'app'],
    timeout=kubernetes_scale_benchmark._GetScaleTimeout(),
```
We could consider sharing the command, but both the timeout (here you're sharing one that scales off pod count) and the name ('app' instead of 'scaleup') need to differ, so I'll leave it to you whether to refactor into a function taking two arguments, timeout and name, or just duplicate.
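If you go the shared-helper route, it could look something like this (the helper name is just illustrative, not from the PR):

```python
def _DeleteDeployment(name: str, timeout: int) -> None:
  """Deletes the named deployment, tolerating transient kubectl failures.

  Rough sketch only; it reuses RunRetryableKubectlCommand the same way the
  cleanup path above does.
  """
  container_service.RunRetryableKubectlCommand(
      ['delete', 'deployment', name],
      timeout=timeout,
  )
```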
```python
samples = kubernetes_scale_benchmark.ParseStatusChanges(
    'node',
    start_time,
    resources_to_ignore=set(),
```
Should the initial node(s) be ignored?
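If so, one option is to capture the node names that exist before the first scale-up and pass them in; a rough sketch (the `kubectl get nodes -o name` parsing below is an assumption, not code from this PR):

```python
# Record the nodes that exist before any scaling happens.
stdout, _, _ = container_service.RunKubectlCommand(
    ['get', 'nodes', '-o', 'name'], suppress_logging=True)
initial_node_names = {line.split('/')[-1] for line in stdout.splitlines() if line}

# Exclude those pre-existing nodes from the status-change samples.
samples = kubernetes_scale_benchmark.ParseStatusChanges(
    'node',
    start_time,
    resources_to_ignore=initial_node_names,
)
```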
```python
])
kubernetes_commands.WaitForRollout(
    'deployment/app',
    timeout=kubernetes_scale_benchmark._GetScaleTimeout(),
```
Need to swap the timeout usage, since that one scales with the kubernetes_scale_benchmark flags. Possibly have that function take a variable like "number of nodes/pods" that defaults to the kubernetes_scale_benchmark flags. Also remove the leading underscore, since you're making it public. Otherwise, make a new function in this benchmark.
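Something along these lines in kubernetes_scale_benchmark, purely as a sketch (the constants and flag fallback are illustrative, not the real values):

```python
from typing import Optional

_BASE_TIMEOUT_SECONDS = 10 * 60       # illustrative
_PER_RESOURCE_TIMEOUT_SECONDS = 30    # illustrative


def GetScaleTimeout(resource_count: Optional[int] = None) -> int:
  """Returns a timeout that grows with the number of nodes/pods being scaled."""
  if resource_count is None:
    resource_count = NUM_NODES.value  # fall back to the benchmark's flags
  return _BASE_TIMEOUT_SECONDS + _PER_RESOURCE_TIMEOUT_SECONDS * resource_count
```

This benchmark could then call `kubernetes_scale_benchmark.GetScaleTimeout(...)` with its own node/pod count instead of reusing the private helper.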
```python
def _WaitForScaledNodesDeletion(initial_node_count: int) -> bool:
  timeout = 20 * 60 + kubernetes_scale_benchmark._GetScaleTimeout()
```
This seems to take a while, huh? We should quiz Piotr (who's now in the chat) about this. As annoying as it would be, it's possible we're OK with waiting for hour(s).
```python
)
_ScaleDeploymentReplicas(0)
if _WaitForScaledNodesDeletion(initial_node_count):
  _ScaleDeploymentReplicas(NUM_NODES.value)
```
We also need to collect samples for the second deployment and for the scale down, so add more ParseStatusChanges calls.
It also looks like that function already takes start_time, great! Expand that function to ignore timestamps older than the start time, and add a unit test to that effect as well.
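Rough shape of both pieces, as a sketch only (the helper name and the event tuple format are made up for illustration; ParseStatusChanges may be structured differently):

```python
import unittest


def _FilterEventsBeforeStart(events, start_time):
  """Drops (timestamp, status) events that predate the phase's start time."""
  return [(ts, status) for ts, status in events if ts >= start_time]


class FilterEventsBeforeStartTest(unittest.TestCase):

  def testDropsEventsOlderThanStartTime(self):
    events = [(90.0, 'Ready'), (110.0, 'Ready')]
    self.assertEqual(
        _FilterEventsBeforeStart(events, start_time=100.0),
        [(110.0, 'Ready')])


if __name__ == '__main__':
  unittest.main()
```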
```python
# Do one scale up, scale down, then scale up again.
return []
_ScaleDeploymentReplicas(NUM_NODES.value)
samples = kubernetes_scale_benchmark.ParseStatusChanges(
```
Add some metadata to each set of samples (scale up 1, scale down, scale up 2) indicating which phase it was in, i.e. for these, add metadata like "phase": "scaleup1".
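For example (assuming these are sample.Sample objects whose metadata dicts can be updated in place; the exact mechanism is up to you):

```python
for s in samples:
  s.metadata['phase'] = 'scaleup1'  # 'scaledown' / 'scaleup2' for the other phases
```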
hubatish left a comment
It's coming along. Big file with a lot of code now.
```
@@ -0,0 +1,54 @@
{% if cloud == 'GCP' %}
apiVersion: cloud.google.com/v1
```
I thought you removed compute class. Am I looking at the most recent version?
```python
def _ScaleDeployment(replicas: int) -> None:
  """Scales the 'app' deployment to the given replica count."""
  container_service.RunKubectlCommand([
```
OK, defining the deployment up front and then just using `kubectl scale` to change the number of replicas: this looks good, and it's actually more standard and GKE-user-friendly than what we've been doing (modifying the YAML and reapplying the whole manifest).
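For anyone else reading the thread, the pattern being described is essentially just the following (the exact arguments in the PR may differ slightly):

```python
container_service.RunKubectlCommand([
    'scale', 'deployment', 'app', f'--replicas={replicas}'
])
```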
```python
    initial_nodes,
)
else:
  logging.warning(
```
Possibly we should just fail the benchmark. How often does this happen? Do you think it would be an intermittent problem or mostly one you encountered during testing?
```python
"""
stdout, _, _ = container_service.RunKubectlCommand(
    ['get', 'pods', '-l', 'app=app', '-o', 'json'],
    suppress_logging=True,
```
The log does look nicer with you suppressing the output here. Quite possibly we should do the same for kubernetes_scale_benchmark.ParseStatusChanges (or at least when the node/pod scale count is high enough). That's probably a follow-up change.
```python
    desired_replicas,
)
return samples
time.sleep(_POLL_INTERVAL_SECONDS)
```
Ah, I was looking for the time.sleep; all of this is in the while loop.
```python
  )
)
# Always emit a Ready count for a consistent time-series.
samples.append(
```
Does this result in duplicate Ready count samples?
```python
done = True
break
if elapsed >= timeout:
  logging.warning(
```
Again, we should just fail the benchmark if we time out. I don't want to have to search for "Timed out waiting for nodes..." to see whether my results are valid or not.
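For example, something like this in place of the warning (assuming PKB's errors module is importable here; the exact exception class is just a suggestion):

```python
if elapsed >= timeout:
  raise errors.Benchmarks.RunError(
      f'Timed out after {timeout}s waiting for scaled nodes to be deleted.')
```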
```python
return False


def _PollNodeDeletionUntilDone(
```
Both Poll functions are complex enough that unit tests mocking time.sleep and the kubernetes_commands operations would be helpful to verify their correctness.
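Rough shape of such a test, as a sketch only (the `scale_nodes_benchmark` module name and the kubectl output format are placeholders for whatever this file ends up exposing):

```python
import unittest
from unittest import mock

from perfkitbenchmarker import container_service
# Placeholder import for the benchmark module under review.
from perfkitbenchmarker.linux_benchmarks import scale_nodes_benchmark


class PollNodeDeletionTest(unittest.TestCase):

  @mock.patch('time.sleep', return_value=None)  # skip real waiting
  def testReturnsTrueOnceScaledNodesAreGone(self, _):
    # First poll still sees the scaled node, second poll only the initial one.
    fake_outputs = iter(['node/initial\nnode/scaled\n', 'node/initial\n'])
    with mock.patch.object(
        container_service, 'RunKubectlCommand',
        side_effect=lambda *args, **kwargs: (next(fake_outputs), '', 0)):
      self.assertTrue(
          scale_nodes_benchmark._WaitForScaledNodesDeletion(
              initial_node_count=1))


if __name__ == '__main__':
  unittest.main()
```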