
Add template and scaling logic #6414

Open

vofish wants to merge 4 commits into GoogleCloudPlatform:master from kiryl-filatau:azure-5k

Conversation


@vofish commented Jan 30, 2026

  • Add deployment template
  • Add basic scale up and down logic

kubernetes_scale_benchmark.Cleanup(bm_spec)
container_service.RunRetryableKubectlCommand(
    ['delete', 'deployment', 'app'],
    timeout=kubernetes_scale_benchmark._GetScaleTimeout(),

We could consider sharing the command, but we'd need both the timeout to be different (here you're sharing it, but that one scales based off pod count) and the name ('app' instead of 'scaleup'), so I'll leave it to you whether to refactor into a function taking two arguments (timeout and name) or just duplicate. A sketch of the refactor option is below.
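Something like this (the helper name `_DeleteDeployment` is just illustrative, and this assumes the timeout is computed by the caller):

```python
from perfkitbenchmarker import container_service


def _DeleteDeployment(name: str, timeout: int) -> None:
  """Deletes the named deployment, waiting up to `timeout` seconds."""
  container_service.RunRetryableKubectlCommand(
      ['delete', 'deployment', name],
      timeout=timeout,
  )
```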

samples = kubernetes_scale_benchmark.ParseStatusChanges(
    'node',
    start_time,
    resources_to_ignore=set(),

Should the initial node(s) be ignored?
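If so, one way to do it (a sketch; `_GetNodeNames` is a hypothetical helper, and the JSON shape is the standard `kubectl get nodes -o json` output):

```python
import json

from perfkitbenchmarker import container_service


def _GetNodeNames() -> set[str]:
  """Returns the names of all nodes currently in the cluster."""
  stdout, _, _ = container_service.RunKubectlCommand(
      ['get', 'nodes', '-o', 'json'])
  return {item['metadata']['name'] for item in json.loads(stdout)['items']}


# Snapshot the pre-existing node(s) before scaling, then exclude them so
# only newly created nodes are measured:
initial_nodes = _GetNodeNames()
...
samples = kubernetes_scale_benchmark.ParseStatusChanges(
    'node',
    start_time,
    resources_to_ignore=initial_nodes,
)
```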

])
kubernetes_commands.WaitForRollout(
    'deployment/app',
    timeout=kubernetes_scale_benchmark._GetScaleTimeout(),

We need to swap the timeout usage, since that one scales with kubernetes_scale_benchmark's flags. Possibly have that function take a variable like "number of nodes/pods" that defaults to the kubernetes_scale_benchmark flags. Also drop the leading underscore, since you're now using it outside its module. Otherwise, make a new function in this benchmark.
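Roughly this shape, as a sketch (the flag and constant names here are assumptions, not the real ones):

```python
# In kubernetes_scale_benchmark.py: made public so other benchmarks can
# call it, with an optional override for the resource count.
def GetScaleTimeout(num_resources: int | None = None) -> int:
  """Returns the scale timeout in seconds.

  Args:
    num_resources: Number of nodes/pods being waited on. Defaults to this
      benchmark's flags when not provided.
  """
  if num_resources is None:
    num_resources = _NUM_PODS.value  # assumed flag name
  # Assumed constants; the real scaling factors live in the benchmark.
  return _BASE_TIMEOUT + _SECONDS_PER_RESOURCE * num_resources
```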



def _WaitForScaledNodesDeletion(initial_node_count: int) -> bool:
  timeout = 20 * 60 + kubernetes_scale_benchmark._GetScaleTimeout()
This seems to take a while, huh? We should quiz Piotr (who's now in the chat) about this. As annoying as it would be, it's possible we're OK with waiting for hour(s).

)
_ScaleDeploymentReplicas(0)
if _WaitForScaledNodesDeletion(initial_node_count):
  _ScaleDeploymentReplicas(NUM_NODES.value)
We need to also collect samples for the second deployment, and for the scale down. Add more ParseStatusChanges calls.

It also looks like that function already takes start_time - great! Expand that function to ignore timestamps older than the start time. Add a unit test to that effect as well.
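The filter inside ParseStatusChanges could be as simple as this sketch (assuming each parsed status change carries a Unix timestamp):

```python
# Drop status transitions that predate the current phase, e.g. leftover
# transitions from the first scale-up when parsing the second.
status_changes = [
    change for change in status_changes
    if change.timestamp >= start_time
]
```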

# Do one scale up, scale down, then scale up again.
return []
_ScaleDeploymentReplicas(NUM_NODES.value)
samples = kubernetes_scale_benchmark.ParseStatusChanges(

Add some metadata to each set of samples (scale up 1, scale down, scale up 2) indicating which phase it was in, i.e. for these, add metadata like "phase": "scaleup1".
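Roughly like this sketch (assumes PKB samples carry a mutable metadata dict, reuses `initial_nodes` from the earlier sketch, and omits the waits between phases):

```python
import time

samples = []
# Do one scale up, scale down, then scale up again, tagging each phase.
for phase, replicas in (
    ('scaleup1', NUM_NODES.value),
    ('scaledown', 0),
    ('scaleup2', NUM_NODES.value),
):
  phase_start = time.time()
  _ScaleDeploymentReplicas(replicas)
  # ... wait for the phase to settle ...
  phase_samples = kubernetes_scale_benchmark.ParseStatusChanges(
      'node', phase_start, resources_to_ignore=initial_nodes)
  for s in phase_samples:
    s.metadata['phase'] = phase
  samples.extend(phase_samples)
```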

@hubatish left a comment

It's coming along. Big file with a lot of code now.

@@ -0,0 +1,54 @@
{% if cloud == 'GCP' %}
apiVersion: cloud.google.com/v1

I thought you removed compute class. Am I looking at the most recent version?


def _ScaleDeployment(replicas: int) -> None:
  """Scales the 'app' deployment to the given replica count."""
  container_service.RunKubectlCommand([

OK, defining the deployment once up front and then just using `scale` to change the number of replicas: this looks good, and is actually more standard and GKE-user-friendly than what we've been doing (modifying the YAML and reapplying the whole manifest).
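Presumably the call continues along these lines (a sketch of the standard `kubectl scale` invocation, not the PR's exact body):

```python
def _ScaleDeployment(replicas: int) -> None:
  """Scales the 'app' deployment to the given replica count."""
  container_service.RunKubectlCommand([
      'scale', 'deployment/app', f'--replicas={replicas}',
  ])
```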

    initial_nodes,
)
else:
  logging.warning(

Possibly we should just fail the benchmark. How often does this happen? Do you think it would be an intermittent problem or mostly one you encountered during testing?

"""
stdout, _, _ = container_service.RunKubectlCommand(
['get', 'pods', '-l', 'app=app', '-o', 'json'],
suppress_logging=True,

The log does look nicer with the output suppressed here. Quite possibly we should do the same for kubernetes_scale_benchmark.ParseStatusChanges (or do so when the node/pod scale count is high enough). That's probably a follow-up change.

desired_replicas,
)
return samples
time.sleep(_POLL_INTERVAL_SECONDS)

Ah, I was looking for the time.sleep; all of this is in the while loop.

)
)
# Always emit a Ready count for a consistent time-series.
samples.append(

Does this result in duplicate Ready count samples?

done = True
break
if elapsed >= timeout:
  logging.warning(

Again, we should just fail the benchmark if we time out. I don't want to have to search the logs for "Timed out waiting for nodes..." to know whether my results are valid.
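I.e., raise instead of warn; a sketch (assuming errors.Benchmarks.RunError is the right PKB error type here):

```python
from perfkitbenchmarker import errors

if elapsed >= timeout:
  raise errors.Benchmarks.RunError(
      f'Timed out after {timeout}s waiting for scaled nodes to be deleted.'
  )
```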

  return False


def _PollNodeDeletionUntilDone(

Both Poll functions are complex enough that unit tests mocking time.sleep & kubernetes_commands operations would be helpful to determine their correctness.
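A sketch of the kind of test that would help (the module handle `benchmark_module`, the fixture names, and the two-node/one-node JSON payloads are all assumptions):

```python
import unittest
from unittest import mock

from perfkitbenchmarker import container_service


class PollNodeDeletionTest(unittest.TestCase):

  @mock.patch('time.sleep')  # keep the poll loop fast in tests
  @mock.patch.object(container_service, 'RunKubectlCommand')
  def testReturnsTrueOnceScaledNodesAreGone(self, mock_kubectl, _):
    # First poll still sees a scaled node; second poll sees only the
    # initial node, so the wait should succeed without timing out.
    # RunKubectlCommand returns (stdout, stderr, retcode).
    mock_kubectl.side_effect = [
        (TWO_NODES_JSON, '', 0),  # hypothetical fixtures
        (ONE_NODE_JSON, '', 0),
    ]
    self.assertTrue(benchmark_module._WaitForScaledNodesDeletion(1))


if __name__ == '__main__':
  unittest.main()
```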

