
Add template and scaling logic #6414

Open

vofish wants to merge 4 commits into GoogleCloudPlatform:master from kiryl-filatau:azure-5k

Conversation


@vofish commented Jan 30, 2026

  • Add deployment template
  • Add basic scale up and down logic

kubernetes_scale_benchmark.Cleanup(bm_spec)
container_service.RunRetryableKubectlCommand(
    ['delete', 'deployment', 'app'],
    timeout=kubernetes_scale_benchmark._GetScaleTimeout(),

We could consider sharing the command, but we'd need both the timeout to be different (here you're sharing it, but that one scales based off pod count) and the name ('app' instead of 'scaleup'), so I'll leave it to you whether to refactor into a function taking two arguments (timeout and name) or just duplicate. A sketch of the refactor option is below.
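Something like this (the helper name `_DeleteDeployment` is just illustrative, and this assumes the timeout is computed by the caller):

```python
from perfkitbenchmarker import container_service


def _DeleteDeployment(name: str, timeout: int) -> None:
  """Deletes the named deployment, waiting up to `timeout` seconds."""
  container_service.RunRetryableKubectlCommand(
      ['delete', 'deployment', name],
      timeout=timeout,
  )
```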

samples = kubernetes_scale_benchmark.ParseStatusChanges(
    'node',
    start_time,
    resources_to_ignore=set(),

Should the initial node(s) be ignored?
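If so, one way to do it (a sketch; `_GetNodeNames` is a hypothetical helper, and the JSON shape is the standard `kubectl get nodes -o json` output):

```python
import json

from perfkitbenchmarker import container_service


def _GetNodeNames() -> set[str]:
  """Returns the names of all nodes currently in the cluster."""
  stdout, _, _ = container_service.RunKubectlCommand(
      ['get', 'nodes', '-o', 'json'])
  return {item['metadata']['name'] for item in json.loads(stdout)['items']}


# Snapshot the pre-existing node(s) before scaling, then exclude them so
# only newly created nodes are measured:
initial_nodes = _GetNodeNames()
...
samples = kubernetes_scale_benchmark.ParseStatusChanges(
    'node',
    start_time,
    resources_to_ignore=initial_nodes,
)
```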

])
kubernetes_commands.WaitForRollout(
    'deployment/app',
    timeout=kubernetes_scale_benchmark._GetScaleTimeout(),

We need to swap the timeout usage, since that one scales with kubernetes_scale_benchmark's flags. Possibly have that function take a variable like "number of nodes/pods" that defaults to the kubernetes_scale_benchmark flags. Also drop the leading underscore, since you're now using it outside its module. Otherwise, make a new function in this benchmark.
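Roughly this shape, as a sketch (the flag and constant names here are assumptions, not the real ones):

```python
# In kubernetes_scale_benchmark.py: made public so other benchmarks can
# call it, with an optional override for the resource count.
def GetScaleTimeout(num_resources: int | None = None) -> int:
  """Returns the scale timeout in seconds.

  Args:
    num_resources: Number of nodes/pods being waited on. Defaults to this
      benchmark's flags when not provided.
  """
  if num_resources is None:
    num_resources = _NUM_PODS.value  # assumed flag name
  # Assumed constants; the real scaling factors live in the benchmark.
  return _BASE_TIMEOUT + _SECONDS_PER_RESOURCE * num_resources
```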



def _WaitForScaledNodesDeletion(initial_node_count: int) -> bool:
  timeout = 20 * 60 + kubernetes_scale_benchmark._GetScaleTimeout()
This seems to take a while, huh? We should quiz Piotr (who's now in the chat) about this. As annoying as it would be, it's possible we're OK with waiting for hour(s).

)
_ScaleDeploymentReplicas(0)
if _WaitForScaledNodesDeletion(initial_node_count):
  _ScaleDeploymentReplicas(NUM_NODES.value)
We need to also collect samples for the second deployment, and for the scale down. Add more ParseStatusChanges calls.

It also looks like that function already takes start_time - great! Expand that function to ignore timestamps older than the start time. Add a unit test to that effect as well.
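The filter inside ParseStatusChanges could be as simple as this sketch (assuming each parsed status change carries a Unix timestamp):

```python
# Drop status transitions that predate the current phase, e.g. leftover
# transitions from the first scale-up when parsing the second.
status_changes = [
    change for change in status_changes
    if change.timestamp >= start_time
]
```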

# Do one scale up, scale down, then scale up again.
return []
_ScaleDeploymentReplicas(NUM_NODES.value)
samples = kubernetes_scale_benchmark.ParseStatusChanges(

Add some metadata to each set of samples (scale up 1, scale down, scale up 2) indicating which phase it was in, i.e. for these, add metadata like "phase": "scaleup1".
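Roughly like this sketch (assumes PKB samples carry a mutable metadata dict, reuses `initial_nodes` from the earlier sketch, and omits the waits between phases):

```python
import time

samples = []
# Do one scale up, scale down, then scale up again, tagging each phase.
for phase, replicas in (
    ('scaleup1', NUM_NODES.value),
    ('scaledown', 0),
    ('scaleup2', NUM_NODES.value),
):
  phase_start = time.time()
  _ScaleDeploymentReplicas(replicas)
  # ... wait for the phase to settle ...
  phase_samples = kubernetes_scale_benchmark.ParseStatusChanges(
      'node', phase_start, resources_to_ignore=initial_nodes)
  for s in phase_samples:
    s.metadata['phase'] = phase
  samples.extend(phase_samples)
```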

@hubatish left a comment

It's coming along. Big file with a lot of code now.

@@ -0,0 +1,54 @@
{% if cloud == 'GCP' %}
apiVersion: cloud.google.com/v1

I thought you removed compute class. Am I looking at the most recent version?


def _ScaleDeployment(replicas: int) -> None:
  """Scales the 'app' deployment to the given replica count."""
  container_service.RunKubectlCommand([

OK, defining the deployment once up front and then just using `scale` to change the number of replicas: this looks good, and is actually more standard and GKE-user-friendly than what we've been doing (modifying the YAML and reapplying the whole manifest).
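Presumably the call continues along these lines (a sketch of the standard `kubectl scale` invocation, not the PR's exact body):

```python
def _ScaleDeployment(replicas: int) -> None:
  """Scales the 'app' deployment to the given replica count."""
  container_service.RunKubectlCommand([
      'scale', 'deployment/app', f'--replicas={replicas}',
  ])
```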

    initial_nodes,
)
else:
  logging.warning(

Possibly we should just fail the benchmark. How often does this happen? Do you think it would be an intermittent problem or mostly one you encountered during testing?

"""
stdout, _, _ = container_service.RunKubectlCommand(
['get', 'pods', '-l', 'app=app', '-o', 'json'],
suppress_logging=True,

The log does look nicer with the output suppressed here. Quite possibly we should do the same for kubernetes_scale_benchmark.ParseStatusChanges (or do so when the node/pod scale count is high enough). That's probably a follow-up change.

desired_replicas,
)
return samples
time.sleep(_POLL_INTERVAL_SECONDS)

Ah, I was looking for the time.sleep; all of this is in the while loop.

)
)
# Always emit a Ready count for a consistent time-series.
samples.append(

Does this result in duplicate Ready count samples?

done = True
break
if elapsed >= timeout:
  logging.warning(

Again, we should just fail the benchmark if we time out. I don't want to have to search the logs for "Timed out waiting for nodes..." to know whether my results are valid.
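I.e., raise instead of warn; a sketch (assuming errors.Benchmarks.RunError is the right PKB error type here):

```python
from perfkitbenchmarker import errors

if elapsed >= timeout:
  raise errors.Benchmarks.RunError(
      f'Timed out after {timeout}s waiting for scaled nodes to be deleted.'
  )
```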

  return False


def _PollNodeDeletionUntilDone(

Both Poll functions are complex enough that unit tests mocking time.sleep & kubernetes_commands operations would be helpful to determine their correctness.
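A sketch of the kind of test that would help (the module handle `benchmark_module`, the fixture names, and the two-node/one-node JSON payloads are all assumptions):

```python
import unittest
from unittest import mock

from perfkitbenchmarker import container_service


class PollNodeDeletionTest(unittest.TestCase):

  @mock.patch('time.sleep')  # keep the poll loop fast in tests
  @mock.patch.object(container_service, 'RunKubectlCommand')
  def testReturnsTrueOnceScaledNodesAreGone(self, mock_kubectl, _):
    # First poll still sees a scaled node; second poll sees only the
    # initial node, so the wait should succeed without timing out.
    # RunKubectlCommand returns (stdout, stderr, retcode).
    mock_kubectl.side_effect = [
        (TWO_NODES_JSON, '', 0),  # hypothetical fixtures
        (ONE_NODE_JSON, '', 0),
    ]
    self.assertTrue(benchmark_module._WaitForScaledNodesDeletion(1))


if __name__ == '__main__':
  unittest.main()
```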

