@@ -0,0 +1,38 @@
apiVersion: karpenter.sh/v1
Collaborator

Is this file a duplicate of https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/blob/master/perfkitbenchmarker/data/container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2 ? It looks very similar.
Suggestions:

  1. We should have code reuse rather than duplication. As mentioned in my other comment, move the application of this nodepool yaml to the resource rather than benchmark specific yamls. This would probably then live in the data/container/karpenter folder.
  2. If not, at least use the same file.
  3. Worst case / in the future copies like this should show in git as a cp using https://stackoverflow.com/questions/1043388/record-file-copy-operation-with-git . This would also make it more obvious if anything has changed.
    But really I think I'm looking for option 1 here.
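For illustration, suggestion 2 could be as small as pointing the ApplyManifest call that appears later in kubernetes_scale_benchmark.py at the shared template. This is only a sketch: whether the two templates take the same parameters would need checking, and the kwargs below are assumed to match.

  # Hypothetical: reuse the kubernetes_ai_inference nodepool template instead
  # of carrying a near-identical copy under kubernetes_scale/. The parameter
  # names are assumed to be shared between the two templates.
  cluster.ApplyManifest(
      'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2',
      gpu_nodepool_name='gpu',
      gpu_taint_key='nvidia.com/gpu',
  )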

kind: NodePool
metadata:
  name: {{ gpu_nodepool_name | default('gpu') }}
spec:
  disruption:
    consolidateAfter: {{ gpu_consolidate_after | default('1m') }}
    consolidationPolicy: {{ gpu_consolidation_policy | default('WhenEmptyOrUnderutilized') }}
  limits:
    cpu: {{ gpu_nodepool_cpu_limit | default(1000) }}
  template:
    metadata:
      labels:
        pkb_nodepool: {{ gpu_nodepool_label | default('gpu') }}
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: {{ karpenter_ec2nodeclass_name | default('default') }}
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: {{ gpu_arch | default(['amd64']) }}
      - key: kubernetes.io/os
        operator: In
        values: {{ gpu_os | default(['linux']) }}
      - key: karpenter.sh/capacity-type
        operator: In
        values: {{ gpu_capacity_types | default(['on-demand']) }}
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: {{ gpu_instance_categories | default(['g']) }}
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: {{ gpu_instance_families | default(['g6','g6e']) }}
      taints:
      - key: {{ gpu_taint_key | default('nvidia.com/gpu') }}
        effect: NoSchedule
@@ -13,13 +13,24 @@ spec:
      labels:
        name: {{ Name }}
    spec:
      {%- if NvidiaGpuRequest and Cloud == 'aws' %}
      nodeSelector:

Zach recently changed how the nodeSelectors are applied. Take a look at #6291, and in particular, the (new) GetNodeSelectors() methods in the providers (e.g. within providers/aws/elastic_kubernetes_service.py).

That said, your nodeSelector here refers back to your aws-gpu-nodepool.yaml.j2 file, so I think the new GetNodeSelectors() method probably doesn't apply - keeping this in the yaml.j2 file seems like the right approach here.

No action required (though do take a quick look at #6291).

Collaborator Author

Thanks for the pointer, I checked #6291.
This does not apply in our case, so you are right, no changes are needed here. Thanks.

Collaborator

Hmm, I disagree here. A few reasons:

  • GetNodeSelectors & ModifyPodSpecPlacementYaml + the associated ConvertManifestToYamlDicts -> ModifyPodSpecPlacementYaml -> ApplyYaml flow is basically doing the same thing & the code you've written can be written in this manner. It's very inconsistent & makes for an odd exception for specifically Karpenter to use the old ApplyManifest path instead. To some extent this is just a "my new way vs your old way" & I suspect the old way made more sense at the time / you may have already gotten some of this code working via the old way before syncing & realizing the new way existed... but now that the new way does exist we shouldn't mix & match between the two.
  • Now how would this be implemented here & why could an exception make sense? Rich notes that the GetNodeSelectors function here refers to the aws-gpu-nodepool.yaml.j2 nodepool, which is only applied in this benchmark... hence putting this code in the resource would be tricky.
  • However, it doesn't actually look like this code is benchmark-specific; it should instead be resource-specific. See my comments about aws-gpu-nodepool.yaml.j2 looking the same as in kubernetes ai inference.

So all of this should instead just go into the EksKarpenterCluster class & its GetNodeSelectors + ModifyPodSpecPlacementYaml code.
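To make this concrete, a rough sketch of the resource-side version. The class and method names come from this thread, but the signatures, the data path, and the hook used to apply the nodepool are assumptions rather than the actual PKB code:

class EksKarpenterCluster:
  """Sketch of the resource-side placement hooks suggested above (assumed API).

  In the real code this would be the existing EKS Karpenter cluster resource;
  the method signatures below only mirror the calls visible in this PR.
  """

  GPU_NODEPOOL_MANIFEST = 'container/karpenter/aws-gpu-nodepool.yaml.j2'

  def _PostCreate(self):
    # Apply the GPU nodepool as part of the cluster resource, so every
    # benchmark on this cluster type gets it, not just kubernetes_scale.
    self.ApplyManifest(self.GPU_NODEPOOL_MANIFEST)

  def GetNodeSelectors(self):
    # Route GPU pods onto the Karpenter-managed nodepool applied above.
    return {'karpenter.sh/nodepool': 'gpu'}

  def ModifyPodSpecPlacementYaml(self, yaml_docs, name, machine_type):
    # Same call shape as used later in this PR (yaml_docs, name, machine_type):
    # inject the nodeSelector and GPU toleration here rather than special-casing
    # Cloud == 'aws' in the benchmark's pod template.
    del name, machine_type  # Not needed for Karpenter placement.
    for doc in yaml_docs:
      pod_spec = doc.get('spec', {}).get('template', {}).get('spec')
      if not pod_spec:
        continue
      pod_spec.setdefault('nodeSelector', {}).update(self.GetNodeSelectors())
      pod_spec.setdefault('tolerations', []).append({
          'key': 'nvidia.com/gpu',
          'operator': 'Exists',
          'effect': 'NoSchedule',
      })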

        karpenter.sh/nodepool: {{ GpuNodepoolName | default('gpu') }}
      {%- endif %}
      containers:
      - name: {{ Name }}
        image: {{ Image }}
        {%- if Command %}
        command: {{ Command }}
        {%- endif %}
        resources:
          requests:
            cpu: {{ CpuRequest }}
            memory: {{ MemoryRequest }}
            ephemeral-storage: {{ EphemeralStorageRequest }}
            {%- if NvidiaGpuRequest %}
            nvidia.com/gpu: {{ NvidiaGpuRequest }}
            {%- endif %}
          limits:
            cpu: {{ CpuRequest }}
            memory: {{ MemoryRequest }}
@@ -53,3 +64,8 @@ spec:
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: {{ PodTimeout }}
      {%- if NvidiaGpuRequest and Cloud == 'aws' %}

Similarly, that same PR introduced the ModifyPodSpecPlacementYaml method, which seems like it could apply here. But again, this block refers back to your nodepool yaml.j2 file, so I think you should probably leave it as is.

No action required.

Collaborator Author

Thanks for the pointer and the context.

      - key: {{ GpuTaintKey | default('nvidia.com/gpu') }}
        operator: Exists
        effect: NoSchedule
      {%- endif %}
55 changes: 47 additions & 8 deletions perfkitbenchmarker/linux_benchmarks/kubernetes_scale_benchmark.py
@@ -97,8 +97,7 @@ def Prepare(bm_spec: benchmark_spec.BenchmarkSpec):
def _GetRolloutCreationTime(rollout_name: str) -> int:
  """Returns the time when the rollout was created."""
  out, _, _ = container_service.RunRetryableKubectlCommand([
      'rollout',
      'history',
      'get',
      rollout_name,
      '-o',
      'jsonpath={.metadata.creationTimestamp}',
@@ -180,8 +179,32 @@ def ScaleUpPods(
  max_wait_time = _GetScaleTimeout()
  resource_timeout = max_wait_time + 60 * 5  # 5 minutes after waiting to avoid
  # pod delete events from polluting data collection.
  yaml_docs = cluster.ConvertManifestToYamlDicts(
      MANIFEST_TEMPLATE,

  # Ensure a GPU NodePool exists for EKS Karpenter before applying the workload.
  is_eks_karpenter_aws_gpu = (
      virtual_machine.GPU_COUNT.value
      and FLAGS.cloud.lower() == 'aws'
      and getattr(cluster, 'CLUSTER_TYPE', None) == 'Karpenter'
  )

  if is_eks_karpenter_aws_gpu:
    cluster.ApplyManifest(

NB: Because ScaleUpPods is called twice, you'll end up applying this nodepool manifest twice. Which is fine - the second time k8s will notice it's already present and do nothing.

But perhaps we should put this into the Prepare() function instead? (And if that works, maybe we should put the first invocation of ScaleUpPods there too... though if you want to do that, then a separate PR might be better.)
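A sketch of that Prepare() variant, reusing the same guard; reaching the cluster via bm_spec.container_cluster is an assumption here, not confirmed from this PR:

def Prepare(bm_spec: benchmark_spec.BenchmarkSpec):
  """Existing Prepare hook, extended to apply the GPU nodepool once up front."""
  cluster = bm_spec.container_cluster  # Assumed attribute name.
  if (
      virtual_machine.GPU_COUNT.value
      and FLAGS.cloud.lower() == 'aws'
      and getattr(cluster, 'CLUSTER_TYPE', None) == 'Karpenter'
  ):
    cluster.ApplyManifest(
        'container/kubernetes_scale/aws-gpu-nodepool.yaml.j2',
        gpu_nodepool_name='gpu',
        gpu_taint_key='nvidia.com/gpu',
        # ...plus the remaining gpu_* kwargs from the call below.
    )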

        'container/kubernetes_scale/aws-gpu-nodepool.yaml.j2',
        gpu_nodepool_name='gpu',
        gpu_nodepool_label='gpu',
        karpenter_ec2nodeclass_name='default',
        gpu_instance_categories=['g'],
        gpu_instance_families=['g6', 'g6e'],
        gpu_capacity_types=['on-demand'],
        gpu_arch=['amd64'],
        gpu_os=['linux'],
        gpu_taint_key='nvidia.com/gpu',
        gpu_consolidate_after='1m',
        gpu_consolidation_policy='WhenEmptyOrUnderutilized',
        gpu_nodepool_cpu_limit=1000,
    )

  manifest_kwargs = dict(
      Name='kubernetes-scaleup',
      Replicas=num_new_pods,
      CpuRequest=CPUS_PER_POD.value,
@@ -192,12 +215,28 @@
      EphemeralStorageRequest='10Mi',
      RolloutTimeout=max_wait_time,
      PodTimeout=resource_timeout,
      Cloud=FLAGS.cloud.lower(),
  )
  cluster.ModifyPodSpecPlacementYaml(
      yaml_docs,
      'kubernetes-scaleup',
      cluster.default_nodepool.machine_type,

  if is_eks_karpenter_aws_gpu:
    manifest_kwargs.update({
        'GpuNodepoolName': 'gpu',
        'GpuTaintKey': 'nvidia.com/gpu',
    })

  yaml_docs = cluster.ConvertManifestToYamlDicts(
      MANIFEST_TEMPLATE,
      **manifest_kwargs,
  )

  # Non-GPU path: keep existing placement behavior.
  if not is_eks_karpenter_aws_gpu:
    cluster.ModifyPodSpecPlacementYaml(
        yaml_docs,
        'kubernetes-scaleup',
        cluster.default_nodepool.machine_type,
    )

  resource_names = cluster.ApplyYaml(yaml_docs)

  assert resource_names