Kubernetes Job service #5113

javanlacerda · 2026-01-08T21:24:36Z

This PR introduces full support for scheduling and managing fuzzing tasks on Kubernetes clusters,
specifically targeting GKE. It implements a new KubernetesService to
handle batch job creation, supports Kata Containers for isolation, and includes robust testing
and configuration mechanisms.

Key Features:

- Kubernetes Service: A new backend for RemoteTaskInterface that schedules tasks as Kubernetes
  Jobs. It supports both standard and Kata Container runtimes, automatic Service Account
  creation with Workload Identity, and job limiting to prevent cluster overload.
- Traffic Shifting (RemoteTaskGate): A new gating mechanism that routes tasks between the
  legacy GCP Batch service and the new Kubernetes service based on configurable
  probabilities, allowing for a gradual, controlled migration.
- Feature Flags: A new dynamic configuration system backed by Datastore to control runtime
  behaviors like job concurrency limits.
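
The traffic-shifting idea can be sketched as below. This is only an illustration under stated assumptions, not the code in this PR: `choose_backend`, the backend labels, and the 10%/90% weights are hypothetical, and the real gate reads its probabilities from job_frequency.py.

```python
import random


def choose_backend(weights: dict[str, float], rng: random.Random) -> str:
  """Picks one backend name with probability proportional to its weight."""
  names = list(weights)
  return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical split: 10% of tasks go to Kubernetes, 90% stay on Batch.
split = {'kubernetes': 0.1, 'gcp_batch': 0.9}
rng = random.Random(0)  # Seeded only to make the sketch reproducible.
counts = {'kubernetes': 0, 'gcp_batch': 0}
for _ in range(10_000):
  counts[choose_backend(split, rng)] += 1
print(counts)
```

Raising the Kubernetes weight step by step would shift more tasks to the new backend without changing any caller.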

Detailed Changes by Module:

Kubernetes Integration (src/clusterfuzz/_internal/k8s/):
- service.py: Implemented KubernetesService for job lifecycle management (creation,
  monitoring, limiting). Includes GKE credential loading, Kata Container spec generation,
  and Service Account provisioning.
- Tests: Added k8s_service_test.py (unit), k8s_service_limit_test.py (limits), and
  k8s_service_e2e_test.py (integration on Kind).
Remote Task Management (src/clusterfuzz/_internal/remote_task/):
- __init__.py: Introduced RemoteTaskGate, a smart router that implements
  RemoteTaskInterface. It initializes both GcpBatchService and KubernetesService and
  distributes tasks between them based on probabilities defined in job_frequency.py. This
  enables traffic splitting (e.g., 10% to K8s, 90% to Batch) for safe rollout.
- job_frequency.py: Added logic to manage task scheduling frequency and split ratios.
- Refactored core task logic to use the generic RemoteTask and RemoteTaskInterface
  abstractions.
Datastore & Configuration (src/clusterfuzz/_internal/datastore/):
- data_types.py: Added FeatureFlag model to store configuration dynamically.
- feature_flags.py: Added FeatureFlags enum/helper for type-safe access to flags (e.g.,
  K8S_PENDING_JOBS_LIMITER).
Batch & Legacy Refactoring (src/clusterfuzz/_internal/batch/):
- service.py: Updated to align with the new RemoteTask interface.
- Removed obsolete gcp.py and google_cloud_utils/batch.py utilities in favor of the new
  structure.
Infrastructure & CI:
- .github/workflows/kubernetes-e2e-tests.yaml: New workflow for running E2E tests on a Kind
  cluster.
- Pipfile / src/Pipfile: Added kubernetes client and updated Google Cloud dependencies.
Bot & Metrics:
- src/python/bot/startup/run_bot.py: Updates to support K8s-based bot execution via the new
  gate.
- src/clusterfuzz/_internal/metrics/: Enhanced logging and monitoring for remote tasks.

Evidence:

Fuzzing hours recorded for both Batch and Kata Containers show that the RemoteTaskGate, the Batch service, and the Kubernetes service are working properly. Because the feature flag is used to set job_frequency, the same data shows that the feature flag and its usage work as well.

jonathanmetzman

left some very surface level comments.

Pipfile

butler.py

src/clusterfuzz/_internal/tests/core/platforms/__init__.py

src/python/bot/startup/run_bot.py

src/setup.py

This commit introduces the Kubernetes job client and service, providing a mechanism to schedule tasks on Kubernetes clusters (including GKE and Kind), supporting both standard and Kata Containers.

Key Features & Changes:
- **Kubernetes Service**: Implemented `KubernetesService` in `clusterfuzz._internal.k8s.service` to manage job creation.
- **Kata Support**: Added specialized job creation for Kata Containers (`create_kata_container_job`) with required security context (`privileged`, `capabilities: ALL`), networking (`hostNetwork: True`), and environment variables (`HOST_UID`).
- **Dependency Management**: Added `kubernetes` and necessary Google Cloud dependencies (`google-api-python-client`, `google-cloud-storage`, `google-cloud-ndb`, etc.) to `Pipfile`.
- **E2E Testing**:
  - Created `tests.core.k8s.k8s_service_e2e_test` to verify job lifecycle on a local Kind cluster.
  - Updated `local/tests/kubernetes_e2e_test.bash` to provision the test environment.
  - Updated CI workflow (`.github/workflows/kubernetes-e2e-tests.yaml`) to install JDK 21 (required for Datastore emulator).
  - Tests now verify job "Running" status to avoid timeouts with long-running commands.
  - `KubernetesService` skips default credential loading when `K8S_E2E` is set to utilize the test-provided kubeconfig.
- **Unit Tests**: Added comprehensive unit tests in `tests.core.k8s.k8s_service_test` and `tests.core.kubernetes.kubernetes_test`, including mocking of `load_kube_config` and `_load_gke_credentials` to ensure robust testing without external dependencies.

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

src/local/butler/format.py

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

Pipfile

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

jonathanmetzman · 2026-01-16T19:42:44Z

local/tests/kubernetes_e2e_test.bash

+
+pip install pipenv
+
+# Install dependencies.
+pipenv --python 3.11
+pipenv install

-class KubernetesJobClient(RemoteTaskInterface):
-  """A remote task execution client for Kubernetes.
-  
-  This class is a placeholder for a future implementation of a remote task
-  execution client that uses Kubernetes. It is not yet implemented.
-  """
+./local/install_deps.bash


Is this only intended to be used in CI?

jonathanmetzman · 2026-01-16T19:43:13Z

src/clusterfuzz/_internal/base/tasks/__init__.py


    # If we get here the task succeeded in running. Acknowledge the message.
-    self._pubsub_message.ack()
+    if not self.do_not_ack:


What is this change?

It's part of the job limiter for the Kubernetes service. We can probably use it to implement a job limiter for Batch as well, using the new feature they implemented for us. The rationale is that if a task cannot be scheduled on Kubernetes because the job limit has already been reached, the message should not be acked, which allows another adapter, such as Batch, to process it.
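
A minimal sketch of that flow, assuming a simple pending-job cap. Only `do_not_ack` and `_pubsub_message` come from the diff; `FakeMessage`, `TaskLease`, `schedule`, and the numbers are hypothetical stand-ins used for illustration.

```python
class FakeMessage:
  """Stand-in for a Pub/Sub message; records whether ack() was called."""

  def __init__(self):
    self.acked = False

  def ack(self):
    self.acked = True


class TaskLease:
  """Hypothetical wrapper that acks its message only when allowed to."""

  def __init__(self, message):
    self._pubsub_message = message
    self.do_not_ack = False

  def finish(self):
    # Skip the ack when the task was not actually scheduled, so the message
    # can be redelivered and picked up by another consumer.
    if not self.do_not_ack:
      self._pubsub_message.ack()


def schedule(lease, pending_jobs, limit):
  """Marks the lease as un-ackable when the pending-job limit is reached."""
  if pending_jobs >= limit:
    lease.do_not_ack = True
    return False
  return True


lease = TaskLease(FakeMessage())
schedule(lease, pending_jobs=100, limit=100)
lease.finish()
print(lease._pubsub_message.acked)  # False: the message stays unacked.
```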

jonathanmetzman · 2026-01-16T19:44:05Z

src/clusterfuzz/_internal/datastore/feature_flags.py

@@ -0,0 +1,70 @@
+# Copyright 2024 Google LLC


jonathanmetzman · 2026-01-16T19:44:43Z

src/clusterfuzz/_internal/k8s/__init__.py

@@ -0,0 +1,14 @@
+# Copyright 2025 Google LLC


jonathanmetzman · 2026-01-16T20:41:20Z

src/clusterfuzz/_internal/k8s/service.py

+      logs.info(f'Scheduling {remote_task.command}, {remote_task.job_type}.')
+      config = configs[(remote_task.command, remote_task.job_type)]
+      job_specs[config].append(remote_task.input_download_url)
+    logs.info('Creating batch jobs.')


Shouldn't this be k8s?

Yes. fixing

jonathanmetzman · 2026-01-16T20:41:26Z

src/clusterfuzz/_internal/k8s/service.py

+          namespace='default',
+          label_selector='app.kubernetes.io/name=clusterfuzz-kata-job',
+          field_selector='status.phase=Pending')
+      logs.info(f"Found {len(pods.items)} pending jobs.")


single quotes.

jonathanmetzman · 2026-01-16T20:46:12Z

src/clusterfuzz/_internal/k8s/service.py

+                     service_account_name: str) -> dict:
+  """Creates the body of a Kubernetes job."""
+
+  job_name = config.job_type.replace('_', '-') + '-' + str(uuid.uuid4()).split(


I think at OSS-Fuzz scale there is actually a high chance of a collision in a single day (>50%, https://en.wikipedia.org/wiki/Birthday_problem).
How bad is that? Why can't we use the full uuid?
If there's a length problem maybe we want to truncate the job name?

Also it seems like this is duplicated from create_kata_container_job. Let's try to share code.
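
To put rough numbers on the collision concern, here is a sketch using the standard birthday-problem approximation. It assumes the suffix is the first group of a uuid4 string (8 hex characters, 32 bits) and that names only collide when they share the same job-type prefix; the job counts are illustrative, not measured OSS-Fuzz figures.

```python
import math


def collision_probability(num_jobs: int, suffix_bits: int) -> float:
  """Approximates P(at least one duplicate) among num_jobs random suffixes."""
  space = 2**suffix_bits
  exponent = num_jobs * (num_jobs - 1) / (2.0 * space)
  # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
  return -math.expm1(-exponent)


# The first group of a UUID string is 8 hex characters, i.e. 32 bits.
for jobs in (10_000, 77_000, 200_000):
  print(jobs, round(collision_probability(jobs, 32), 3))

# A uuid4 carries 122 random bits, so the full value is effectively safe.
print(collision_probability(200_000, 122))
```

With a 32-bit suffix the probability crosses 50% at roughly 77,000 names sharing one prefix, which is why the full UUID (or a longer slice of it) is the safer choice.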

jonathanmetzman · 2026-01-16T20:46:19Z

src/clusterfuzz/_internal/k8s/service.py

+  """Creates the body of a Kubernetes job."""
+
+  job_name = config.job_type.replace('_', '-') + '-' + str(uuid.uuid4()).split(
+      '-', maxsplit=1)[0]


Why do maxsplit? It seems unnatural and unnecessary
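
One way to address both naming comments is a single shared helper, sketched below. `make_job_name` and `MAX_NAME_LENGTH` are hypothetical names, and the 63-character cap is an assumption about the DNS-label limit that applies to Kubernetes object names.

```python
import uuid

MAX_NAME_LENGTH = 63  # Assumed DNS-label limit for Kubernetes object names.


def make_job_name(job_type: str) -> str:
  """Builds a lowercase, hyphenated job name with a full random suffix."""
  suffix = uuid.uuid4().hex  # 32 hex characters, no hyphens.
  prefix = job_type.replace('_', '-').lower()
  # Truncate the prefix rather than the suffix so uniqueness is preserved.
  prefix = prefix[:MAX_NAME_LENGTH - len(suffix) - 1].rstrip('-')
  return f'{prefix}-{suffix}'


# split('-')[0] and split('-', maxsplit=1)[0] return the same first group,
# so maxsplit adds nothing here.
value = str(uuid.uuid4())
assert value.split('-')[0] == value.split('-', maxsplit=1)[0] == value[:8]
print(make_job_name('libfuzzer_asan_some_project'))
```

Both the standard and the Kata job builders could call the same helper, which removes the duplicated expression.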

jonathanmetzman · 2026-01-16T20:48:49Z

src/clusterfuzz/_internal/k8s/service.py

+        logs.error(f"Cluster {CLUSTER_NAME} not found in project {project}.")
+        print(f"DEBUG: Cluster {CLUSTER_NAME} not found in project {project}.")
+        return
+
+    except Exception as e:
+      logs.error(f"Failed to list clusters in {project}: {e}")


Single quotes.

jonathanmetzman · 2026-01-16T20:59:44Z

src/clusterfuzz/_internal/batch/service.py

 from clusterfuzz._internal.datastore import ndb_utils
+from clusterfuzz._internal.google_cloud_utils import credentials
 from clusterfuzz._internal.metrics import logs
+from clusterfuzz._internal.remote_task import types


This shadows Python's standard-library types module. Can you rename it to remote_task_types?

jonathanmetzman · 2026-01-18T18:27:26Z

src/clusterfuzz/_internal/k8s/service.py

+    separate batch job for each group. This allows tasks with similar
+    requirements to be processed together, which can improve efficiency.
+    """
+    if feature_flags.FeatureFlags.K8S_PENDING_JOBS_LIMITER.enabled and \


This condition is very hard to read. Can you break it up?
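
One possible shape for a more readable check is sketched below. The second half of the condition is cut off in the diff, so the pending-count comparison is an assumption, and `LimiterFlag` and `should_defer_scheduling` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LimiterFlag:
  """Hypothetical stand-in for a feature flag with an on/off bit and a value."""
  enabled: bool
  value: int


def should_defer_scheduling(flag: LimiterFlag, pending_jobs: int) -> bool:
  """Returns True when the limiter is on and the pending-job cap is reached."""
  limiter_enabled = flag.enabled
  limit_reached = pending_jobs >= flag.value
  return limiter_enabled and limit_reached


print(should_defer_scheduling(LimiterFlag(enabled=True, value=100), 120))
print(should_defer_scheduling(LimiterFlag(enabled=False, value=100), 120))
```

Naming each half of the condition removes the backslash continuation and lets the `if` read as a single intent.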

jonathanmetzman · 2026-01-18T18:28:01Z

src/clusterfuzz/_internal/k8s/service.py

+    jobs = []
+    logs.info('Batching utask_mains.')
+    for config, input_urls in job_specs.items():
+      # TODO(javanlacerda): Batch multiple tasks into a single job.


Is that actually needed for Kata?

jonathanmetzman · 2026-01-18T18:28:18Z

src/clusterfuzz/_internal/k8s/service.py

+
+  if not remote_tasks:
+    return {}
+  #TODO(javanlacerda): Create remote task config


nit: Space after #

.github/workflows/kubernetes-e2e-tests.yaml

jonathanmetzman · 2026-01-18T18:48:39Z

Pipfile

+Jinja2 = "==3.1.2"
+oauth2client = "==4.1.3"
+requests = "==2.21.0"
+PyYAML = "==6.0"


Why are we using a different version here and in src/Pipfile?

jonathanmetzman · 2026-01-18T18:49:18Z

Pipfile

+google-cloud-datastore = "==2.16.1"
+Jinja2 = "==3.1.2"
+oauth2client = "==4.1.3"
+requests = "==2.21.0"


Why are we adding such old versions of requests and httplib2?

jonathanmetzman · 2026-01-18T18:50:31Z

src/clusterfuzz/_internal/k8s/job_template.yaml

@@ -0,0 +1,61 @@
+# Copyright 2026 Google LLC


What's the difference between this and the next template?

It might have been a good idea to consider knative instead of rebuilding batch.

jonathanmetzman · 2026-01-18T18:53:26Z

src/clusterfuzz/_internal/k8s/job_template.yaml

+        emptyDir:
+          medium: Memory
+          sizeLimit: 1.9Gi
+      restartPolicy: Never


FYI I do restart for non fuzz jobs.

jonathanmetzman · 2026-01-18T18:53:56Z

src/clusterfuzz/_internal/k8s/kata_job_template.yaml

+        - name: UPDATE_WEB_TESTS
+          value: 'False'
+      restartPolicy: Never
+      volumes:


Why don't we mount dev/shm here?

jonathanmetzman · 2026-01-18T18:55:18Z

This is cool. I might have tried Cloud Run before Kata because: 1. It is probably less management. 2. It might be more performant because, as far as I know, it doesn't use nested virtualization.

jonathanmetzman · 2026-01-18T18:55:40Z

Are we using preemptibles btw?

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

ViniciustCosta · 2026-01-19T13:28:10Z

src/clusterfuzz/_internal/remote_task/remote_task_adapters.py

+
+from enum import Enum
+
+from clusterfuzz._internal.batch.service import GcpBatchService


We usually import the module, not the class, per: go/pystyle#imports

In this case, as all backends would expose a service module, I guess you could do from clusterfuzz...batch import service as batch_service.

ViniciustCosta · 2026-01-19T13:29:09Z

src/clusterfuzz/_internal/remote_task/remote_task_adapters.py

+from clusterfuzz._internal.k8s.service import KubernetesService
+
+
+class RemoteTaskAdapters(Enum):


ViniciustCosta · 2026-01-19T13:34:28Z

src/clusterfuzz/_internal/remote_task/types.py

+
+  @abc.abstractmethod
+  def create_utask_main_jobs(self, remote_tasks: list[RemoteTask]):
+    """Creates a many remote tasks for uworker main tasks."""


nit: "Creates many remote jobs ..."

ViniciustCosta · 2026-01-19T13:44:35Z

src/clusterfuzz/_internal/remote_task/__init__.py

-    and returns a representation of the created job.
+  def __init__(self):
+    # Avoiding circular import
+    from clusterfuzz._internal.remote_task import remote_task_adapters


What is the circular import avoided here? Is it the case where another module imports both this one and remote_task_adapters?

If that's the case, I think we can expect that other modules won't need to import remote_task_adapters, as it should be an abstraction used only by the remote task gate, so maybe we don't need this local import.

decoNR · 2026-01-19T13:22:44Z

src/clusterfuzz/_internal/tests/core/k8s/k8s_service_e2e_test.py

+      raise unittest.SkipTest('K8S_E2E environment variable not set.')
+
+    cls.mock_batch_config = mock.Mock()
+    cls.mock_batch_config.get.return_value = 'test-project'


It appears this is overwritten on line 114 by cls.mock_batch_config.get.side_effect = get_batch_config.
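
The behavior described here can be reproduced with `unittest.mock` alone. In this sketch the body of `get_batch_config` and the keys are hypothetical; only the names `get_batch_config` and `'test-project'` come from the test under review.

```python
from unittest import mock

config = mock.Mock()
config.get.return_value = 'test-project'


def get_batch_config(key):
  """Hypothetical lookup used only for this illustration."""
  return {'project': 'other-project'}.get(key)


# Once side_effect is a callable, its result is returned instead of
# return_value (unless the callable returns mock.DEFAULT).
config.get.side_effect = get_batch_config
print(config.get('project'))  # other-project
print(config.get('missing'))  # None, not 'test-project'
```

So the earlier `return_value` assignment has no effect once `side_effect` is set and can be dropped.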

decoNR · 2026-01-19T13:47:30Z

src/clusterfuzz/_internal/base/tasks/__init__.py


    # If we get here the task succeeded in running. Acknowledge the message.
-    self._pubsub_message.ack()
+    if not self.do_not_ack:


IMO readability would be improved by using ack instead of do_not_ack (go/tott/764).

decoNR · 2026-01-19T14:00:16Z

src/clusterfuzz/_internal/batch/service.py

-                                                 tasks.TASK_LEASE_SECONDS)
-
-
 WeightedSubconfig = collections.namedtuple('WeightedSubconfig',


nit: Maybe this can be moved to the top of the file, below BatchWorkloadSpec.

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda changed the title from "Pr/dependencies" to "Kubernetes Job service" Jan 8, 2026

javanlacerda force-pushed the pr/dependencies branch 5 times, most recently from 7edfb48 to b271b83 Compare January 10, 2026 22:46

javanlacerda marked this pull request as ready for review January 10, 2026 22:50

javanlacerda requested review from ViniciustCosta, decoNR, hunsche and jonathanmetzman and removed request for ViniciustCosta and hunsche January 10, 2026 22:50

jonathanmetzman reviewed Jan 12, 2026


javanlacerda force-pushed the pr/dependencies branch 4 times, most recently from d5684e5 to 44737d3 Compare January 15, 2026 00:44

javanlacerda added 10 commits January 15, 2026 09:37

pipenv lock

697cdd7

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

install kubernetes

3310027

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

Update dependencies and fix linting

7f12a07

move use_batch

b986275

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

add todo

5b7b4d5

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

Pr/metrics logging (#5115)

ca629cd

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

fixes

234f0ac

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

fix lint

6838442

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

mock gcloud auth default

eec7e6b

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda force-pushed the pr/dependencies branch 9 times, most recently from 78a7de2 to 0901b6d Compare January 16, 2026 13:00

create adapters enum

3692699

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda force-pushed the pr/dependencies branch from 0901b6d to 3692699 Compare January 16, 2026 13:23

ViniciustCosta reviewed Jan 16, 2026

src/local/butler/format.py

lint

ae2e936

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda force-pushed the pr/dependencies branch from b771b50 to ae2e936 Compare January 16, 2026 15:47

jonathanmetzman reviewed Jan 16, 2026

Pipfile (outdated)

set ttl for the job after completion

c108e35

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

jonathanmetzman reviewed Jan 16, 2026


jonathanmetzman reviewed Jan 18, 2026


revert Pipfile

092e27f

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda force-pushed the pr/dependencies branch from eb601c2 to 092e27f Compare January 18, 2026 19:21

fix JTW

9b3a503

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda force-pushed the pr/dependencies branch from 8adc287 to 9b3a503 Compare January 18, 2026 19:41

use actions/checkout@v4

941c7d1

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

javanlacerda force-pushed the pr/dependencies branch from 6040f7d to 941c7d1 Compare January 18, 2026 19:46

ViniciustCosta reviewed Jan 19, 2026


decoNR reviewed Jan 19, 2026


paintitblacktattoostudio1591-sudo mentioned this pull request Jan 19, 2026

Moss #5132

Open

fix single quotes usage

2114adf

Signed-off-by: Javan Lacerda <javanlacerda@google.com>

