Bulter.py: Adds preprocess command for local preprocess #5266

Open
IvanBM18 wants to merge 4 commits into master from feature/butler/local_preprocess

@IvanBM18 IvanBM18 commented May 5, 2026

Adds preprocess butler script

This command lets developers trigger the preprocess portion of a fuzz task and, as a consequence, generate the serialized and compressed uworker_input payload, upload it to real GCS, and get the signed download URL, exactly as happens remotely. We can then use the resulting URL to trigger a task in any backend we want:

  • In Swarming, through a pRPC request
  • In Batch, by manually posting the task to the utask_main queue
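
As a rough illustration of the "serialized and compressed" payload described above, the sketch below round-trips a payload through serialization and compression. It assumes JSON and zlib purely as stand-ins, since the description does not say which formats the real uworker_input uses, and the field names are hypothetical. The GCS upload and URL signing steps are left out.

```python
import json
import zlib


def pack_payload(payload: dict) -> bytes:
  """Serializes a payload to bytes and compresses it (JSON + zlib stand-ins)."""
  serialized = json.dumps(payload, sort_keys=True).encode('utf-8')
  return zlib.compress(serialized)


def unpack_payload(blob: bytes) -> dict:
  """Reverses pack_payload: decompresses, then deserializes."""
  return json.loads(zlib.decompress(blob).decode('utf-8'))


# Hypothetical fields, for illustration only.
example_input = {'job_type': 'example_job', 'fuzzer_name': 'example_fuzzer'}
blob = pack_payload(example_input)
```

Whatever backend later consumes the signed URL has to apply the inverse steps in the opposite order: decompress first, then deserialize.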

This accelerates local debugging of the tworker preprocessing phase without relying on remote execution queues, which have proven to take multiple hours to "ACK" a task request.

Note: To use this command you need the Secret Manager Secret Accessor role for Dev, or a service account key set up locally (using the gcloud auth CLI) that has said role and any other role required for a tworker's preprocess.

Changes

  • Added the preprocess subcommand.
    • It interacts with the actual Datastore and GCS based on the provided configuration.
    • It fetches and populates the uworker_env with:
      • Job-specific environment variables from the Datastore.
      • Fuzzer-specific environment variables (for blackbox fuzzers).
      • Required logging metadata (CF_TASK_NAME, CF_TASK_ARGUMENT, CF_TASK_JOB_NAME, CF_TASK_ID) to ensure logs in the subsequent uworker_main step have the correct context.
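
The environment assembly listed above can be sketched as follows. This is a minimal sketch, not the PR's code: build_uworker_env and its parameters are hypothetical names, and the example values are made up. Only the four CF_TASK_* keys and the order (job variables first, then fuzzer variables on top) come from this PR.

```python
def build_uworker_env(job_env, fuzzer_env, task_name, task_argument, job_name,
                      task_id):
  """Merges job and fuzzer variables, then adds the logging metadata keys."""
  env = dict(job_env)  # Copy so the caller's dict is not mutated.
  env.update(fuzzer_env)  # Fuzzer-specific values override job values.
  env.update({
      'CF_TASK_NAME': task_name,
      'CF_TASK_ARGUMENT': task_argument,
      'CF_TASK_JOB_NAME': job_name,
      'CF_TASK_ID': task_id,
  })
  return env


# Hypothetical values, for illustration only.
example_env = build_uworker_env(
    job_env={'APP_NAME': 'target', 'TIMEOUT': '10'},
    fuzzer_env={'TIMEOUT': '25'},
    task_name='fuzz',
    task_argument='example_fuzzer',
    job_name='example_job',
    task_id='task-123')
```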

Tests performed

Executed the following command in dev:

pipenv run python butler.py preprocess --fuzzer <fuzzer> --job <job> -c <config_dir>

It successfully creates and uploads the payload and returns a valid signed URL. This signed URL was later used to trigger a Swarming task through pRPC; here are the logs
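
For readers unfamiliar with how such a subcommand is declared, the command line used in the test above could be wired up with argparse roughly like this. It is a sketch under assumptions: only the preprocess name and the --fuzzer, --job, and -c flags come from this PR, while the --config-dir long form, the help strings, and the required settings are guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
  """Builds a parser with a single 'preprocess' subcommand."""
  parser = argparse.ArgumentParser(prog='butler.py')
  subparsers = parser.add_subparsers(dest='command', required=True)

  preprocess = subparsers.add_parser(
      'preprocess', help='Run only the preprocess step of a fuzz task.')
  preprocess.add_argument('--fuzzer', required=True, help='Fuzzer name.')
  preprocess.add_argument('--job', required=True, help='Job name.')
  # The long option name is an assumption; the PR only shows -c.
  preprocess.add_argument(
      '-c', '--config-dir', required=True, help='Path to the config dir.')
  return parser


args = build_parser().parse_args([
    'preprocess', '--fuzzer', 'example_fuzzer', '--job', 'example_job', '-c',
    '/path/to/config'
])
```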

@IvanBM18 IvanBM18 self-assigned this May 5, 2026
@IvanBM18 IvanBM18 requested a review from a team as a code owner May 5, 2026 04:01
@IvanBM18 IvanBM18 changed the title from "Bulter.py: Adds uworker_preprocess command for local preprocess" to "Bulter.py: Adds preprocess command for local preprocess" May 5, 2026
return uworker_env


def _early_setup(args):
Collaborator

nit: any reason not to name it just setup?

Collaborator Author

@IvanBM18 IvanBM18 May 7, 2026

Since most of this script is just env setup, I thought it would be easier to understand what makes this method different, or what its purpose is, if I called it early_setup.

Collaborator

I think setup by itself makes more sense and it is also more common to find throughout the codebase but I am also okay if we go through with it as is.

@jardondiego
Collaborator

Note: To use this command you need the Secret Manager Secret Accessor for Dev or setup a service account in your local that has those permissions.

Did you use a .json service account for this? It would be nice to have a few sentences/links to understand it better.

@jardondiego
Collaborator

Does it need to be a subcommand of butler itself? Why not make it a standalone script? I think it's a good idea to not over-populate butler with subcommands. What do you think?

uworker_env = _get_job_environment(args.job)
uworker_env.update(_get_fuzzer_environment(args.fuzzer, args.job))

# Replicate what process_command_impl does in a real tworker
Collaborator

Could we use process_command_impl() then instead?

Collaborator Author

@IvanBM18 IvanBM18 May 7, 2026

At the end of said method we call run_command():
https://github.com/google/clusterfuzz/blob/master/src/clusterfuzz/_internal/bot/tasks/commands.py#L482
That in turn triggers a workflow in which the preprocess step immediately queues the main task for remote execution when it finishes, or just executes all 3 steps on the same machine (depending on setup). We don't want that: we want to stop right after the preprocess finishes, so we can manually trigger the main portion wherever and whenever we need to.
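
The difference described above can be sketched with stand-in functions. Everything in this block is hypothetical: the three phase functions are placeholders and not ClusterFuzz APIs. They only show the control flow of chaining all steps versus stopping after the first one.

```python
def preprocess(task):
  """Placeholder: would build the payload, upload it, and return a signed URL."""
  return f'signed-url-for-{task}'


def uworker_main(signed_url):
  """Placeholder for the main portion, normally run on another machine."""
  return f'output-of-{signed_url}'


def postprocess(output):
  """Placeholder for the final step."""
  return f'done:{output}'


def run_full_task(task):
  """Chains all three steps, like the normal task flow."""
  return postprocess(uworker_main(preprocess(task)))


def run_preprocess_only(task):
  """Stops after preprocess and hands the signed URL back to the caller."""
  return preprocess(task)


signed_url = run_preprocess_only('fuzz')
```

With the URL in hand, the main portion can be triggered later on whichever backend is being debugged.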

@IvanBM18
Collaborator Author

IvanBM18 commented May 7, 2026

@jardondiego

Note: To use this command you need the Secret Manager Secret Accessor for Dev or setup a service account in your local that has those permissions.

Did you use a .json service account for this? It would be nice to have a few sentences/links to understand it better.

Not in this case, but it's possible to use a service account: you just need to generate a key, save it locally, and set it up as the default credentials for any gcloud library and CLI operation. This is done using the gcloud auth subcommand.

Added more context in the description so future reviewers can easily understand this.

@IvanBM18
Collaborator Author

IvanBM18 commented May 7, 2026

@jardondiego

Does it need to be a subcommand of butler itself? Why not make it a standalone script? I think it's a good idea to not over-populate butler with subcommands. What do you think?

Yes, it needs to, as butler already handles a lot of bootstrapping operations for the same purpose. For example, if we didn't use butler, we would need to add methods to read, parse, and populate configurations based on the YAML files.

@jardondiego
Collaborator

Yes, it needs to, as butler already handles a lot of bootstrapping operations for the same purpose. For example, if we didn't use butler, we would need to add methods to read, parse, and populate configurations based on the YAML files.

I think we are referring to different things. What I mean is that we could have it as a standalone butler script, so that we can run it as:

python butler.py run <name_of_script> --non-dry-run --config $MY_DIR

That way we don't have to handle all of that by ourselves. Does it make sense?
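
A skeleton of what such a standalone script might look like is sketched below. It assumes the convention that a script run through `butler.py run` exposes an execute(args) entry point and that the parsed arguments carry a non_dry_run attribute. Both should be checked against the existing scripts in the repository, and the body is a placeholder, not code from this PR.

```python
import types


def execute(args):
  """Entry point that `butler.py run` is assumed to call with parsed args."""
  if not getattr(args, 'non_dry_run', False):
    print('Dry run: skipping preprocess and upload.')
    return None
  # Placeholder: the real script would run preprocess and print the signed URL.
  print('Would run the preprocess step here.')
  return 'preprocess-requested'


# Local smoke check with a fake args object (hypothetical attribute name).
result = execute(types.SimpleNamespace(non_dry_run=True))
```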

@jardondiego
Collaborator

Added more context in the description so future reviewers can easily understand this

Thanks!
