Warning: Cubed Batch is currently just a proof-of-concept
The idea behind Cubed Batch is to have a dedicated number of long-lived processes running in parallel, with each process running a share of tasks for the overall Cubed computation.
This has less overhead than executors that start a new process per task, since a process is only started for each batch of tasks.
This style of execution is well-suited to batch runners, like Unix processes, GNU parallel, HPC job managers, Coiled Batch, etc.
The way it works is that your code defines a computation, then saves the plan as a file. The tasks in the plan are split into a fixed number of partitions, which are then processed in parallel.
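To make the partitioning idea concrete, here is a hypothetical sketch (not Cubed Batch's actual code) of how a fixed set of tasks could be divided among partitions:

```python
# Hypothetical sketch of splitting tasks across a fixed number of
# partitions; illustrative only, not taken from Cubed Batch's implementation.

def partition_tasks(num_tasks, num_partitions, partition_index):
    """Return the task indices that the given partition should run."""
    # Round-robin assignment: task i belongs to partition i % num_partitions.
    return [i for i in range(num_tasks) if i % num_partitions == partition_index]
```

Each long-lived process would then run only the tasks for its own partition index.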
The following is a small Python program to set up a Cubed computation and save its plan in a file. Crucially, the computation is not run (note the `compute=False` in `to_zarr`).
```shell
python -c "import cubed
import cubed.array_api as xp
from cubed_batch import save_batch_job
a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2))
b = a + 1
c = cubed.to_zarr(b, 'out.zarr', compute=False)
save_batch_job(c.plan(), c.spec, batch_job_path='cubed_batch_job1')
"
```

Next, we can use the cubed-batch CLI to turn the plan file into a series of commands to run the computation:
```shell
cubed-batch generate-script --runner sequential cubed_batch_job1 2
```

This says that we want to split the computation into two partitions (for two processes) and use a sequential runner - i.e. just run one command after another.

The output is:
```shell
cubed-batch run-partition cubed_batch_job1 create-arrays 1 0
cubed-batch run-partition cubed_batch_job1 op-004 2 0
cubed-batch run-partition cubed_batch_job1 op-004 2 1
```
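Reading the arguments off each line: the plan file, the operation name, the number of partitions for that operation, and the zero-based partition index. As a rough sketch of how such a script could be generated (illustrative names only, not cubed-batch's real code):

```python
# Illustrative sketch of generating a sequential run script from a list of
# (operation name, number of partitions) pairs. Not cubed-batch internals.

def generate_sequential_script(job_path, ops):
    lines = []
    for op_name, num_partitions in ops:
        # One command per partition of each operation, run in order.
        for index in range(num_partitions):
            lines.append(
                f"cubed-batch run-partition {job_path} {op_name} {num_partitions} {index}"
            )
    return "\n".join(lines)

print(generate_sequential_script("cubed_batch_job1", [("create-arrays", 1), ("op-004", 2)]))
```

Note that create-arrays has a single task, so it gets a single partition regardless of the partition count requested.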
It's convenient to pipe this into a script and then run it:
```shell
cubed-batch generate-script --runner sequential cubed_batch_job1 2 > script.sh
source script.sh
```

If this succeeds, then we can see the contents of the Zarr output:
```shell
python -c "import zarr; print(zarr.open('out.zarr')[:])"
```

which prints:

```
[[ 2  3  4]
 [ 5  6  7]
 [ 8  9 10]]
```
You can also use GNU parallel to run partitions in parallel.
```shell
cubed-batch generate-script --runner gnu-parallel plan.pickle 2
```

produces

```shell
parallel -j 1 cubed-batch run-partition plan.pickle create-arrays 1 {} ::: $(seq 0 0)
parallel -j 2 cubed-batch run-partition plan.pickle op-004 2 {} ::: $(seq 0 1)
```
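The structure mirrors the sequential case: each operation becomes a single `parallel` invocation, with the partition index fanned out via `{}` and `$(seq ...)`. A hypothetical sketch of emitting these lines (again illustrative, not cubed-batch internals):

```python
# Illustrative sketch of the gnu-parallel runner's output: one `parallel`
# invocation per operation, substituting the partition index for {}.
# Not cubed-batch's real implementation.

def generate_gnu_parallel_script(job_path, ops):
    lines = []
    for op_name, num_partitions in ops:
        lines.append(
            f"parallel -j {num_partitions} cubed-batch run-partition "
            f"{job_path} {op_name} {num_partitions} {{}} ::: $(seq 0 {num_partitions - 1})"
        )
    return "\n".join(lines)
```

Within an operation the partitions run concurrently, but operations still run one after another, preserving the dependency order of the plan.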
This is just a proof-of-concept, but here are some other things that could be added or improved:
- Try it out with an HPC job manager
- Try it out with Coiled Batch
- Try it out with Sky Pilot Pools
- Get it working with Icechunk (each partition could serialize its session to a file, then a finalize command could merge them)
- Do progress bars make sense for batch execution?
- Support retries
- Support backup tasks for stragglers
- Have a default, well-known place to store the plan (related to the compute directory perhaps?)
The inspiration for Cubed Batch is Jerome Kelleher's implementation of distributed computations in bio2zarr.