Warning: Cubed Batch is currently just a proof-of-concept
The idea behind Cubed Batch is to have a dedicated number of long-lived processes running in parallel, with each process running a share of tasks for the overall Cubed computation.
This has less overhead than executors that start a new process per task, since a process is only started for each batch of tasks.
This style of execution is well-suited to batch runners, like Unix processes, GNU parallel, HPC job managers, Coiled Batch, etc.
The way it works is that your code defines a computation, then saves the plan as a file. The tasks in the plan are split into a fixed number of partitions, which are then processed in parallel.
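To make the partitioning idea concrete, here is a hypothetical sketch (not Cubed Batch's actual code) of how a fixed set of tasks could be divided among partitions:

```python
# Hypothetical sketch of splitting tasks across a fixed number of
# partitions; illustrative only, not taken from Cubed Batch's implementation.

def partition_tasks(num_tasks, num_partitions, partition_index):
    """Return the task indices that the given partition should run."""
    # Round-robin assignment: task i belongs to partition i % num_partitions.
    return [i for i in range(num_tasks) if i % num_partitions == partition_index]
```

Each long-lived process would then run only the tasks for its own partition index.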
The following is a small Python program to set up a Cubed computation and save its plan in a file. Crucially, the computation is not run (note the `compute=False` in `to_zarr`).
```shell
python -c "import cubed
import cubed.array_api as xp
from cubed_batch import save_batch_job
a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2))
b = a + 1
c = cubed.to_zarr(b, 'out.zarr', compute=False)
save_batch_job(c.plan(), c.spec, batch_job_path='cubed_batch_job1')
"
```

Next, we can use the cubed-batch CLI to turn the plan file into a series of commands to run the computation:
```shell
cubed-batch generate-script --runner sequential cubed_batch_job1 2
```

This says that we want to split the computation into two partitions (for two processes) and use a sequential runner - i.e. just run one command after another.

The output is:
```shell
cubed-batch run-partition cubed_batch_job1 create-arrays 1 0
cubed-batch run-partition cubed_batch_job1 op-004 2 0
cubed-batch run-partition cubed_batch_job1 op-004 2 1
```
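Reading the arguments off each line: the plan file, the operation name, the number of partitions for that operation, and the zero-based partition index. As a rough sketch of how such a script could be generated (illustrative names only, not cubed-batch's real code):

```python
# Illustrative sketch of generating a sequential run script from a list of
# (operation name, number of partitions) pairs. Not cubed-batch internals.

def generate_sequential_script(job_path, ops):
    lines = []
    for op_name, num_partitions in ops:
        # One command per partition of each operation, run in order.
        for index in range(num_partitions):
            lines.append(
                f"cubed-batch run-partition {job_path} {op_name} {num_partitions} {index}"
            )
    return "\n".join(lines)

print(generate_sequential_script("cubed_batch_job1", [("create-arrays", 1), ("op-004", 2)]))
```

Note that create-arrays has a single task, so it gets a single partition regardless of the partition count requested.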
It's convenient to pipe this into a script and then run it:
```shell
cubed-batch generate-script --runner sequential cubed_batch_job1 2 > script.sh
source script.sh
```

If this succeeds, then we can see the contents of the Zarr output:
```shell
python -c "import zarr; print(zarr.open('out.zarr')[:])"
```

which prints:

```
[[ 2  3  4]
 [ 5  6  7]
 [ 8  9 10]]
```
You can also use GNU parallel to run partitions in parallel.
```shell
cubed-batch generate-script --runner gnu-parallel plan.pickle 2
```

produces

```shell
parallel -j 1 cubed-batch run-partition plan.pickle create-arrays 1 {} ::: $(seq 0 0)
parallel -j 2 cubed-batch run-partition plan.pickle op-004 2 {} ::: $(seq 0 1)
```
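The structure mirrors the sequential case: each operation becomes a single `parallel` invocation, with the partition index fanned out via `{}` and `$(seq ...)`. A hypothetical sketch of emitting these lines (again illustrative, not cubed-batch internals):

```python
# Illustrative sketch of the gnu-parallel runner's output: one `parallel`
# invocation per operation, substituting the partition index for {}.
# Not cubed-batch's real implementation.

def generate_gnu_parallel_script(job_path, ops):
    lines = []
    for op_name, num_partitions in ops:
        lines.append(
            f"parallel -j {num_partitions} cubed-batch run-partition "
            f"{job_path} {op_name} {num_partitions} {{}} ::: $(seq 0 {num_partitions - 1})"
        )
    return "\n".join(lines)
```

Within an operation the partitions run concurrently, but operations still run one after another, preserving the dependency order of the plan.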
This is just a proof-of-concept, but here are some other things that could be added or improved:
- Try it out with an HPC job manager
- Try it out with Coiled Batch
- Try it out with Sky Pilot Pools
- Get it working with Icechunk (each partition could serialize its session to a file, then a finalize command could merge them)
- Do progress bars make sense for batch execution?
- Support retries
- Support backup tasks for stragglers
- Have a default, well-known place to store the plan (related to the compute directory perhaps?)
The inspiration for Cubed Batch is Jerome Kelleher's implementation of distributed computations in bio2zarr.