cubed-dev/cubed-batch


Cubed Batch

Warning: Cubed Batch is currently just a proof-of-concept

The idea behind Cubed Batch is to have a dedicated number of long-lived processes running in parallel, with each process running a share of tasks for the overall Cubed computation.

This reduces overhead, since a new process is started for each batch of tasks rather than for each task.

This style of execution is well-suited to batch runners, like Unix processes, GNU parallel, HPC job managers, Coiled Batch, etc.

The way it works is that your code defines a computation, then saves the plan as a file. The tasks in the plan are split into a fixed number of partitions, which are then processed in parallel.
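The partitioning step can be pictured with a small sketch. This is purely illustrative - the round-robin scheme below is an assumption, and cubed_batch's actual partitioning logic may differ (e.g. contiguous blocks):

```python
def partition_tasks(tasks, num_partitions, partition):
    """Return one partition's share of the tasks (round-robin).

    Illustrative sketch only - not cubed_batch's actual code.
    """
    return tasks[partition::num_partitions]

# Six tasks split across two long-lived processes:
tasks = [f"task-{i}" for i in range(6)]
print(partition_tasks(tasks, 2, 0))  # ['task-0', 'task-2', 'task-4']
print(partition_tasks(tasks, 2, 1))  # ['task-1', 'task-3', 'task-5']
```

Because the partition count is fixed up front, each process can work out its share of the tasks independently, with no coordination needed at run time.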

Example

The following is a small Python program to set up a Cubed computation and save its plan in a file. Crucially, the computation is not run (note the compute=False in to_zarr).

python -c "import cubed
import cubed.array_api as xp
from cubed_batch import save_batch_job

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2))
b = a + 1
c = cubed.to_zarr(b, 'out.zarr', compute=False)

save_batch_job(c.plan(), c.spec, batch_job_path='cubed_batch_job1')
"

Next, we can use the cubed-batch CLI to turn the plan file into a series of commands to run the computation:

cubed-batch generate-script --runner sequential cubed_batch_job1 2

This says that we want to split the computation into two partitions (for two processes) and use a sequential runner - i.e. just run one command after another.

The output is

cubed-batch run-partition cubed_batch_job1 create-arrays 1 0
cubed-batch run-partition cubed_batch_job1 op-004 2 0
cubed-batch run-partition cubed_batch_job1 op-004 2 1
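Each line has the form cubed-batch run-partition &lt;job&gt; &lt;op&gt; &lt;num-partitions&gt; &lt;partition-index&gt;: the create-arrays step runs as a single partition, while the op-004 stage is split across the two requested partitions. Assuming the plan boils down to an ordered list of (op name, partition count) pairs - an assumption about the internals, not cubed_batch's real data model - the sequential script could be generated like this:

```python
def sequential_script(job, stages):
    """Emit one run-partition command per (stage, partition index) pair.

    `stages` is assumed to be an ordered list of (op_name, num_partitions)
    tuples; this is an illustrative sketch, not cubed_batch's code.
    """
    lines = []
    for op, n in stages:
        for i in range(n):
            lines.append(f"cubed-batch run-partition {job} {op} {n} {i}")
    return "\n".join(lines)

print(sequential_script("cubed_batch_job1", [("create-arrays", 1), ("op-004", 2)]))
```

Running this reproduces the three commands shown above. Note that stages are emitted in plan order, since each stage's output arrays feed the next.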

It's convenient to redirect the generate-script output into a script and then run it:

cubed-batch generate-script --runner sequential cubed_batch_job1 2 > script.sh
source script.sh

If this succeeds, then we can see the contents of the Zarr output:

python -c "import zarr; print(zarr.open('out.zarr')[:])"
[[ 2  3  4]
 [ 5  6  7]
 [ 8  9 10]]

GNU parallel

You can also use GNU parallel to run partitions in parallel.

cubed-batch generate-script --runner gnu-parallel cubed_batch_job1 2

produces

parallel -j 1 cubed-batch run-partition cubed_batch_job1 create-arrays 1 {} ::: $(seq 0 0)
parallel -j 2 cubed-batch run-partition cubed_batch_job1 op-004 2 {} ::: $(seq 0 1)
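Each stage becomes one parallel invocation: -j sets the number of concurrent jobs, and {} is replaced by each partition index supplied after :::. A sketch in the same spirit as before (again assuming (op name, partition count) stage pairs, with a hypothetical job path - illustrative only):

```python
def gnu_parallel_script(job, stages):
    """One `parallel` line per stage, running its partitions concurrently.

    Illustrative sketch only - `stages` is an assumed list of
    (op_name, num_partitions) tuples, not cubed_batch's real data model.
    """
    return "\n".join(
        f"parallel -j {n} cubed-batch run-partition {job} {op} {n} {{}}"
        f" ::: $(seq 0 {n - 1})"
        for op, n in stages
    )

print(gnu_parallel_script("jobdir", [("create-arrays", 1), ("op-004", 2)]))
```

Since parallel only returns once all of a stage's partitions have finished, running the emitted lines one after another still preserves the ordering between stages.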

Next steps

This is just a proof-of-concept, but here are some other things that could be added or improved:

  • Try it out with an HPC job manager
  • Try it out with Coiled Batch
  • Try it out with SkyPilot Pools
  • Get it working with Icechunk (each partition could serialize its session to a file, then a finalize command could merge them)
  • Do progress bars make sense for batch execution?
  • Support retries
  • Support backup tasks for stragglers
  • Have a default, well-known place to store the plan (related to the compute directory perhaps?)

Credits

The inspiration for Cubed Batch is Jerome Kelleher's implementation of distributed computations in bio2zarr.
