When dump() fails during result serialization in backend_write_file, the process crashes without writing an error into the HDF5 file. This leaves the file in the _r.h5 state permanently — get_future_from_cache then raises FileNotFoundError (since it only looks for _i.h5 or _o.h5), making the failure undetectable through the normal API.
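For illustration, the lookup failure can be sketched as follows. This is a hypothetical simplification based only on the behavior described above, not executorlib's actual get_future_from_cache code; the function name resolve_cache_file is an assumption:

import os

def resolve_cache_file(file_name_in: str) -> str:
    # Hypothetical sketch: the cache only recognizes the pending (_i.h5)
    # and finished (_o.h5) states. A file stuck at _r.h5 matches neither,
    # so the lookup raises FileNotFoundError even though the task ran.
    base = os.path.splitext(file_name_in)[0][:-2]
    for suffix in ("_o.h5", "_i.h5"):
        if os.path.exists(base + suffix):
            return base + suffix
    raise FileNotFoundError(base + "_o.h5")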
Reproduction:
Submit a task whose return value pickles to more than the np.void scalar size limit (~2 GB), so that np.void(cloudpickle.dumps(data_value)) fails. The process crashes with:
File ".../executorlib/task_scheduler/file/backend.py", line 49, in backend_write_file
dump(
file_name=file_name_out + "_r.h5",
data_dict={"output": output["result"], "runtime": runtime},
)
File ".../executorlib/standalone/hdf.py", line 39, in dump
data=np.void(cloudpickle.dumps(data_value)),
TypeError: byte-like to large to store inside array.
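The numpy failure can be reproduced without executorlib. This sketch allocates over 2 GB of memory and assumes a numpy build where the void scalar itemsize is a C int (the common case):

import cloudpickle
import numpy as np

payload = bytes(2**31)  # 2 GiB of zeros; pickling adds a few bytes of overhead
np.void(cloudpickle.dumps(payload))  # raises the TypeError quoted above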
Current behavior:
backend_write_file renames _i.h5 → _r.h5, then calls dump(), which raises. The exception propagates out, and the final os.rename(_r.h5 → _o.h5) never executes. The file remains as _r.h5 with neither an "output" nor an "error" key, a state that get_future_from_cache cannot interpret.
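Reconstructed from the traceback and the suggested fix below, the current flow is roughly (a sketch, not the verbatim source):

def backend_write_file(file_name: str, output: Any, runtime: float) -> None:
    file_name_out = os.path.splitext(file_name)[0][:-2]
    os.rename(file_name, file_name_out + "_r.h5")  # _i.h5 -> _r.h5
    dump(
        file_name=file_name_out + "_r.h5",
        data_dict={"output": output["result"], "runtime": runtime},
    )  # raises TypeError for oversized payloads; the existing "error" branch is omitted for brevity
    os.rename(file_name_out + "_r.h5", file_name_out + "_o.h5")  # never reached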
Expected behavior:
If dump() fails when writing the result, backend_write_file should catch the exception, write it as an "error" key into the _r.h5 file, and then rename to _o.h5. This way get_future_from_cache can detect it as a failed task via the normal "error" path.
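On the consumer side, the stored exception would then travel through the existing error path. A hypothetical sketch of that path (the load call and the Future wiring are assumptions, not executorlib's verbatim code):

data = load(file_name=file_name_out + "_o.h5")
if "error" in data:
    future.set_exception(data["error"])  # failed task: re-raised on future.result()
else:
    future.set_result(data["output"])    # successful task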
Suggested fix:
In backend_write_file (executorlib/task_scheduler/file/backend.py):
def backend_write_file(file_name: str, output: Any, runtime: float) -> None:
    file_name_out = os.path.splitext(file_name)[0][:-2]
    os.rename(file_name, file_name_out + "_r.h5")
    try:
        if "result" in output:
            dump(
                file_name=file_name_out + "_r.h5",
                data_dict={"output": output["result"], "runtime": runtime},
            )
        else:
            dump(
                file_name=file_name_out + "_r.h5",
                data_dict={"error": output["error"], "runtime": runtime},
            )
    except Exception as serialize_error:
        # Serialization failed; store the exception so the job is not stuck
        dump(
            file_name=file_name_out + "_r.h5",
            data_dict={"error": serialize_error, "runtime": runtime},
        )
    os.rename(file_name_out + "_r.h5", file_name_out + "_o.h5")
This ensures that any serialization failure (including the np.void size limit) is surfaced through the existing error-handling path rather than leaving the job in a limbo state.
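A regression test could simulate the serialization failure by patching dump to raise on the first call. A sketch, assuming the fix above is applied and that a load counterpart to dump exists in executorlib.standalone.hdf (an assumption; only dump is confirmed by the traceback):

import os
import tempfile
from unittest import mock

from executorlib.task_scheduler.file.backend import backend_write_file
from executorlib.standalone.hdf import dump, load

def test_serialization_failure_writes_error():
    with tempfile.TemporaryDirectory() as tmp:
        file_name = os.path.join(tmp, "task_i.h5")
        dump(file_name=file_name, data_dict={})  # create the input file
        calls = {"n": 0}

        def failing_dump(*args, **kwargs):
            calls["n"] += 1
            if calls["n"] == 1:  # fail only on the result write
                raise TypeError("byte-like to large to store inside array.")
            return dump(*args, **kwargs)

        with mock.patch(
            "executorlib.task_scheduler.file.backend.dump", side_effect=failing_dump
        ):
            backend_write_file(file_name=file_name, output={"result": 42}, runtime=0.1)

        # the error must land in the finished (_o.h5) file
        data = load(file_name=os.path.join(tmp, "task_o.h5"))
        assert isinstance(data["error"], TypeError)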
Environment:
- executorlib version: 1.9.2
- Python 3.13
- Triggered by a 10,000-atom melt-quench simulation returning more than 2 GB of pickled data