Dear DeepSomatic team,
I'm currently facing issues running DeepSomatic on an HPC cluster.
I'm using the latest DeepSomatic (v1.10) via a Singularity image built from your Docker image, on nodes with 16 CPUs and 512 GB of RAM.
The command I'm using:
rule DeepSomatic:
    input:
        somatic_input
    output:
        data + "/deepvariant/{TUMORS}_postprocess.vcf.gz"
    params:
        genome=config['GENOME']
    resources:
        partition="big"
    shell:
        """
        singularity run -B /usr/lib/locale/:/usr/lib/locale/,/data/:/data/,/work/:/work/ /work/tchenegros/deepsomatic.sif run_deepsomatic \
            --model_type=WES \
            --ref={params.genome} \
            --reads_tumor={input[0]} \
            --reads_normal={input[1]} \
            --output_vcf={output} \
            --num_shards=1 \
            --logging_dir={data}/deepvariant/logs \
            --intermediate_results_dir={data}/deepvariant/intermediate \
            --vcf_stats_report=true \
            --postprocess_variants_extra_args="num_partitions=32"
        """
My problem seems to be that the make_examples_somatic step produces corrupted output: I get an error at the start of the call_variants step.
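As a first sanity check on the intermediate shards, here is a small sketch I can run to verify that each make_examples_somatic output file at least decompresses cleanly (check_shards and the glob pattern are just my own helper code for illustration, not part of DeepSomatic; the path should point at the --intermediate_results_dir used above):

import glob
import gzip

def check_shards(pattern):
    """Map each shard path to True if it decompresses cleanly, else False."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with gzip.open(path, "rb") as f:
                while f.read(1 << 20):  # stream in 1 MiB chunks
                    pass
            results[path] = True
        except (OSError, EOFError):  # BadGzipFile is a subclass of OSError
            results[path] = False
    return results

# Example, using the paths from the logs below:
# check_shards("/data/tchenegros/projects/exomes_macrogen/deepvariant/intermediate/make_examples_somatic.tfrecord-*.gz")

A shard that fails this check is already broken at the gzip level; a shard that passes could still have a corrupted record inside.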
I first tried with num_shards=16 and got this type of error for every sample in my cohort:
I0227 09:50:26.194697 140665094316032 dv_utils.py:333] From /data/tchenegros/projects/exomes_macrogen/deepvariant/intermediate/make_examples_somatic.tfrecord-00000-of-00016.gz.example_info.json: Shape of input examples: [200, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0227 09:50:32.806108 140665094316032 dv_utils.py:333] From /opt/models/deepsomatic/wes/model.example_info.json: Shape of input examples: [200, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0227 09:50:32.806345 140665094316032 call_variants.py:887] example_shape: [200, 221, 7]
I0227 09:50:32.998093 140665094316032 call_variants.py:953] Total 1 writing processes started.
2026-02-27 09:50:33.072077: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
I0227 09:50:49.598364 140665094316032 call_variants.py:1031] Predicted 1024 examples in 1 batches [1.621 sec per 100].
2026-02-27 09:51:00.778046: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: DATA_LOSS: corrupted record at 77137743
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
2026-02-27 09:51:11.505378: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: DATA_LOSS: corrupted record at 77137743
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
2026-02-27 09:51:11.505964: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: DATA_LOSS: corrupted record at 83952381
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
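Since the DATA_LOSS messages report offsets ("corrupted record at 77137743"), here is a sketch I can use to walk the TFRecord framing of one shard without TensorFlow and see where the structure first breaks (scan_tfrecord_gz is my own helper; it checks only the record framing and does not verify the CRC32C checksums):

import gzip
import struct

def scan_tfrecord_gz(path):
    """Return (records_read, fail_offset); fail_offset is None when intact.

    Offsets are in the decompressed stream. TFRecord layout per record:
      uint64 length | uint32 masked_crc(length) | data | uint32 masked_crc(data)
    A shard corrupted at the gzip level raises OSError/EOFError instead.
    """
    n = 0
    with gzip.open(path, "rb") as f:
        while True:
            pos = f.tell()              # decompressed-stream offset
            header = f.read(12)         # length field + its CRC
            if not header:
                return n, None          # clean end of file
            if len(header) < 12:
                return n, pos           # truncated header
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)   # data + data CRC
            if len(body) < length + 4:
                return n, pos           # truncated record
            n += 1

If the reported fail offset matches the one in the DATA_LOSS line, the shard really is truncated or mangled on disk rather than misread by call_variants.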
I then reduced num_shards to 1, as in the command shown above, and got a different message:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0303 06:29:03.295802 140208193400832 mirrored_strategy.py:423] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
2026-03-03 06:29:03.297399: F external/local_tsl/tsl/platform/env.cc:391] Check failed: -1 != path_length (-1 vs. -1)
Fatal Python error: Aborted
Current thread 0x00007f84c38ff000 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py", line 1267 in var_handle_op
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 169 in _variable_handle_from_shape_and_dtype
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 241 in eager_safe_variable_handle
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 2028 in _init_from_args
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 1829 in __init__
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/variables.py", line 200 in __call__
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 150 in error_handler
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cross_device_utils.py", line 290 in __init__
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 1102 in __init__
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 402 in _make_collective_ops_with_fallbacks
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 368 in _initialize_strategy
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 342 in __init__
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 286 in __init__
File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 782 in call_variants
File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1092 in main
File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/absl_py/absl/app.py", line 258 in _run_main
File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/absl_py/absl/app.py", line 312 in run
File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1116 in <module>
I didn't see any errors in the make_examples_somatic step itself.
I can provide the full logs on request.
Thank you in advance for your help with this.
Best regards,
Thomas