Corrupted make_examples_somatic output #85

@TChenegros

Dear DeepSomatic team,

I'm having trouble running DeepSomatic on an HPC cluster.
I'm using the latest DeepSomatic (v1.10) through a Singularity image built from your Docker image, on nodes with 16 CPUs and 512 GB of RAM.

The command I'm using:

rule DeepSomatic:   
   input:   
       somatic_input   
   output:   
       data+"/deepvariant/{TUMORS}_postprocess.vcf.gz"
   params:
       genome=config['GENOME']
   resources:
       partition = "big"
   shell:
       """
       singularity run -B /usr/lib/locale/:/usr/lib/locale/,/data/:/data/,/work/:/work/ /work/tchenegros/deepsomatic.sif run_deepsomatic \
       --model_type=WES \
       --ref={params.genome} \
       --reads_tumor={input[0]} \
       --reads_normal={input[1]} \
       --output_vcf={output} \
       --num_shards=1 \
       --logging_dir={data}/deepvariant/logs \
       --intermediate_results_dir={data}/deepvariant/intermediate \
       --vcf_stats_report=true \
       --postprocess_variants_extra_args="num_partitions=32"
       """

The problem seems to be that the make_examples_somatic step produces corrupted output:
I get an error at the start of the call_variants step.
I first tried with num_shards=16 and got this kind of error for every sample in my cohort:

I0227 09:50:26.194697 140665094316032 dv_utils.py:333] From /data/tchenegros/projects/exomes_macrogen/deepvariant/intermediate/make_examples_somatic.tfrecord-00000-of-00016.gz.example_info.json: Shape of input examples: [200, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0227 09:50:32.806108 140665094316032 dv_utils.py:333] From /opt/models/deepsomatic/wes/model.example_info.json: Shape of input examples: [200, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0227 09:50:32.806345 140665094316032 call_variants.py:887] example_shape: [200, 221, 7]
I0227 09:50:32.998093 140665094316032 call_variants.py:953] Total 1 writing processes started.
2026-02-27 09:50:33.072077: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
I0227 09:50:49.598364 140665094316032 call_variants.py:1031] Predicted 1024 examples in 1 batches [1.621 sec per 100].
2026-02-27 09:51:00.778046: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: DATA_LOSS: corrupted record at 77137743
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2026-02-27 09:51:11.505378: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: DATA_LOSS: corrupted record at 77137743
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2026-02-27 09:51:11.505964: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: DATA_LOSS: corrupted record at 83952381
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
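
To narrow down which shards are affected, a check along these lines could be run on the intermediate files. This is a minimal sketch, not part of DeepSomatic. It assumes the standard TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) and only verifies that the gzip stream decompresses and that the record boundaries line up. It does not validate the CRC values. The function name `scan_tfrecord_gz` and the `max_record_bytes` limit are illustrative choices.

```python
import gzip
import struct
import zlib


def scan_tfrecord_gz(path, max_record_bytes=1 << 31):
    """Walk a gzipped TFRecord file and report whether its framing is intact.

    Returns (number_of_complete_records, error_message_or_None).
    CRC fields are skipped, not verified (crc32c is not in the stdlib).
    """
    n = 0
    try:
        with gzip.open(path, "rb") as fh:
            while True:
                header = fh.read(12)  # uint64 length + uint32 length CRC
                if not header:
                    return n, None  # clean end of file
                if len(header) < 12:
                    return n, "truncated record header"
                (length,) = struct.unpack("<Q", header[:8])
                if length > max_record_bytes:
                    return n, "implausible record length"
                body = fh.read(length + 4)  # payload + uint32 payload CRC
                if len(body) < length + 4:
                    return n, "truncated record payload"
                n += 1
    except (EOFError, OSError, zlib.error) as exc:
        # Raised when the gzip stream itself is cut short or damaged.
        return n, f"gzip stream error: {exc}"
```

Calling `scan_tfrecord_gz` on each `make_examples_somatic.tfrecord-*-of-00016.gz` file in the intermediate directory and printing the returned pair would show whether a shard ends mid-record or mid-stream, which would suggest an interrupted write rather than a reader-side problem.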

I then reduced num_shards to 1, as in the command above, and got a different error:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0303 06:29:03.295802 140208193400832 mirrored_strategy.py:423] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
2026-03-03 06:29:03.297399: F external/local_tsl/tsl/platform/env.cc:391] Check failed: -1 != path_length (-1 vs. -1)
Fatal Python error: Aborted

Current thread 0x00007f84c38ff000 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py", line 1267 in var_handle_op
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 169 in _variable_handle_from_shape_and_dtype
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 241 in eager_safe_variable_handle
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 2028 in _init_from_args
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 1829 in __init__
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/variables.py", line 200 in __call__
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 150 in error_handler
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cross_device_utils.py", line 290 in __init__
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 1102 in __init__
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 402 in _make_collective_ops_with_fallbacks
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 368 in _initialize_strategy
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 342 in __init__
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 286 in __init__
  File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 782 in call_variants
  File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1092 in main
  File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/absl_py/absl/app.py", line 258 in _run_main
  File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/absl_py/absl/app.py", line 312 in run
  File "/data/tchenegros/projects/exomes_macrogen/Bazel.runfiles_ph5ry_90/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 1116 in <module>
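
The `Check failed: -1 != path_length` abort occurs while the MirroredStrategy is being set up, before any examples are read, so it may be unrelated to the shard contents. One hypothesis, not confirmed against the TensorFlow source, is that the check follows a failed `readlink` on `/proc/self/exe`, for example if `/proc` is restricted inside the container. A minimal sketch for testing that hypothesis from inside the same Singularity image follows. The helper name `try_readlink` is illustrative.

```python
import os


def try_readlink(path="/proc/self/exe"):
    """Return (True, target) if the symlink resolves, else (False, error text)."""
    try:
        return True, os.readlink(path)
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    ok, detail = try_readlink()
    print("readlink ok:" if ok else "readlink failed:", detail)
```

If the link resolves inside the container, this hypothesis can be ruled out and the cause lies elsewhere.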

I didn't see any errors in the make_examples_somatic step.
I can provide the full logs on request.

Thank you in advance for your help with this.

Best regards,

Thomas
