Hi, thanks for sharing this interesting work and making the code available. I'm trying to reproduce the results reported in the paper, but I've encountered several issues that make it difficult to achieve the claimed performance. I'd appreciate your clarification on the following points:
The paper states:
"All the experiments are implemented with the PyTorch platform and trained/tested on 4 NVIDIA A100 GPUs."*
However, the current codebase does not appear to fully support multi-GPU training:
- The TODO list includes an unchecked item, "Fix bugs in Multi-GPU parallel", suggesting known issues in distributed training.
- The training script (`train.py`) relies on `CUDA_VISIBLE_DEVICES` and single-process execution; it does not use `torch.distributed` or `DistributedDataParallel` (DDP). This limits training to a single GPU or the inefficient `DataParallel` mode.
- There is no use of `local_rank`, `DistributedSampler`, or proper process group initialization.
Could you clarify:
- Were the reported results indeed obtained using 4 A100 GPUs in a distributed setting?
- If so, was a different (internal) version of the code used? If yes, could you release that fixed version, or provide guidance on how to enable multi-GPU training properly?
Without a working multi-GPU setup, it's challenging to train at the scale described in the paper, especially for 3D medical data.
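
For reference, below is the kind of minimal DDP skeleton I would have expected `train.py` to contain for the 4-GPU setup described in the paper. This is only a sketch against a dummy model and dataset (nothing here is taken from this repo), launched with something like `torchrun --nproc_per_node=4 train_ddp.py`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Dummy model and dataset; stand-ins for the repo's network and 3D data pipeline
    model = nn.Linear(16, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
    sampler = DistributedSampler(dataset)  # shards the dataset across processes
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # re-shuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the internal version already does something along these lines, even a short note in the README about the intended launch command would make reproduction much easier.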