Single-machine multi-GPU
{
"Python":"3.8.10",
"torch":"1.8.1",
"torchvision":"0.9.1",
"dali": "1.2.0",
"CUDA":"11.1",
"cuDNN":8005,
"GPU":{
"#0":{
"name":"Quadro RTX 6000",
"memory":"23.65GB"
},
"#1":{
"name":"Quadro RTX 6000",
"memory":"23.65GB"
}
},
"Platform":{
"system":"Linux",
"node":"4029GP-TRT",
"version":"#83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021",
"machine":"x86_64",
"processor":"x86_64"
}
}

Batch size: 512, conv layers: 11, epochs: 5

Baseline: 276.980s

| | | +cudnn_benchmark | +AMP | +cudnn_benchmark +AMP |
|---|---|---|---|---|
| DP | 163.740 | 104.807 | 74.948 | 73.862 |
| DDP | 142.497 | 102.535 | 67.095 | 72.998 |

- DP: torch.nn.DataParallel
- AMP: torch.cuda.amp
- DDP: torch.nn.parallel.DistributedDataParallel
- cudnn_benchmark: torch.backends.cudnn.benchmark = True
- pin_memory=True
- non_blocking=True
- optimizer.zero_grad(set_to_none=True)
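
The settings listed above can be combined in a single training step. The following is a minimal sketch, not the code behind the benchmark: the toy model, the random data, and the names `train_one_epoch` and `run_demo` are assumptions made for illustration, and the DP/DDP wrapping is left out. It falls back to the CPU (with AMP and pinned memory disabled) when no GPU is available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, scaler, device, use_amp):
    """Run one epoch and return the mean batch loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss = 0.0
    for images, targets in loader:
        # non_blocking=True can only overlap the copy with compute when the
        # source tensors live in pinned host memory (pin_memory=True).
        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        # set_to_none=True releases gradients instead of filling them with zeros.
        optimizer.zero_grad(set_to_none=True)
        # AMP: run the forward pass in mixed precision when enabled.
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = criterion(model(images), targets)
        # GradScaler is a pass-through when constructed with enabled=False.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    return total_loss / len(loader)


def run_demo(num_samples=64, batch_size=16):
    """Hypothetical demo on random data; returns the mean loss of one epoch."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # Let cuDNN pick the fastest conv algorithms for fixed input shapes.
    torch.backends.cudnn.benchmark = True

    dataset = TensorDataset(
        torch.randn(num_samples, 3, 32, 32),
        torch.randint(0, 10, (num_samples,)),
    )
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        pin_memory=use_cuda)
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
    return train_one_epoch(model, loader, optimizer, scaler, device,
                           use_amp=use_cuda)


if __name__ == "__main__":
    print(f"mean loss: {run_demo():.4f}")
```

`torch.backends.cudnn.benchmark` pays off mainly when input shapes stay constant between iterations, which is the case for a fixed batch size and crop size.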
# $1 is the number of epochs.
./running.sh 5
# Or run the commands in the script directly.

Drop caches before running the I/O benchmark test.
sync
# To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
# To free reclaimable slab objects (includes dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
# To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches

Batch size: 256/2, workers: 8 x 2

| | | Bottleneck | +DALI/CPU | Bottleneck | +DALI/GPU | Bottleneck |
|---|---|---|---|---|---|---|
| HDD | ~25M/s | IO | ~40M/s | IO | ~40M/s | IO |
| SSD | ~230M/s | CPU | ~500M/s | CPU | ~600M/s | IO |

- SSD
- DALI: The NVIDIA Data Loading Library
- LMDB
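
To judge whether a pipeline is limited by IO or by the CPU, it helps to know how fast the disk can deliver raw bytes. The sketch below is one way to estimate that, not the measurement used for the table above: it reads every file under a directory once and reports the rate in MB/s. The names `measure_read_throughput` and `demo` are hypothetical, and the result is only meaningful if the page cache was dropped beforehand.

```python
import os
import tempfile
import time


def measure_read_throughput(root, chunk_size=1 << 20):
    """Read every file under root once.

    Returns (megabytes_read, elapsed_seconds, megabytes_per_second).
    """
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                # Read in chunks so large files do not need to fit in memory.
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    megabytes = total_bytes / 1e6
    rate = megabytes / elapsed if elapsed > 0 else float("inf")
    return megabytes, elapsed, rate


def demo():
    """Hypothetical self-check: time reads over three 1000-byte temp files."""
    with tempfile.TemporaryDirectory() as root:
        for i in range(3):
            with open(os.path.join(root, f"sample_{i}.bin"), "wb") as f:
                f.write(b"\0" * 1000)
        return measure_read_throughput(root)


if __name__ == "__main__":
    mb, seconds, rate = demo()
    print(f"read {mb:.3f} MB in {seconds:.4f} s ({rate:.1f} MB/s)")
```

If the raw read rate is far above what the data loader achieves, decoding and augmentation on the CPU are the more likely bottleneck.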
# $1 is the script, $2 is the imagenet dataset path.
./loading.sh loading_faster.py '/datasets/ILSVRC2012/'
# Or run the commands in the script directly.

The average resolution of ImageNet images is 469x387, but they are usually cropped to 256x256 or 224x224 during preprocessing, so downscaling the stored images ahead of time speeds up reading.
With small enough images, the entire dataset can even be loaded into memory.
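
As a rough illustration of the downscaling rule, the sketch below computes the target size when the smaller edge is capped at a maximum value and the aspect ratio is preserved. The function `resized_dims` is hypothetical and is not taken from resize_imagenet.py, whose exact rounding and upscaling behavior may differ.

```python
def resized_dims(width, height, max_size):
    """Scale (width, height) so the smaller edge is at most max_size.

    The aspect ratio is preserved and images that are already small
    enough are returned unchanged (no upscaling).
    """
    smaller = min(width, height)
    if smaller <= max_size:
        return width, height
    new_width = max(1, round(width * max_size / smaller))
    new_height = max(1, round(height * max_size / smaller))
    return new_width, new_height


# An average-sized ImageNet image with the smaller edge capped at 256.
print(resized_dims(469, 387, 256))  # (310, 256)
```

At this size the image holds about 44% of the original pixel count (310 x 256 versus 469 x 387), while still covering a 224x224 or 256x256 crop.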
# N: the maximum size of the smaller edge
python resize_imagenet.py --src </path/to/imagenet> --dst </path/to/imagenet/resized> --max-size N

As reported in "Fixing the train-test resolution discrepancy", you can use a smaller image size when training models.