'not enough memory' error after backing up a training state #10
Description
My PC has 32 GB of RAM, and only about 40% of it was in use while training.
But after backing up a training state, the log says:
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!
I already set n_workers: 1 and batch_size: 1, and the allocation that failed was only 24774144 bytes.
That's less than 25 MB, and my PC has 32 GB of RAM, so why is it not enough?
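To double-check the arithmetic, here is a minimal sketch that converts the byte count from the error message into more readable units. The 4-byte element size assumes the tensor is float32 (PyTorch's default dtype), and the 32 GiB total is just my installed RAM rounded; neither value comes from the log itself.

```python
# Byte count copied from the RuntimeError message.
REQUESTED_BYTES = 24_774_144

# Assumption: the tensor being allocated is float32 (4 bytes per element).
BYTES_PER_FLOAT32 = 4

# Assumption: 32 GiB of installed RAM.
TOTAL_RAM_BYTES = 32 * 1024 ** 3


def to_mb(n_bytes: int) -> float:
    """Convert bytes to decimal megabytes (10**6 bytes)."""
    return n_bytes / 1e6


def to_mib(n_bytes: int) -> float:
    """Convert bytes to binary mebibytes (2**20 bytes)."""
    return n_bytes / (1024 ** 2)


requested_mb = to_mb(REQUESTED_BYTES)
requested_mib = to_mib(REQUESTED_BYTES)
n_elements = REQUESTED_BYTES // BYTES_PER_FLOAT32
share_of_ram = REQUESTED_BYTES / TOTAL_RAM_BYTES

print(f"requested: {requested_mb:.2f} MB ({requested_mib:.2f} MiB)")
print(f"float32 elements (if float32): {n_elements:,}")
print(f"share of 32 GiB: {share_of_ram:.4%}")
```

So the single request is tiny compared with the installed RAM. The number in the message is only the size of the one allocation that failed, not the total memory the process was holding at that moment.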
My setup:
CPU: AMD Ryzen 7 3700X (8 cores)
GPU: GeForce GTX 1660 Super
CUDA: cuda_11.0.2_451.48_win10
pytorch: torch-1.6.0-cp38-cp38-win_amd64
This is the log:
export CUDA_VISIBLE_DEVICES=0
20-08-11 07:32:47.805 - INFO: name: debug_newtest
use_tb_logger: True
model: sr
scale: 4
gpu_ids: [0]
datasets:[
train:[
name: DIV2K
mode: LRHR
dataroot_HR: ./data_samples/div2k/div2k_train_hr
dataroot_LR: ./data_samples/div2k/DIV2K_train_LR
subset_file: None
use_shuffle: True
n_workers: 1
batch_size: 1
HR_size: 64
use_flip: True
use_rot: True
phase: train
scale: 4
data_type: img
LR_nc: 3
HR_nc: 3
]
val:[
name: val_set5
mode: LRHR
dataroot_HR: ./data_samples/div2k/div2k_valid_hr
dataroot_LR: ./data_samples/div2k/DIV2K_valid_LR
phase: val
scale: 4
data_type: img
LR_nc: 3
HR_nc: 3
]
]
path:[
root: ./
pretrain_model_G: ./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth
experiments_root: ./experiments\debug_newtest
models: ./experiments\debug_newtest\models
training_state: ./experiments\debug_newtest\training_state
log: ./experiments\debug_newtest
val_images: ./experiments\debug_newtest\val_images
]
network_G:[
which_model_G: RRDB_net
norm_type: None
mode: CNA
nf: 64
nb: 23
in_nc: 3
out_nc: 3
gc: 32
group: 1
scale: 4
]
train:[
lr_G: 0.0002
lr_scheme: MultiStepLR
lr_steps: [200000, 400000, 600000, 800000]
lr_gamma: 0.5
pixel_criterion: l1
pixel_weight: 1
val_freq: 8
manual_seed: 0
niter: 1000000
lr_decay_iter: 10
]
logger:[
print_freq: 2
save_checkpoint_freq: 8
backup_freq: 2
]
is_train: True
batch_multiplier: 1
20-08-11 07:32:47.805 - INFO: Random seed: 0
20-08-11 07:32:47.815 - INFO: Dataset [LRHRDataset - DIV2K] is created.
20-08-11 07:32:47.815 - INFO: Number of train images: 800, iters: 800
20-08-11 07:32:47.815 - INFO: Total epochs needed: 1250 for iters 1,000,000
20-08-11 07:32:47.817 - INFO: Dataset [LRHRDataset - val_set5] is created.
20-08-11 07:32:47.817 - INFO: Number of val images in [val_set5]: 100
20-08-11 07:32:47.946 - INFO: Initialization method [kaiming]
20-08-11 07:32:49.037 - INFO: Loading pretrained model for G [./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth] ...
20-08-11 07:32:49.205 - INFO: Remove frequency separation.
20-08-11 07:32:49.205 - INFO: Remove feature loss.
20-08-11 07:32:49.206 - INFO: Remove HFEN loss.
20-08-11 07:32:49.207 - INFO: Remove TV loss.
20-08-11 07:32:49.207 - INFO: Remove SSIM loss.
20-08-11 07:32:49.207 - INFO: Remove LPIPS loss.
20-08-11 07:32:49.207 - INFO: Remove GAN loss.
20-08-11 07:32:49.211 - INFO: Model [SRRaGANModel] is created.
20-08-11 07:32:49.211 - INFO: Start training from epoch: 0, iter: 0
20-08-11 07:32:51.501 - INFO: <epoch: 0, iter: 2, lr:2.000e-04> l_g_pix: 7.1235e-02
20-08-11 07:32:51.796 - INFO: Backup models and training states saved.
20-08-11 07:32:52.356 - INFO: <epoch: 0, iter: 4, lr:2.000e-04> l_g_pix: 5.8562e-02
20-08-11 07:32:52.600 - INFO: Backup models and training states saved.
20-08-11 07:32:53.110 - INFO: <epoch: 0, iter: 6, lr:2.000e-04> l_g_pix: 4.0749e-02
20-08-11 07:32:53.568 - INFO: Backup models and training states saved.
20-08-11 07:32:54.078 - INFO: <epoch: 0, iter: 8, lr:2.000e-04> l_g_pix: 2.3442e-02
20-08-11 07:32:54.280 - INFO: Models and training states saved.
20-08-11 07:32:54.576 - INFO: Backup models and training states saved.
Setting up Perceptual loss...
Loading model from: D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\lpips_weights\v0.1\squeeze.pth
...[net-lin [squeeze]] initialized
...Done
Traceback (most recent call last):
File "./codes/train.py", line 252, in <module>
main()
File "./codes/train.py", line 213, in main
avg_lpips += lpips.calculate_lpips(cropped_sr_img, cropped_gt_img)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\compute_dists.py", line 33, in calculate_lpips
dist01 = model.forward(img2,img1)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 39, in forward
return self.model.forward(target, pred)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\dist_model.py", line 116, in forward
return self.net.forward(in0, in1, retPerLayer=retPerLayer)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\networks_basic.py", line 67, in forward
feats0[kk], feats1[kk] = util.normalize_tensor(outs0[kk]), util.normalize_tensor(outs1[kk])
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 42, in normalize_tensor
norm_factor = torch.sqrt(torch.sum(in_feat**2,dim=1,keepdim=True))
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!