This repository was archived by the owner on Oct 1, 2020. It is now read-only.

'not enough memory' error after backing up a training state #10


Description

@satoshils

My PC has 32 GB of RAM, and only about 40% of it was in use while training.
But after backing up a training state, the log says:
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!
I already set n_workers: 1 and batch_size: 1, yet the run still fails while trying to allocate only 24774144 bytes.
That's less than 25 MB, and my PC has 32 GB of RAM, so why is it not enough?
My setup:
CPU: AMD Ryzen 3700X (8 cores)
GPU: GeForce GTX 1660 Super
CUDA: cuda_11.0.2_451.48_win10
PyTorch: torch-1.6.0-cp38-cp38-win_amd64
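
On Windows, DefaultCPUAllocator raises this error whenever a single allocation request fails, which can happen if the commit limit (physical RAM plus pagefile) is exhausted or the interpreter is 32-bit, even while Task Manager still shows free physical RAM. To rule that out, the headroom of the training process could be checked with something like the minimal sketch below; it assumes the third-party psutil package, which is not part of this repository.

import platform
import struct

import psutil  # assumption: third-party package, installed via `pip install psutil`
import torch


def report_memory_headroom():
    """Print interpreter bitness and the current memory headroom.

    On Windows a CPU allocation can fail with 'not enough memory' when the
    commit limit (physical RAM + pagefile) is exhausted, even though free
    physical RAM is still reported.
    """
    print(f"Python {platform.python_version()} ({struct.calcsize('P') * 8}-bit)")
    print(f"PyTorch {torch.__version__}")

    vm = psutil.virtual_memory()
    proc = psutil.Process()
    print(f"Total RAM: {vm.total / 2**30:.1f} GiB, available: {vm.available / 2**30:.1f} GiB")
    print(f"Training process RSS: {proc.memory_info().rss / 2**30:.2f} GiB")


if __name__ == "__main__":
    report_memory_headroom()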

This is the log:

export CUDA_VISIBLE_DEVICES=0
20-08-11 07:32:47.805 - INFO: name: debug_newtest
use_tb_logger: True
model: sr
scale: 4
gpu_ids: [0]
datasets:[
train:[
name: DIV2K
mode: LRHR
dataroot_HR: ./data_samples/div2k/div2k_train_hr
dataroot_LR: ./data_samples/div2k/DIV2K_train_LR
subset_file: None
use_shuffle: True
n_workers: 1
batch_size: 1
HR_size: 64
use_flip: True
use_rot: True
phase: train
scale: 4
data_type: img
LR_nc: 3
HR_nc: 3
]
val:[
name: val_set5
mode: LRHR
dataroot_HR: ./data_samples/div2k/div2k_valid_hr
dataroot_LR: ./data_samples/div2k/DIV2K_valid_LR
phase: val
scale: 4
data_type: img
LR_nc: 3
HR_nc: 3
]
]
path:[
root: ./
pretrain_model_G: ./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth
experiments_root: ./experiments\debug_newtest
models: ./experiments\debug_newtest\models
training_state: ./experiments\debug_newtest\training_state
log: ./experiments\debug_newtest
val_images: ./experiments\debug_newtest\val_images
]
network_G:[
which_model_G: RRDB_net
norm_type: None
mode: CNA
nf: 64
nb: 23
in_nc: 3
out_nc: 3
gc: 32
group: 1
scale: 4
]
train:[
lr_G: 0.0002
lr_scheme: MultiStepLR
lr_steps: [200000, 400000, 600000, 800000]
lr_gamma: 0.5
pixel_criterion: l1
pixel_weight: 1
val_freq: 8
manual_seed: 0
niter: 1000000
lr_decay_iter: 10
]
logger:[
print_freq: 2
save_checkpoint_freq: 8
backup_freq: 2
]
is_train: True
batch_multiplier: 1

20-08-11 07:32:47.805 - INFO: Random seed: 0
20-08-11 07:32:47.815 - INFO: Dataset [LRHRDataset - DIV2K] is created.
20-08-11 07:32:47.815 - INFO: Number of train images: 800, iters: 800
20-08-11 07:32:47.815 - INFO: Total epochs needed: 1250 for iters 1,000,000
20-08-11 07:32:47.817 - INFO: Dataset [LRHRDataset - val_set5] is created.
20-08-11 07:32:47.817 - INFO: Number of val images in [val_set5]: 100
20-08-11 07:32:47.946 - INFO: Initialization method [kaiming]
20-08-11 07:32:49.037 - INFO: Loading pretrained model for G [./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth] ...
20-08-11 07:32:49.205 - INFO: Remove frequency separation.
20-08-11 07:32:49.205 - INFO: Remove feature loss.
20-08-11 07:32:49.206 - INFO: Remove HFEN loss.
20-08-11 07:32:49.207 - INFO: Remove TV loss.
20-08-11 07:32:49.207 - INFO: Remove SSIM loss.
20-08-11 07:32:49.207 - INFO: Remove LPIPS loss.
20-08-11 07:32:49.207 - INFO: Remove GAN loss.
20-08-11 07:32:49.211 - INFO: Model [SRRaGANModel] is created.
20-08-11 07:32:49.211 - INFO: Start training from epoch: 0, iter: 0
20-08-11 07:32:51.501 - INFO: <epoch: 0, iter: 2, lr:2.000e-04> l_g_pix: 7.1235e-02
20-08-11 07:32:51.796 - INFO: Backup models and training states saved.
20-08-11 07:32:52.356 - INFO: <epoch: 0, iter: 4, lr:2.000e-04> l_g_pix: 5.8562e-02
20-08-11 07:32:52.600 - INFO: Backup models and training states saved.
20-08-11 07:32:53.110 - INFO: <epoch: 0, iter: 6, lr:2.000e-04> l_g_pix: 4.0749e-02
20-08-11 07:32:53.568 - INFO: Backup models and training states saved.
20-08-11 07:32:54.078 - INFO: <epoch: 0, iter: 8, lr:2.000e-04> l_g_pix: 2.3442e-02
20-08-11 07:32:54.280 - INFO: Models and training states saved.
20-08-11 07:32:54.576 - INFO: Backup models and training states saved.
Setting up Perceptual loss...
Loading model from: D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\lpips_weights\v0.1\squeeze.pth
...[net-lin [squeeze]] initialized
...Done
Traceback (most recent call last):
File "./codes/train.py", line 252, in
main()
File "./codes/train.py", line 213, in main
avg_lpips += lpips.calculate_lpips(cropped_sr_img, cropped_gt_img)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\compute_dists.py", line 33, in calculate_lpips
dist01 = model.forward(img2,img1)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 39, in forward
return self.model.forward(target, pred)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\dist_model.py", line 116, in forward
return self.net.forward(in0, in1, retPerLayer=retPerLayer)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\networks_basic.py", line 67, in forward
feats0[kk], feats1[kk] = util.normalize_tensor(outs0[kk]), util.normalize_tensor(outs1[kk])
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 42, in normalize_tensor
norm_factor = torch.sqrt(torch.sum(in_feat**2,dim=1,keepdim=True))
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!
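
The traceback shows the failure happening on the CPU inside the LPIPS validation path (calculate_lpips in compute_dists.py), not in the training step itself. One common way such a validation loop runs out of CPU memory is by keeping autograd graphs for every validation image. The sketch below is an assumption about how the metric could be called, not the repository's confirmed usage: it wraps the distance computation from the traceback in torch.no_grad() so the SqueezeNet features are not retained for backpropagation.

import torch

# Hypothetical wrapper around the LPIPS model seen in the traceback.
# `lpips_model` is assumed to expose .forward(target, pred) as shown in
# perceptual_loss.py; the image tensors are assumed to already be in the
# format the model expects.
def lpips_distance_no_grad(lpips_model, sr_img, gt_img):
    """Compute an LPIPS distance without recording gradients."""
    with torch.no_grad():
        dist = lpips_model.forward(gt_img, sr_img)
    # Reduce to a plain float so no tensor (or graph) is retained while
    # averaging over many validation crops.
    return float(dist.mean()) if torch.is_tensor(dist) else float(dist)

Accumulating plain floats per crop (train.py already crops into cropped_sr_img and cropped_gt_img) keeps the peak CPU footprint of the validation pass small, regardless of how many images are in the val set.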
