This repository was archived by the owner on Oct 1, 2020. It is now read-only.

'not enough memory' error after backing up a training state #10


Description

@satoshils

My PC has 32 GB of RAM, and only about 40% of it was in use while training.
But after backing up a training state, the log says:
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!
I already set n_workers: 1 and batch_size: 1, yet the run still fails while trying to allocate only 24774144 bytes.
That's less than 25 MB, and my PC has 32 GB of RAM, so why is it not enough?
My setup:
CPU: AMD Ryzen 3700X (8 cores)
GPU: GeForce GTX 1660 Super
CUDA: cuda_11.0.2_451.48_win10
PyTorch: torch-1.6.0-cp38-cp38-win_amd64
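
On Windows, DefaultCPUAllocator raises this error whenever a single allocation request fails, which can happen if the commit limit (physical RAM plus pagefile) is exhausted or the interpreter is 32-bit, even while Task Manager still shows free physical RAM. To rule that out, the headroom of the training process could be checked with something like the minimal sketch below; it assumes the third-party psutil package, which is not part of this repository.

import platform
import struct

import psutil  # assumption: third-party package, installed via `pip install psutil`
import torch


def report_memory_headroom():
    """Print interpreter bitness and the current memory headroom.

    On Windows a CPU allocation can fail with 'not enough memory' when the
    commit limit (physical RAM + pagefile) is exhausted, even though free
    physical RAM is still reported.
    """
    print(f"Python {platform.python_version()} ({struct.calcsize('P') * 8}-bit)")
    print(f"PyTorch {torch.__version__}")

    vm = psutil.virtual_memory()
    proc = psutil.Process()
    print(f"Total RAM: {vm.total / 2**30:.1f} GiB, available: {vm.available / 2**30:.1f} GiB")
    print(f"Training process RSS: {proc.memory_info().rss / 2**30:.2f} GiB")


if __name__ == "__main__":
    report_memory_headroom()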

This is the log:

export CUDA_VISIBLE_DEVICES=0
20-08-11 07:32:47.805 - INFO: name: debug_newtest
use_tb_logger: True
model: sr
scale: 4
gpu_ids: [0]
datasets:[
train:[
name: DIV2K
mode: LRHR
dataroot_HR: ./data_samples/div2k/div2k_train_hr
dataroot_LR: ./data_samples/div2k/DIV2K_train_LR
subset_file: None
use_shuffle: True
n_workers: 1
batch_size: 1
HR_size: 64
use_flip: True
use_rot: True
phase: train
scale: 4
data_type: img
LR_nc: 3
HR_nc: 3
]
val:[
name: val_set5
mode: LRHR
dataroot_HR: ./data_samples/div2k/div2k_valid_hr
dataroot_LR: ./data_samples/div2k/DIV2K_valid_LR
phase: val
scale: 4
data_type: img
LR_nc: 3
HR_nc: 3
]
]
path:[
root: ./
pretrain_model_G: ./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth
experiments_root: ./experiments\debug_newtest
models: ./experiments\debug_newtest\models
training_state: ./experiments\debug_newtest\training_state
log: ./experiments\debug_newtest
val_images: ./experiments\debug_newtest\val_images
]
network_G:[
which_model_G: RRDB_net
norm_type: None
mode: CNA
nf: 64
nb: 23
in_nc: 3
out_nc: 3
gc: 32
group: 1
scale: 4
]
train:[
lr_G: 0.0002
lr_scheme: MultiStepLR
lr_steps: [200000, 400000, 600000, 800000]
lr_gamma: 0.5
pixel_criterion: l1
pixel_weight: 1
val_freq: 8
manual_seed: 0
niter: 1000000
lr_decay_iter: 10
]
logger:[
print_freq: 2
save_checkpoint_freq: 8
backup_freq: 2
]
is_train: True
batch_multiplier: 1

20-08-11 07:32:47.805 - INFO: Random seed: 0
20-08-11 07:32:47.815 - INFO: Dataset [LRHRDataset - DIV2K] is created.
20-08-11 07:32:47.815 - INFO: Number of train images: 800, iters: 800
20-08-11 07:32:47.815 - INFO: Total epochs needed: 1250 for iters 1,000,000
20-08-11 07:32:47.817 - INFO: Dataset [LRHRDataset - val_set5] is created.
20-08-11 07:32:47.817 - INFO: Number of val images in [val_set5]: 100
20-08-11 07:32:47.946 - INFO: Initialization method [kaiming]
20-08-11 07:32:49.037 - INFO: Loading pretrained model for G [./experiments/pretrained_models/4x_ArtStation1337_FatalityMKII90000G_05_rebout_02.pth] ...
20-08-11 07:32:49.205 - INFO: Remove frequency separation.
20-08-11 07:32:49.205 - INFO: Remove feature loss.
20-08-11 07:32:49.206 - INFO: Remove HFEN loss.
20-08-11 07:32:49.207 - INFO: Remove TV loss.
20-08-11 07:32:49.207 - INFO: Remove SSIM loss.
20-08-11 07:32:49.207 - INFO: Remove LPIPS loss.
20-08-11 07:32:49.207 - INFO: Remove GAN loss.
20-08-11 07:32:49.211 - INFO: Model [SRRaGANModel] is created.
20-08-11 07:32:49.211 - INFO: Start training from epoch: 0, iter: 0
20-08-11 07:32:51.501 - INFO: <epoch: 0, iter: 2, lr:2.000e-04> l_g_pix: 7.1235e-02
20-08-11 07:32:51.796 - INFO: Backup models and training states saved.
20-08-11 07:32:52.356 - INFO: <epoch: 0, iter: 4, lr:2.000e-04> l_g_pix: 5.8562e-02
20-08-11 07:32:52.600 - INFO: Backup models and training states saved.
20-08-11 07:32:53.110 - INFO: <epoch: 0, iter: 6, lr:2.000e-04> l_g_pix: 4.0749e-02
20-08-11 07:32:53.568 - INFO: Backup models and training states saved.
20-08-11 07:32:54.078 - INFO: <epoch: 0, iter: 8, lr:2.000e-04> l_g_pix: 2.3442e-02
20-08-11 07:32:54.280 - INFO: Models and training states saved.
20-08-11 07:32:54.576 - INFO: Backup models and training states saved.
Setting up Perceptual loss...
Loading model from: D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\lpips_weights\v0.1\squeeze.pth
...[net-lin [squeeze]] initialized
...Done
Traceback (most recent call last):
File "./codes/train.py", line 252, in
main()
File "./codes/train.py", line 213, in main
avg_lpips += lpips.calculate_lpips(cropped_sr_img, cropped_gt_img)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\compute_dists.py", line 33, in calculate_lpips
dist01 = model.forward(img2,img1)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 39, in forward
return self.model.forward(target, pred)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\dist_model.py", line 116, in forward
return self.net.forward(in0, in1, retPerLayer=retPerLayer)
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\networks_basic.py", line 67, in forward
feats0[kk], feats1[kk] = util.normalize_tensor(outs0[kk]), util.normalize_tensor(outs1[kk])
File "D:\FUN\GAME\3ds\Texture\ESRGAN\train_AI\BasicSR-lite\codes\models\modules\LPIPS\perceptual_loss.py", line 42, in normalize_tensor
norm_factor = torch.sqrt(torch.sum(in_feat**2,dim=1,keepdim=True))
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 24774144 bytes. Buy new RAM!
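
The traceback shows the failure happening on the CPU inside the LPIPS validation path (calculate_lpips in compute_dists.py), not in the training step itself. One common way such a validation loop runs out of CPU memory is by keeping autograd graphs for every validation image. The sketch below is an assumption about how the metric could be called, not the repository's confirmed usage: it wraps the distance computation from the traceback in torch.no_grad() so the SqueezeNet features are not retained for backpropagation.

import torch

# Hypothetical wrapper around the LPIPS model seen in the traceback.
# `lpips_model` is assumed to expose .forward(target, pred) as shown in
# perceptual_loss.py; the image tensors are assumed to already be in the
# format the model expects.
def lpips_distance_no_grad(lpips_model, sr_img, gt_img):
    """Compute an LPIPS distance without recording gradients."""
    with torch.no_grad():
        dist = lpips_model.forward(gt_img, sr_img)
    # Reduce to a plain float so no tensor (or graph) is retained while
    # averaging over many validation crops.
    return float(dist.mean()) if torch.is_tensor(dist) else float(dist)

Accumulating plain floats per crop (train.py already crops into cropped_sr_img and cropped_gt_img) keeps the peak CPU footprint of the validation pass small, regardless of how many images are in the val set.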
