RuntimeError: CUDA error: unspecified launch failure

Hi, I'm trying to train the zero-shot cost model and ran into the following error. Could anyone help?
For reference, I'm running on Google Colab with a T4 GPU.
`!TORCH_USE_CUDA_DSA=1 CUDA_LAUNCH_BLOCKING=1 python3 train.py --train_model --workload_runs ../zero-shot-data/runs/deepdb_augmented/airline/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/airline/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/ssb/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/ssb/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/tpc_h/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/tpc_h/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/walmart/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/walmart/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/financial/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/financial/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/basketball/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/basketball/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/accidents/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/accidents/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/movielens/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/movielens/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/baseball/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/baseball/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/hepatitis/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/hepatitis/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/tournament/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/tournament/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/credit/index_workload_100k_s2_c8220.json 
../zero-shot-data/runs/deepdb_augmented/credit/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/employee/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/employee/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/consumer/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/consumer/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/geneea/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/geneea/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/genome/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/genome/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/carcinogenesis/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/carcinogenesis/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/seznam/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/seznam/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/fhnk/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/fhnk/workload_100k_s1_c8220.json --test_workload_runs ../zero-shot-data/runs/deepdb_augmented/imdb/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/synthetic_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/scale_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/job-light_c8220.json --statistics_file ../zero-shot-data/runs/deepdb_augmented/statistics_workload_combined.json --target ../zero-shot-data/evaluation/db_generalization_tune_est/ --hyperparameter_path setup/tuned_hyperparameters/tune_est_best_config.json --max_epoch_tuples 100000 --loss_class_name QLoss  --device cuda:0 --filename_model imdb_0 --num_workers 16 --database postgres --seed 0`

```
Reading hyperparameters from setup/tuned_hyperparameters/tune_est_best_config.json
No of Plans: 190000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 5000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 5000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 4565
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 382
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
No of Plans: 50
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
PostgresZeroShotModel(
  (loss_fxn): QLoss()
  (fcout): Sequential(
    (0): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=128, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (1): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (2): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (3): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=192, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
    (4): FcLayer(
      (layers): Sequential(
        (0): Linear(in_features=192, out_features=1, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
  )
  (tree_models): ModuleDict(
    (column_output_column): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
    (to_plan): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
    (intra_plan): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
    (intra_pred): MscnConv(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=256, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=153, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=153, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
    )
  )
  (node_type_encoders): ModuleDict(
    (column): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=14, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=21, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=21, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (data_type): EmbeddingInitializer(
          (embed): Embedding(10, 10)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (table): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=2, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=3, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=3, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict()
    )
    (output_column): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=5, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=7, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=7, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (aggregation): EmbeddingInitializer(
          (embed): Embedding(5, 5)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (filter_column): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=20, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=30, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=30, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (operator): EmbeddingInitializer(
          (embed): Embedding(5, 5)
          (do): Dropout(p=0.0, inplace=False)
        )
        (data_type): EmbeddingInitializer(
          (embed): Embedding(10, 10)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (plan): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=24, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=36, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=36, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (op_name): EmbeddingInitializer(
          (embed): Embedding(20, 20)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (logical_pred): NodeTypeEncoder(
      (fcout): Sequential(
        (0): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=6, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (3): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=9, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (4): FcLayer(
          (layers): Sequential(
            (0): Linear(in_features=9, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
      )
      (embeddings): ModuleDict(
        (operator): EmbeddingInitializer(
          (embed): Embedding(5, 5)
          (do): Dropout(p=0.0, inplace=False)
        )
      )
    )
  )
)
No valid checkpoint found [Errno 2] No such file or directory: '../zero-shot-data/evaluation/db_generalization_tune_est/imdb_0.pt'
Epoch 0
  0% 0/631 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/train.py", line 54, in <module>
    train_readout_hyperparams(args.workload_runs, args.test_workload_runs, args.statistics_file, args.target,
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 424, in train_readout_hyperparams
    train_model(workload_runs, test_workload_runs, statistics_file, target_dir, filename_model,
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 213, in train_model
    train_epoch(epoch_stats, train_loader, model, optimizer, max_epoch_tuples)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 29, in train_epoch
    input_model, label, sample_idxs = custom_batch_to(batch, model.device, model.label_norm)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 39, in batch_to
    recursive_to(features, device)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 24, in recursive_to
    recursive_to(v, device)
  File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 21, in recursive_to
    iterable.to(device)
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
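Separately from the crash, the `UserWarning` in the log says that `--num_workers 16` exceeds the 8 workers suggested for this Colab machine. This is probably unrelated to the CUDA error, but it is easy to rule out. Below is a minimal sketch of how the requested worker count could be capped before it is passed to the `DataLoader`. The helper name `capped_num_workers` is hypothetical and not part of the repository, and the CPU-count logic is an assumption about what a sensible limit looks like.

```python
import os


def capped_num_workers(requested: int) -> int:
    """Clamp a requested DataLoader worker count to the CPUs usable by this process.

    Hypothetical helper for illustration; not part of zero-shot-cost-estimation.
    """
    if hasattr(os, "sched_getaffinity"):
        # On Linux (e.g. Colab) this respects CPU affinity / container limits.
        available = len(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without sched_getaffinity.
        available = os.cpu_count() or 1
    # Never go below 0 (0 means "load data in the main process").
    return max(0, min(requested, available))


if __name__ == "__main__":
    # Example: the command above requests 16 workers.
    print(capped_num_workers(16))
```

The resulting value could then be passed as `--num_workers` instead of the hard-coded 16.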
