Hi, I'm trying to train the zero-shot cost model and ran into the following error. Could anyone help?
For reference, I'm running on Google Colab with a T4 GPU.
!TORCH_USE_CUDA_DSA=1 CUDA_LAUNCH_BLOCKING=1 python3 train.py --train_model --workload_runs ../zero-shot-data/runs/deepdb_augmented/airline/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/airline/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/ssb/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/ssb/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/tpc_h/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/tpc_h/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/walmart/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/walmart/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/financial/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/financial/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/basketball/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/basketball/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/accidents/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/accidents/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/movielens/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/movielens/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/baseball/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/baseball/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/hepatitis/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/hepatitis/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/tournament/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/tournament/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/credit/index_workload_100k_s2_c8220.json 
../zero-shot-data/runs/deepdb_augmented/credit/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/employee/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/employee/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/consumer/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/consumer/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/geneea/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/geneea/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/genome/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/genome/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/carcinogenesis/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/carcinogenesis/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/seznam/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/seznam/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/fhnk/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/fhnk/workload_100k_s1_c8220.json --test_workload_runs ../zero-shot-data/runs/deepdb_augmented/imdb/index_workload_100k_s2_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/workload_100k_s1_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/synthetic_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/scale_c8220.json ../zero-shot-data/runs/deepdb_augmented/imdb/job-light_c8220.json --statistics_file ../zero-shot-data/runs/deepdb_augmented/statistics_workload_combined.json --target ../zero-shot-data/evaluation/db_generalization_tune_est/ --hyperparameter_path setup/tuned_hyperparameters/tune_est_best_config.json --max_epoch_tuples 100000 --loss_class_name QLoss --device cuda:0 --filename_model imdb_0 --num_workers 16 --database postgres --seed 0
Reading hyperparameters from setup/tuned_hyperparameters/tune_est_best_config.json
No of Plans: 190000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
No of Plans: 5000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
No of Plans: 5000
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
No of Plans: 4565
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
No of Plans: 382
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
No of Plans: 50
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
PostgresZeroShotModel(
(loss_fxn): QLoss()
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=128, out_features=192, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=192, out_features=192, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=192, out_features=192, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=192, out_features=192, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=192, out_features=1, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(tree_models): ModuleDict(
(column_output_column): MscnConv(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=256, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
)
(to_plan): MscnConv(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=256, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
)
(intra_plan): MscnConv(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=256, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
)
(intra_pred): MscnConv(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=256, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=153, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=153, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
)
)
(node_type_encoders): ModuleDict(
(column): NodeTypeEncoder(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=14, out_features=21, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=21, out_features=21, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=21, out_features=21, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=21, out_features=21, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=21, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(embeddings): ModuleDict(
(data_type): EmbeddingInitializer(
(embed): Embedding(10, 10)
(do): Dropout(p=0.0, inplace=False)
)
)
)
(table): NodeTypeEncoder(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=2, out_features=3, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=3, out_features=3, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=3, out_features=3, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=3, out_features=3, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=3, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(embeddings): ModuleDict()
)
(output_column): NodeTypeEncoder(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=5, out_features=7, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=7, out_features=7, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=7, out_features=7, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=7, out_features=7, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=7, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(embeddings): ModuleDict(
(aggregation): EmbeddingInitializer(
(embed): Embedding(5, 5)
(do): Dropout(p=0.0, inplace=False)
)
)
)
(filter_column): NodeTypeEncoder(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=20, out_features=30, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=30, out_features=30, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=30, out_features=30, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=30, out_features=30, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=30, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(embeddings): ModuleDict(
(operator): EmbeddingInitializer(
(embed): Embedding(5, 5)
(do): Dropout(p=0.0, inplace=False)
)
(data_type): EmbeddingInitializer(
(embed): Embedding(10, 10)
(do): Dropout(p=0.0, inplace=False)
)
)
)
(plan): NodeTypeEncoder(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=24, out_features=36, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=36, out_features=36, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=36, out_features=36, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=36, out_features=36, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=36, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(embeddings): ModuleDict(
(op_name): EmbeddingInitializer(
(embed): Embedding(20, 20)
(do): Dropout(p=0.0, inplace=False)
)
)
)
(logical_pred): NodeTypeEncoder(
(fcout): Sequential(
(0): FcLayer(
(layers): Sequential(
(0): Linear(in_features=6, out_features=9, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(1): FcLayer(
(layers): Sequential(
(0): Linear(in_features=9, out_features=9, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(2): FcLayer(
(layers): Sequential(
(0): Linear(in_features=9, out_features=9, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(3): FcLayer(
(layers): Sequential(
(0): Linear(in_features=9, out_features=9, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
(4): FcLayer(
(layers): Sequential(
(0): Linear(in_features=9, out_features=128, bias=True)
(1): LeakyReLU(negative_slope=0.01, inplace=True)
)
)
)
(embeddings): ModuleDict(
(operator): EmbeddingInitializer(
(embed): Embedding(5, 5)
(do): Dropout(p=0.0, inplace=False)
)
)
)
)
)
No valid checkpoint found [Errno 2] No such file or directory: '../zero-shot-data/evaluation/db_generalization_tune_est/imdb_0.pt'
Epoch 0
0% 0/631 [00:12<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/train.py", line 54, in <module>
train_readout_hyperparams(args.workload_runs, args.test_workload_runs, args.statistics_file, args.target,
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 424, in train_readout_hyperparams
train_model(workload_runs, test_workload_runs, statistics_file, target_dir, filename_model,
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 213, in train_model
train_epoch(epoch_stats, train_loader, model, optimizer, max_epoch_tuples)
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/train.py", line 29, in train_epoch
input_model, label, sample_idxs = custom_batch_to(batch, model.device, model.label_norm)
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 39, in batch_to
recursive_to(features, device)
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 24, in recursive_to
recursive_to(v, device)
File "/content/drive/MyDrive/Colab Notebooks/FYP/zero-shot-cost-estimation/models/training/utils.py", line 21, in recursive_to
iterable.to(device)
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
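Side note on the DataLoader warning that repeats near the top of the log: the run requests `--num_workers 16`, while PyTorch reports a suggested maximum of 8 on this machine. This is not necessarily related to the CUDA error. Below is a minimal sketch of how a worker count could be capped before being passed to the script. The helper names `usable_cpu_count` and `cap_num_workers` are hypothetical and not part of the repository, and it assumes the limit that matters is the number of CPUs the process is allowed to run on.

```python
import os
from typing import Optional


def usable_cpu_count() -> int:
    """Return the number of CPUs this process may run on (at least 1)."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: honours the CPU affinity mask of the current process.
        return max(len(os.sched_getaffinity(0)), 1)
    # Other platforms: fall back to the machine-wide CPU count.
    return os.cpu_count() or 1


def cap_num_workers(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested DataLoader worker count to the available CPUs.

    `requested` is the value one would otherwise pass as --num_workers;
    0 is kept as-is (data loading then happens in the main process).
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if available is None:
        available = usable_cpu_count()
    return min(requested, max(available, 1))


if __name__ == "__main__":
    # With the limit of 8 reported in the log, a request for 16 is capped.
    print(cap_num_workers(16, available=8))
```

With a cap like this, the command above would be run with `--num_workers 8` on this Colab instance, which should at least silence the warning.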