This directory converts generated data and benchmark datasets into the formats required for pretraining and evaluation.
It supports two workflows:
- single-table preprocessing
- relational-database preprocessing
This stage is packaged through pyproject.toml.
From the repository root:

cd data_preprocessing
pip install -e .

The DFS preprocessing pipeline is partially adapted from the dbinfer project; you can also refer to that repository for additional usage details. We gratefully acknowledge their work.
- single_table_processing.sh: merges generated single-table batches into .h5 priors.
- RDB_processing.sh: end-to-end relational preprocessing pipeline.
- merge_icl_batches_to_h5.py: merges single-table generation outputs into .h5.
- merge_dbinfer_to_h5.py: converts processed RDB tasks into .h5.
- filter_h5_sampling_columns.py: downsamples columns from unsampled .h5 files.
This workflow takes raw synthetic single-table batches and merges them into pretraining-ready .h5 datasets.
Expected default inputs:
../data_generation/single_table_datasets/single_table_stage1
../data_generation/single_table_datasets/single_table_stage2
These are produced by ../data_generation/single_table/single_table_generate.sh.
Default outputs:
../model_pretrain/pretrain_datasets/single_table_stage1.h5
../model_pretrain/pretrain_datasets/single_table_stage2.h5
cd data_preprocessing
bash single_table_processing.sh

The script:
- reads synthetic batch directories
- merges them into .h5
- writes pretraining-ready files into model_pretrain/pretrain_datasets/
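As a rough sketch of the merge step above: merging generation batches amounts to concatenating per-batch arrays along the sample axis before writing them out. The array names `X`/`y` and the helper `merge_batches` are illustrative, not the repository's actual schema.

```python
import numpy as np

def merge_batches(batches):
    """Concatenate per-batch feature/target arrays into one dataset.

    `batches` is a list of dicts holding "X" (2-D features) and "y"
    (1-D targets). Both keys are illustrative placeholders.
    """
    X = np.concatenate([b["X"] for b in batches], axis=0)
    y = np.concatenate([b["y"] for b in batches], axis=0)
    return {"X": X, "y": y}

# Writing the merged arrays to .h5 would then be a plain h5py call, e.g.:
#   with h5py.File("single_table_stage1.h5", "w") as f:
#       f.create_dataset("X", data=merged["X"])
#       f.create_dataset("y", data=merged["y"])
```

For example, merging a 3-row batch with a 2-row batch yields a single 5-row dataset.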
This workflow converts raw synthetic RDBs into:
- processed task directories produced by the DFS-based preprocessing pipeline
- intermediate unsampled .h5 files
- final sampled .h5 files used for RDB_PFN pretraining
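For intuition, here is a minimal one-hop sketch of the kind of feature DFS-style preprocessing computes over a parent/child relationship. The function name and the count/mean/max aggregations are assumptions for illustration; the dbinfer-derived pipeline computes many more aggregations over deeper join paths.

```python
import numpy as np

def dfs_features(parent_ids, child_parent_ids, child_values):
    """One-hop DFS sketch: for each parent row, aggregate the matching
    child rows into count/mean/max columns."""
    feats = []
    for pid in parent_ids:
        vals = child_values[child_parent_ids == pid]
        if vals.size:
            feats.append([vals.size, vals.mean(), vals.max()])
        else:
            feats.append([0, 0.0, 0.0])  # no matching children: neutral fill
    return np.asarray(feats)
```

E.g., a parent row with child values [2.0, 4.0] gets the feature row [2, 3.0, 4.0], and a parent with no children gets [0, 0.0, 0.0].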
Expected default inputs are the raw RDB directories generated under:
../data_generation/RDB_datasets/
Default outputs include:
- intermediate .h5 files under RDB_datasets/
- sampled pretraining .h5 files under ../model_pretrain/pretrain_datasets/
cd data_preprocessing
bash RDB_processing.sh

The current pipeline performs two stages:
- It runs DFS-based preprocessing on each raw RDB directory.
- It converts the processed outputs into .h5, then downsamples columns into final pretraining datasets.
Note that DFS preprocessing can take hundreds of hours to complete; you can modify the script to run on a subset of the datasets for testing.
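Conceptually, the column-downsampling stage can be sketched as below. The function name, the `max_cols` parameter, and the RNG seeding are illustrative assumptions, not the actual interface of filter_h5_sampling_columns.py.

```python
import numpy as np

def downsample_columns(X, max_cols, seed=0):
    """Randomly keep at most `max_cols` feature columns of a 2-D array,
    preserving the original column order among the kept columns."""
    n_cols = X.shape[1]
    if n_cols <= max_cols:
        return X  # nothing to drop
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(n_cols, size=max_cols, replace=False))
    return X[:, keep]
```

Fixing the seed makes the sampled datasets reproducible across runs, which matters when the same intermediate .h5 files feed multiple pretraining experiments.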
After preprocessing:
- Use model_pretrain/pretrain_datasets/ as the training corpus for pretraining.
- Use benchmark-ready dataset directories under model_pretrain/rdb_datasets/ for evaluation.
- Continue with ../model_pretrain/README.md.