This repository accompanies the manuscript "Predicting saturated hydraulic conductivity from particle size distributions using machine learning" (de Rijk et al. 2024; https://doi.org/10.1007/s00477-024-02861-6).
The repository provides routines to perform particle size distribution (PSD) analysis, particularly workflows to estimate hydraulic conductivity with six Machine Learning (ML) algorithms:
- Decision Tree (DT)
- Random Forest (RF)
- XGBoost (XG)
- Linear Regression (LR)
- Support Vector Regression (SVR)
- Artificial Neural Network (ANN)
The package also includes methods for identification of properties, like grain diameter percentiles (d10, d50, d60 etc) and for calculation of hydraulic conductivity through empirical formulas.
The algorithms are tested on soil sample data from the "TopIntegraal" project provided by TNO. (data not yet avaialbe due to license issues, planned to be provided soon)
-
README.md- description of the project -
LICENSE- the default license is MIT -
requirements.txt- requirements for pip to install all needed packages (see below) -
data/- contains the data used in this project, extracted from the TopIntegral data set, including PSD data and hydraulic conductivity of 4593 samples and porosity measurements where available:data_PSD_Kf_por.csv- measured quantities for the 4593 samples: each row contains one sample; columns are: sieve size fractions (=PSD data) in micrometer (column header starting with F), measured hydraulic conductivity from permeameter in m/d (column Kf), log-transformed hydraulic conductivity (column logK), porosity measurements (column porosity, for those sample where available) and the specification of the lithoclass from TopIntegraal (column litho_measured)data_PSD_Kf_por_props.csv- same asdata_PSD_Kf_por.csvplus two columns on soil classes and main lithology (re-)determined from the PSDdata_PSD_Kf_por_props_Kemp.csv- same asdata_PSD_Kf_por_props.csvplus five columns on estimates of hydraulic conductivity with empirical methods (column headers specify method type)
-
results/- results of processed data (algorithm performance) and plots used in publication:Kemp_all.csv- estimated Kf values of all samples for 15 empirical methods, inlcuding specification of applicability
Data_analysis/- results of analysis of PSD for samples:data_PSD_props.csv- results of PSD analysis for all samples (grain diameters, percentage sand/silt/lutum, lithoclass)data_full_stats.csv- statistical results (mean, std, percentiles,...) of properties (Kf, percentage sand/silt/lutum, ...) for all samplesdata_sand_stats.csv- statistical results (mean, std, percentiles,...) of properties (Kf, percentage sand/silt/lutum, ...) for subset of sand samplesdata_silt_stats.csv- statistical results (mean, std, percentiles,...) of properties (Kf, percentage sand/silt/lutum, ...) for subset of silt samplesdata_clay_stats.csv- statistical results (mean, std, percentiles,...) of properties (Kf, percentage sand/silt/lutum, ...) for subset of clay samplesdata_por_stats.csv- statistical results (mean, std, percentiles,...) of properties (Kf, percentage sand/silt/lutum, porosity, ...) for subset of samples with porosity
ML_Performance/- performance measure R^2 or MSE for all 6 ML algorithms on training, testing and all sample for:Performance_PSD_Kf_topall_r2.csv- feature varialble PSD to target variable Kf for data set "Top-All" (R^2)Performance_PSD_Kf_topall_mse.csv- feature varialble PSD to target variable Kf for data set "Top-All" (MSE)Performance_PSD_Kf_sand_r2.csv- feature varialble PSD to target variable Kf for data set "Top-Sand" (R^2)Performance_PSD_Kf_sand_mse.csv- feature varialble PSD to target variable Kf for data set "Top-Sand" (MSE)Performance_PSD_Kf_silt_r2.csv- feature varialble PSD to target variable Kf for data set "Top-Silt" (R^2)Performance_PSD_Kf_silt_mse.csv- feature varialble PSD to target variable Kf for data set "Top-Silt" (MSE)Performance_PSD_Kf_clay_r2.csv- feature varialble PSD to target variable Kf for data set "Top-Clay" (R^2)Performance_PSD_Kf_clay_mse.csv- feature varialble PSD to target variable Kf for data set "Top-Clay" (MSE)Performance_PSD_Kf_por_r2.csv- feature varialble PSD to target variable porosity for data set "Top-Por" (R^2)Performance_PSD_Kf_por_mse.csv- feature varialble PSD to target variable porosity for data set "Top-Por" (MSE)Performance_dX_Kf_topall_r2.csv- feature varialble grain diameters (d_X) to target variable Kf for data set "Top-All" (R^2)Performance_dX_Kf_topall_mse.csv- feature varialble grain diameters (d_X) to target variable Kf for data set "Top-All" (MSE)Performance_dX_Kf_por_r2.csv- feature varialble grain diameters (d_X) to target variable Kf for data set "Top-Por" (R^2)Performance_dX_Kf_por_mse.csv- feature varialble grain diameters (d_X) to target variable Kf for data set "Top-Por" (MSE)Performance_dX_por_Kf_por_r2.csv- feature varialble grain diameters (d_X) and porosity to target variable Kf for data set "Top-Por" (R^2)Performance_dX_por_Kf_por_mse.csv- feature varialble grain diameters (d_X) and porosity to target variable Kf for data set "Top-Por" (MSE)Performance_PSD_por_por_r2.csv- feature varialble PSD to target variable porosity for data set "Top-Por" (R^2)Performance_PSD_por_por_mse.csv- feature varialble PSD to target variable porosity for data set "Top-Por" (MSE)
Figures_paper/- Figures of results as displayed in the main manuscript of accompanying publication:Fig01_Bar_NSE_PSD_Kf_topall.pdfFig02_Bar_NSE_PSD_Kf_soiltypes.pdfFig03_Scatter_Measured_topall.pdfFig04_Scatter_RF_Barr.pdfFig05_Feature_importance_RF_topall.pdfFig06_Scatter_Measured_dX.pdfFig07_Bar_NSE_features.pdf
Figures_SI/- Figures of results as displayed in the supporting information of accompanying publication:SI_Fig_Bar_NSE_dX_Kf_por.pdfSI_Fig_Bar_NSE_dX_Kf_topall.pdfSI_Fig_Bar_NSE_dX_por_Kf_por.pdfSI_Fig_Bar_NSE_PSD_Kf_clay.pdfSI_Fig_Bar_NSE_PSD_Kf_por.pdfSI_Fig_Bar_NSE_PSD_Kf_sand.pdfSI_Fig_Bar_NSE_PSD_Kf_silt.pdfSI_Fig_Bar_NSE_PSD_por_por.pdfSI_Fig_FeatureImportance_RF_soils.pdfSI_Fig_FeatureImportance_topall.pdfSI_Fig_Histogram_Kf.pdfSI_Fig_Scatter_Kemp.pdfSI_Fig_Scatter_Measured_clay.pdfSI_Fig_Scatter_Measured_dX_por_Kf.pdfSI_Fig_Scatter_Measured_PSD_por.pdfSI_Fig_Scatter_Measured_sand.pdfSI_Fig_Scatter_Measured_silt.pdf
-
src/- contains all scripts used for data analyses and plotting of results-
PSD_Analysis.py- script containing class "PSD_Analysis" for analysis of PSD (e.g. calculation of dX values, lithoclass) -
PSD_K_empirical.py- script containing class "PSD_to_K_Empirical" to calculate Kf from 15 different empirical formulas based on PSD information -
PSD_2K_ML.py- script containing class "PSD_2K_ML" to perform machine learning on data set -
data_dictionaries.py- script containing dictionaries with hyperparameters for the 6 ML algorithms, all feature/target combination, all data(sub)sets -
00_data_processing.py- preprocessing of raw data to transform into dataframe stored in csv file with standard format -
01_sample_data_statistics.py- Script performing data analysis of PSD and derived quantities (e.g. d10, d50, d60 etc) for all sub-datasets results are saved to "./results/Data_analysis/" -
02_K_empiricial.py- script calculating Kf from PSD information using empirical formulas implemented in class "PSD_K_empirical" for the Top-Integral data set -
03_ML_Hyperparam.py- Script performing hyperparameter testing for list of algorithms and selected data set (based on soil type) -
03_ML_Hyperparam_GridSearch.py- Script performing hyperparameter testing using GridSearch for a selected algorithm and data set type -
03_ML_Hyperparam_skopt.py- - Script performing hyperparameter testing using SKopt for a selected algorithm and data set type: -
04_ML_TrainingPerformance.py- Script evaluating performance of all six ML algorithms after training -
04_ML_TrainingPerformance_all.py- Script evaluating performance of a selected ML algorithms -
F01_Bar_NSE_AllAlgorithms_TopAll.py- reproducing Figure 1 of the manuscript -
F01_Bar_NSE_AllAlgorithms_single.py- reproducing each subplot of Figure 1 of the manuscript -
F02_Bar_NSE_AllAlgorithms_soils.py- reproducing Figure 2 of the manuscript -
F03_Scatter_vs_Measured.py- reproducing Figure 3 of the manuscript -
F03_Scatter_vs_Measured_single.py- reproducing subplots of Figure 3 of the manuscript -
F04_Scatter_vs_Empiricial.py- reproducing Figure 4 of the manuscript -
F05_FeatureImportance.py- reproducing Figure 5 of the manuscript -
F06_Scatter_vs_Measured_dX.py- reproducing Figure 6 of the manuscript -
F07_Bar_NSE_AllAlgorithms_features.py- reproducing Figure 7 of the manuscript -
SI_Bar_NSE_AllAlgorithms.py- reproducing figures with barplots of the SI -
SI_Fig_FeatureImportance_RF_soils.py- reproducing figures on feature importance of the SI -
SI_Fig_FeatureImportance_topall.py- reproducing figures on feature importance of the SI -
SI_Histogram_Kf_ML.py- producing figure with histograms of estimated Kf of the SI -
SI_Histogram_Measured_soils.py- reproducing figure of histograms of measured Kf of the SI -
SI_plot_PSD.py- reproducing figure with PSD curves of the SI -
SI_Scatter_Kemp.py- reproducing figure of scatter plots on empirical formulas of the SI -
SI_Scatter_vs_Measured.py- reproducing figures on scatterplots of Kf of the SI -
SI_Scatter_vs_Measured_por.py- reproducing figures on scatterplots of porosity of the SI
-
To locally run the scripts, clone the repository and (optionally) create a virtual environment. You can do that by running these commands in a terminl:
cd path/to/project_folder/PSD_Analysis_MachineLearning
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt(to activate the environment on Windows, use the command venv\Scripts\activate).
You can contact us via a.zech@uu.nl.
MIT © 2025