This repository is a refactored, fully reproducible implementation of a Silver Medal solution to the
Kaggle NeurIPS 2025 – Open Polymer Prediction competition.
The pipeline combines:
- SMILES canonicalization and data cleaning (RDKit)
- Fingerprint + descriptor feature engineering
- Group-aware cross-validation (to avoid data leakage)
- Traditional ML models (CatBoost, XGBoost)
- Optional GNN model (PyTorch Geometric)
- Out-of-Fold (OOF) stacking and ensemble learning
- Optuna-based weight optimization (optional)
The goal is to provide:
✅ higher stability
✅ better generalization
✅ strict reproducibility
✅ clean project structure
Kaggle-NeurIPS-Open-Polymer/
├── README.md
├── requirements.txt
├── config.py
├── dataset.csv
├── src/
│   ├── train.py
│   ├── features.py
│   ├── models.py
│   ├── utils/
│   │   ├── smiles_utils.py
│   │   └── cv_utils.py
│   └── stacking/
│       └── optuna_stack.py
├── outputs/
│   ├── models/
│   ├── feats/
│   └── oof/
└── tests/
    └── test_smiles.py
We strongly recommend using Conda to install RDKit.
conda create -n polymer_env python=3.10 -y
conda activate polymer_env
conda install -c conda-forge rdkit -y
pip install -r requirements.txt
SMILES canonicalization and data cleaning:
• SMILES are converted to canonical isomeric form with RDKit
• Invalid SMILES are logged to outputs/oof/failed_smiles.csv
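The cleaning step above could look roughly like the sketch below. It is not the code in src/utils/smiles_utils.py; the function names `canonicalize` and `clean_smiles` and the CSV column names are assumptions made for illustration. Only the RDKit calls `Chem.MolFromSmiles` and `Chem.MolToSmiles` are real library API.

```python
import csv
from pathlib import Path
from typing import List, Optional, Tuple

from rdkit import Chem


def canonicalize(smiles: str) -> Optional[str]:
    """Return the canonical isomeric SMILES, or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for unparsable SMILES
        return None
    return Chem.MolToSmiles(mol, canonical=True, isomericSmiles=True)


def clean_smiles(
    smiles_list: List[str],
    failed_path: str = "outputs/oof/failed_smiles.csv",
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Canonicalize every SMILES and write the ones that fail to a CSV log.

    The (row, smiles) column layout is a hypothetical choice for this sketch.
    """
    cleaned, failed = [], []
    for row, smi in enumerate(smiles_list):
        canonical = canonicalize(smi)
        if canonical is None:
            failed.append((row, smi))
        else:
            cleaned.append(canonical)

    if failed:
        path = Path(failed_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["row", "smiles"])
            writer.writerows(failed)
    return cleaned, failed
```

Two different spellings of the same molecule, such as `OCC` and `C(C)O`, map to one canonical string, which is what makes the canonical SMILES usable as a grouping key later in the pipeline.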
Fingerprint + descriptor feature engineering:
• Morgan fingerprints (2048 bits)
• Truncated SVD for dimensionality reduction
• RDKit descriptors (MolWt, TPSA, LogP)
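As a rough illustration of how these three feature blocks fit together, here is a minimal sketch. The fingerprint radius of 2, the number of SVD components, and the function names are assumptions; the README only fixes the 2048-bit size and the three descriptors. The input is assumed to contain only valid, already cleaned SMILES.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, rdFingerprintGenerator
from sklearn.decomposition import TruncatedSVD

N_BITS = 2048  # fingerprint length stated in this README
RADIUS = 2     # assumption: the README does not state the Morgan radius


def morgan_bits(mol, generator) -> np.ndarray:
    """Convert one molecule into a dense 0/1 vector of length N_BITS."""
    fp = generator.GetFingerprint(mol)
    arr = np.zeros((N_BITS,), dtype=np.uint8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


def build_features(smiles_list, n_components=3, seed=42) -> np.ndarray:
    """Return [SVD-reduced fingerprints | MolWt, TPSA, LogP] for valid SMILES."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=RADIUS, fpSize=N_BITS)
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]

    fingerprints = np.stack([morgan_bits(m, generator) for m in mols]).astype(np.float32)
    descriptors = np.array(
        [[Descriptors.MolWt(m), Descriptors.TPSA(m), Descriptors.MolLogP(m)] for m in mols]
    )

    # Compress the sparse 2048-bit block into a few dense components.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = svd.fit_transform(fingerprints)
    return np.hstack([reduced, descriptors])


demo_smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O", "CCOC"]
X = build_features(demo_smiles)
print(X.shape)  # (number of molecules, n_components + 3 descriptors)
```

In a cross-validated setting the SVD would normally be fitted on the training fold only and then applied to the validation fold; the sketch fits it on all rows purely to stay short.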
Group-aware cross-validation:
• GroupKFold based on canonical SMILES
• Prevents molecule-level leakage
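The leakage argument can be checked directly: when the canonical SMILES is passed as the group key, every copy of a molecule lands on the same side of each split. The sketch below uses scikit-learn's `GroupKFold`; the helper name `grouped_folds` and the toy data are illustrative assumptions, not the contents of src/utils/cv_utils.py.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def grouped_folds(canonical_smiles, n_splits=5):
    """Return (train_idx, val_idx) pairs where no SMILES appears on both sides."""
    groups = np.asarray(canonical_smiles)
    # GroupKFold only needs the number of rows from X, so a placeholder is enough here.
    placeholder_X = np.zeros((len(groups), 1))
    splitter = GroupKFold(n_splits=n_splits)
    return list(splitter.split(placeholder_X, groups=groups))


# Toy data: "CCO" and "CCN" each occur twice, as duplicated measurements might.
smiles = ["CCO", "CCO", "c1ccccc1", "CCN", "CCN", "CCC"]
folds = grouped_folds(smiles, n_splits=3)

for fold, (train_idx, val_idx) in enumerate(folds):
    train_groups = {smiles[i] for i in train_idx}
    val_groups = {smiles[i] for i in val_idx}
    print(fold, sorted(val_groups), "overlap:", train_groups & val_groups)
```

With a plain `KFold`, the two `CCO` rows could be split across train and validation, and the validation score would then partly measure memorization of a molecule the model has already seen.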
Models:
• CatBoostRegressor
• XGBRegressor (XGBoost)
• Optional GNN (PyTorch Geometric)
OOF stacking and ensemble learning:
• Ridge regression as meta-learner
• Optional Optuna-optimized weighted ensemble
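To make the stacking idea concrete, the sketch below fits a Ridge meta-learner on a matrix of out-of-fold predictions, one column per base model. The OOF columns are simulated as the target plus noise, so the numbers say nothing about this repository's actual models; the variable names and `alpha=1.0` are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins for real OOF predictions: target plus model-specific noise.
y = rng.normal(size=200)
oof_catboost = y + rng.normal(scale=0.5, size=200)
oof_xgboost = y + rng.normal(scale=0.8, size=200)

# One row per training sample, one column per base model.
oof_matrix = np.column_stack([oof_catboost, oof_xgboost])

# The meta-learner learns how much to trust each base model.
meta = Ridge(alpha=1.0)
meta.fit(oof_matrix, y)
blend = meta.predict(oof_matrix)


def mse(pred: np.ndarray) -> float:
    """Mean squared error against the synthetic target."""
    return float(np.mean((pred - y) ** 2))


print("weights:", meta.coef_)
print("MSE catboost:", mse(oof_catboost))
print("MSE xgboost:", mse(oof_xgboost))
print("MSE blend:", mse(blend))
```

The blend error here is measured on the same rows the meta-learner was fitted on, so it is optimistic. Because the inputs are out-of-fold predictions, the base models themselves never saw the targets they are being scored against, which is the property that makes stacking on OOF columns safe.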
Reproducibility:
• Fixed random seed
• Cached features
• Saved models and OOF predictions
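A minimal sketch of the first two points, assuming NumPy-based features: one helper that seeds the Python and NumPy random generators, and one that computes a feature array once and reloads it from disk afterwards. The names `set_seed` and `cached` and the cache path are hypothetical, not taken from this repository.

```python
import random
from pathlib import Path
from typing import Callable

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def cached(path, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load an array from `path` if it exists, otherwise compute and save it.

    `path` should end in ".npy"; np.save appends that suffix when it is missing,
    which would make the existence check below miss the file.
    """
    path = Path(path)
    if path.exists():
        return np.load(path)
    array = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, array)
    return array


set_seed(42)
first_draw = np.random.rand(3)
set_seed(42)
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))  # same seed, same numbers
```

Libraries such as CatBoost, XGBoost, and PyTorch keep their own random state, so they would each need their seed set as well, typically through their own seed parameters.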
Run the pipeline on a small synthetic dataset to verify the installation:
python src/train.py --demo
This will:
• Generate a fake dataset
• Train models
• Output results to outputs/
You can easily add:
• New molecular descriptors
• New GNN architectures (GAT, GraphSAGE)
• New meta learners (LightGBM, ElasticNet)
• Bayesian uncertainty estimation
• Multi-task prediction heads