Repellent/Insecticide Modeling Plan With Staged Quantum Descriptors

Summary

Build the project in two stages. V1 is a chemically curated, probability-calibrated repellency model using standard cheminformatics features. V2 adds electronic-structure descriptors after V1 has identified the most informative compounds.
Use the paper’s strategy as the template: feature-tokenized transformer for tabular molecular descriptors, frozen-backbone fine-tuning on small labeled data, calibrated uncertainty with cross-validation plus conformal prediction, and post-hoc interpretability with SHAP/attention.
Do not implement a 3-class softmax model initially. The classes non-repellent, repellent, and insecticidal are not cleanly mutually exclusive, and the LifeChemicals insecticide file is currently an unlabeled screening pool, not supervised insecticide ground truth.
Treat “effectiveness” as a calibrated probability of activity in V1. Only treat effectiveness as potency after real numeric labels such as % repellency, % mortality, LC50, or LD50 are added.

Implementation Changes

Create a reproducible data-curation pipeline that reads the three SDFs, strips salts/counterions, canonicalizes structures, keeps one canonical record per InChIKey, preserves source metadata, and flags malformed/placeholder records in the decoy file for exclusion.
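The deduplication and flagging step can be sketched as follows. This is a minimal, standard-library-only illustration that assumes each record already carries an InChIKey computed upstream after salt stripping and canonicalization; the field names (`inchikey`, `source`, `smiles`), the `qc_status` values, and the helper `dedupe_by_inchikey` are hypothetical and not taken from the repository.

```python
from collections import OrderedDict


def dedupe_by_inchikey(records):
    """Keep one record per InChIKey and merge the source metadata.

    Records with a missing or empty InChIKey are returned separately so
    they can be flagged for exclusion instead of being silently dropped.
    """
    kept = OrderedDict()
    flagged = []
    for rec in records:
        key = rec.get("inchikey")
        if not key:
            # Assumed QC label for malformed/placeholder structures.
            flagged.append({**rec, "qc_status": "excluded_no_structure"})
            continue
        if key not in kept:
            kept[key] = {**rec, "sources": [rec["source"]], "qc_status": "ok"}
        elif rec["source"] not in kept[key]["sources"]:
            # Same structure seen in another SDF: preserve provenance.
            kept[key]["sources"].append(rec["source"])
    return list(kept.values()), flagged
```

A structure that appears in both the repellent and the decoy file ends up with two entries in `sources`, which makes label conflicts easy to audit before training.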
Define the first supervised endpoint as repellent_active, labeled 1 for compounds from the repellent SDF and 0 for curated non-repellent decoys. Report non-repellent as the complement of this probability, not as a separate head.
Restrict the first production model to the dominant assay scope: Aedes aegypti repellency. Keep the other species as metadata for later multi-domain expansion, not mixed into the first supervised target.
Build V1 with two model families trained on the same scaffold split: a strong baseline using Morgan fingerprints (radius=2, 2048 bits) plus RDKit 2D descriptors, and an FTTransformer over curated tabular features. Select the winner by scaffold-split PR-AUC, ROC-AUC, MCC, Brier score, and calibration.
Pretrain the FTTransformer backbone on all curated molecules from the three SDFs using self-supervised masked-feature reconstruction. Freeze the backbone and fine-tune only the head on the labeled repellency task, mirroring the paper’s transfer-learning approach.
Export model outputs in a single prediction table with: compound_id, canonical_smiles, scaffold_id, p_repellent, p_non_repellent=1-p_repellent, prediction_interval, ood_score, and split_id.
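A light validation pass over each exported row might look like the sketch below. The column names come from the list above; the function name `validate_prediction_row`, the tolerance, and the choice to represent a row as a plain dict are assumptions made for illustration.

```python
import math

REQUIRED_COLUMNS = (
    "compound_id", "canonical_smiles", "scaffold_id", "p_repellent",
    "p_non_repellent", "prediction_interval", "ood_score", "split_id",
)


def validate_prediction_row(row):
    """Return a list of problems found in one prediction row (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    if problems:
        return problems
    p = row["p_repellent"]
    if not 0.0 <= p <= 1.0:
        problems.append("p_repellent outside [0, 1]")
    # p_non_repellent is defined as the complement, not a separate model head.
    if not math.isclose(row["p_non_repellent"], 1.0 - p, abs_tol=1e-9):
        problems.append("p_non_repellent is not 1 - p_repellent")
    return problems
```

Running this check at export time keeps the complement relationship from drifting if the table is later edited or merged by hand.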
Add conformal uncertainty as a first-class output. Use scaffold-aware 5-fold cross-validation to produce out-of-fold predictions, then fit probability calibration and conformal intervals on those predictions so that each molecule gets both a probability and an uncertainty band.
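One common way to realize this for a binary classifier is split-style conformal prediction over the out-of-fold probabilities, which yields a prediction set per molecule rather than a numeric interval. The sketch below is an illustration of that variant only, not necessarily the method used in the referenced paper; the function names and the default `alpha` are assumptions.

```python
import math


def conformal_threshold(oof_p_repellent, labels, alpha=0.1):
    """Nonconformity threshold for a target coverage of 1 - alpha.

    The nonconformity score is 1 minus the probability assigned to the
    true class, computed on out-of-fold predictions.
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p)
        for p, y in zip(oof_p_repellent, labels)
    )
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: keep every label
    return scores[rank - 1]


def prediction_set(p_repellent, q_hat):
    """Labels whose nonconformity score does not exceed the threshold."""
    labels = []
    if 1.0 - p_repellent <= q_hat:
        labels.append("repellent")
    if p_repellent <= q_hat:  # score for non_repellent is 1 - (1 - p) = p
        labels.append("non_repellent")
    return labels
```

A set containing both labels signals an uncertain molecule, and an empty set signals a prediction unlike anything in the calibration data; either case is a reasonable trigger for the uncertainty-driven selection used later in the plan.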
Create a quantum-descriptor stage for V2 that runs after V1 and only on a selected subset. The default subset is 300 labeled molecules chosen for class balance and scaffold diversity (150 repellent, 150 non-repellent) plus 200 unlabeled LifeChemicals molecules: the 100 with the highest p_repellent and the 100 with the highest uncertainty.
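The unlabeled half of that selection can be sketched as below. The record fields (`compound_id`, `p_repellent`, `uncertainty`) and the helper name are hypothetical, and the rule that a molecule already picked for high probability is skipped in the uncertainty pass is an assumption added so the result always contains distinct compounds.

```python
def select_unlabeled_subset(pool, n_top=100, n_uncertain=100):
    """Pick the highest-probability molecules, then the most uncertain
    ones among those not already chosen."""
    by_prob = sorted(pool, key=lambda r: r["p_repellent"], reverse=True)
    top = by_prob[:n_top]
    chosen = {r["compound_id"] for r in top}
    # Exclude already-selected compounds so the two lists never overlap.
    remaining = [r for r in pool if r["compound_id"] not in chosen]
    by_unc = sorted(remaining, key=lambda r: r["uncertainty"], reverse=True)
    return top + by_unc[:n_uncertain]
```

With the defaults this returns up to 200 distinct molecules; if the pool is smaller than that, it simply returns everything available.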
Use GFN2-xTB as the broad quantum layer. For each selected molecule, generate conformers with ETKDG, MMFF-optimize, keep the lowest-energy conformer, optimize with xTB, and extract at minimum: HOMO, LUMO, gap, dipole, total energy, partial charge mean/std/max, polarizability if available, and simple frontier-orbital-derived reactivity indices.
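The "simple frontier-orbital-derived reactivity indices" can be computed from the HOMO and LUMO energies alone. The sketch below uses the usual Koopmans-type approximations (ionization potential ≈ -E_HOMO, electron affinity ≈ -E_LUMO). Conventions for chemical hardness differ by a factor of two in the literature; this example assumes hardness = gap / 2, and the function name and returned keys are illustrative.

```python
def reactivity_indices(e_homo_ev, e_lumo_ev):
    """Frontier-orbital reactivity indices from orbital energies in eV.

    Assumed convention: hardness = (E_LUMO - E_HOMO) / 2.
    """
    gap = e_lumo_ev - e_homo_ev
    if gap <= 0:
        raise ValueError("LUMO energy must be above HOMO energy")
    chemical_potential = (e_homo_ev + e_lumo_ev) / 2.0
    hardness = gap / 2.0
    return {
        "gap": gap,
        "chemical_potential": chemical_potential,
        "electronegativity": -chemical_potential,
        "hardness": hardness,
        "softness": 1.0 / hardness,
        # Electrophilicity index: mu^2 / (2 * eta).
        "electrophilicity": chemical_potential ** 2 / (2.0 * hardness),
    }
```

Whatever convention is chosen should be recorded alongside the method in quantum_descriptors.parquet so that xTB- and DFT-derived indices stay comparable.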
Use DFT only as a validation/high-fidelity layer on a smaller subset of 100 molecules drawn from the xTB set with class balance and chemical diversity. Run single-point DFT on the xTB-optimized geometry to benchmark whether xTB descriptors are directionally reliable before expanding DFT usage.
Train V2 as an ablation study, not a blind replacement. Compare three feature sets on the same labeled quantum subset: cheminformatics-only, quantum-only, and fused features. Promote the fused model only if it improves scaffold-split discrimination and calibration.
Keep the LifeChemicals library unlabeled in training. Use it for self-supervised pretraining and later screening/ranking only. Do not train an insecticidal head until external insecticide assay labels are added from ChEMBL, PubChem, or literature.
Reserve the later V3 interface now: once insecticide labels exist, extend the model to a multitask setup with independent outputs p_repellent and p_insecticidal, each with its own calibration and uncertainty, instead of collapsing them into one multiclass label.

Public Interfaces and Artifacts

Standardize on four project artifacts: curated_molecules.parquet, split_manifest.json, model_predictions.parquet, and quantum_descriptors.parquet.
curated_molecules.parquet should contain identifiers, canonical structure fields, source dataset, label fields, assay/species metadata, scaffold, and a qc_status column for excluded/problematic molecules.
quantum_descriptors.parquet should key by compound_id and store conformer provenance, method (xtb or dft), charge/multiplicity, convergence status, and the extracted electronic descriptors.
model_predictions.parquet should expose only calibrated probabilities and uncertainty outputs that downstream ranking or simulation selection will consume.

Test Plan

Verify data curation is deterministic: repeated runs produce identical canonical SMILES, label counts, and exclusion counts.
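A cheap way to compare two curation runs is to hash the fields that must not change. The sketch below is one possible approach; the field names `canonical_smiles`, `label`, and `qc_status` and the helper name are assumptions about the curated table rather than its confirmed schema.

```python
import hashlib
import json


def curation_fingerprint(records):
    """Order-independent SHA-256 over canonical SMILES, label, and QC status."""
    rows = sorted(
        json.dumps([r["canonical_smiles"], r.get("label"), r.get("qc_status")])
        for r in records
    )
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
```

Two runs are considered identical when their fingerprints match; a mismatch can then be narrowed down by diffing the sorted rows.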
Verify no leakage: no Bemis-Murcko scaffold appears in both train and validation folds.
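The leakage check reduces to asserting that every scaffold maps to exactly one fold. The sketch below assumes scaffold identifiers have already been computed upstream and are passed in as strings; the greedy fold assignment and both function names are illustrative, not the repository's actual splitter.

```python
from collections import defaultdict


def assign_scaffold_folds(scaffold_ids, n_folds=5):
    """Greedy grouping: largest scaffold groups first, each placed whole
    into the currently smallest fold."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffold_ids):
        groups[scaf].append(idx)
    fold_sizes = [0] * n_folds
    fold_of = {}
    # Sort by size (descending), then by scaffold id for determinism.
    for scaf, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_sizes[target] += len(members)
        for idx in members:
            fold_of[idx] = target
    return [fold_of[i] for i in range(len(scaffold_ids))]


def leaked_scaffolds(scaffold_ids, folds):
    """Return the set of scaffolds that appear in more than one fold."""
    seen = defaultdict(set)
    for scaf, fold in zip(scaffold_ids, folds):
        seen[scaf].add(fold)
    return {s for s, f in seen.items() if len(f) > 1}
```

In a test suite, `leaked_scaffolds` should return an empty set for the folds recorded in split_manifest.json.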
Benchmark V1 against random and majority baselines; require materially better PR-AUC, MCC, and Brier score.
Check probability calibration with reliability plots and expected calibration error; reject any model with good AUC but poor calibration.
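Both calibration metrics are simple to compute directly from the out-of-fold probabilities. The sketch below uses equal-width bins for the expected calibration error; the bin count and function names are assumptions, and other binning schemes (for example equal-mass bins) are equally valid.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins over the predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's calibration gap by its share of the data.
        ece += len(members) / n * abs(mean_p - frac_pos)
    return ece
```

A model that predicts 0.9 for molecules that are all inactive scores an ECE of 0.9 here, which is the kind of case the rejection rule above is meant to catch even when ranking metrics look good.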
For V2, evaluate gain only on the same labeled subset with quantum descriptors; require improvement in at least one ranking metric and one calibration metric before adopting fused features.
For the DFT benchmark subset, require that xTB-derived HOMO/LUMO gap rankings correlate with the DFT rankings strongly enough, by a rank-correlation threshold fixed before the benchmark is run, to justify continued xTB use as the bulk descriptor engine.
Run interpretability on the final selected model and confirm important features are chemically plausible rather than assay/source artifacts.

Assumptions and Defaults

The LifeChemicals insecticide SDF is treated as an unlabeled candidate library until real insecticide activity labels are added.
The first supervised endpoint is Aedes aegypti repellency because it is the dominant labeled target and gives the cleanest initial task.
“Effectiveness” in the first release means calibrated probability of repellency, not potency. Potency modeling is deferred until numeric assay labels exist.
The decoy SDF contains some malformed or placeholder structures; those will be excluded during curation rather than force-fit into training.
The default quantum workflow is xTB first, DFT second, and quantum descriptors are used to improve a screened subset before any attempt to scale them to the full library.
