This repository contains the official code release for our ICLR 2025 paper "LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation", by Subhojyoti Khastagir*, Kishalay Das*, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly.
CrysLLMGen introduces a hybrid approach to generating the 3D structure of crystal materials. Its key contributions are:
- Hybrid LLM + Diffusion Framework: Integrates LLMs for discrete predictions with equivariant diffusion models for continuous structural refinement.
- Two-Stage Generation: LLM proposes atom types, coordinates, and lattice; diffusion model refines them for stability and physical validity.
- Constraint-Aware Design: Supports conditional generation based on user-defined composition, space group, and natural-language prompts.
- Balanced Validity & Novelty: Achieves superior stability, structural correctness, and compositional validity compared to standalone LLMs or diffusion models.
- Architecture-Agnostic: Framework can seamlessly incorporate future LLMs and denoising architectures.
The dependencies are listed in requirements.txt, which was generated with pipreqs. Install them with:
pip install -r requirements.txt
However, some ad-hoc dependencies may not have been captured.
If you encounter a missing package, install it manually with pip install.
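Since some packages may be missing after installation, it can help to list them up front instead of discovering them one failed import at a time. The following is a minimal sketch, not part of the repository: the helper name `missing_requirements` is hypothetical, and it assumes requirements.txt uses the plain `name==version` lines that pipreqs normally writes.

```python
import re
from importlib import metadata
from pathlib import Path


def missing_requirements(requirements_file="requirements.txt"):
    """Return the distribution names listed in the file that are not installed."""
    missing = []
    for raw_line in Path(requirements_file).read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        # Keep only the distribution name; drop version specifiers, extras, markers.
        name = re.split(r"[\s<>=!~\[;]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_requirements()` from the repository root returns a list of names that can be passed to pip install. It only covers packages named in requirements.txt, so dependencies that pipreqs did not capture will still surface as import errors at run time.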
For Perov-5
python -W ignore llm_finetune.py \
--run-name 7b-perov \
--model 7b \
--num-epochs 1 \
--data-path data/perov_5
For MP-20
python -W ignore llm_finetune.py \
--run-name 7b-mp \
--model 7b \
--num-epochs 1 \
--data-path data/mp_20
Output: The fine-tuned LLM will be saved in:
exp/7b-perov/ (Perov-5)
exp/7b-mp/ (MP-20)
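The sampling examples in this README point at subdirectories such as exp/7b-mp/checkpoint-27136, which suggests that each run directory holds `checkpoint-<step>` folders. Under that assumption, the sketch below picks the folder with the highest step number. The helper name `latest_checkpoint` is hypothetical, and whether the highest step is the checkpoint you want is your call.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(run_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        # Compare steps numerically: as strings, "9999" would sort after "11356".
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

For example, `latest_checkpoint("exp/7b-mp")` returns a path that can be passed as --model_path when sampling.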
For Perov-5
python -W ignore diff_train.py \
--dataset perov_5 \
--batch_size 512 \
--epochs 500 \
--timesteps 1000 \
--run-type train
For MP-20
python -W ignore diff_train.py \
--dataset mp_20 \
--batch_size 512 \
--epochs 500 \
--timesteps 1000 \
--run-type train
Output: The trained diffusion model will be saved at:
out/<Dataset>/<expt_date>/<expt_time>/
Where <Dataset> is either perov_5 or mp_20.
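Because each training run lands in its own date and time directory, finding the right path after several runs can be tedious. The sketch below returns the most recently modified run directory for a dataset. The helper name `latest_diffusion_run` is hypothetical, and relying on modification time (rather than parsing the directory names, whose exact format is not documented here) is an assumption.

```python
from pathlib import Path


def latest_diffusion_run(out_root, dataset):
    """Return the newest out_root/<dataset>/<expt_date>/<expt_time>/ directory, or None."""
    dataset_dir = Path(out_root) / dataset
    if not dataset_dir.is_dir():
        return None
    # Two levels below the dataset directory: <expt_date>/<expt_time>.
    runs = [p for p in dataset_dir.glob("*/*") if p.is_dir()]
    # Assumption: the most recently modified directory is the latest run.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

For example, `latest_diffusion_run("out", "mp_20")` returns a candidate directory for the diffusion checkpoint, or None if no run exists yet.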
Use the correct --model_path and --diff_steps depending on the dataset.
python -W ignore crysllmgen_sample.py \
--model_name 7b \
--model_path <LLM_CHECKPOINT_PATH> \
--chkpt_name <DIFFUSION_CHECKPOINT_PATH> \
--num_samples 10000 \
--dataset <mp | perov> \
--temperature 1.0 \
--top_p 0.7 \
--diff_steps <700 | 800> \
--run-type sample \
--out-prefix "Crysllmgen_sample" \
--batch_size 128
For MP-20: --model_path exp/7b-mp/checkpoint-27136 --dataset mp --diff_steps 800
For Perov-5: --model_path exp/7b-perov/checkpoint-11356 --dataset perov --diff_steps 700
Generated samples are saved as .pt files:
Crysllmgen_sample_mp_10000.pt
Crysllmgen_sample_perov_10000.pt
- <DIFFUSION_CHECKPOINT_PATH> should point to: out/<Dataset>/<expt_date>/<expt_time>/
- You can adjust --temperature and --top_p to balance diversity and generation quality.
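Since --model_path, --dataset, and --diff_steps must change together, one way to avoid mismatches is to keep the per-dataset values in a single table and build the command from it. This is a minimal sketch: the names `SAMPLING_PRESETS` and `build_sampling_command` are hypothetical, the checkpoint paths are the ones quoted in this README (yours may differ), and the flags are copied from the sampling command without checking them against the script.

```python
import sys

# Dataset-specific values as quoted in this README; adjust to your own checkpoints.
SAMPLING_PRESETS = {
    "mp": {"model_path": "exp/7b-mp/checkpoint-27136", "diff_steps": 800},
    "perov": {"model_path": "exp/7b-perov/checkpoint-11356", "diff_steps": 700},
}


def build_sampling_command(dataset, diffusion_checkpoint, num_samples=10000,
                           temperature=1.0, top_p=0.7, batch_size=128):
    """Return the crysllmgen_sample.py invocation as an argument list."""
    if dataset not in SAMPLING_PRESETS:
        raise ValueError(f"dataset must be one of {sorted(SAMPLING_PRESETS)}")
    preset = SAMPLING_PRESETS[dataset]
    return [
        sys.executable, "-W", "ignore", "crysllmgen_sample.py",
        "--model_name", "7b",
        "--model_path", preset["model_path"],
        "--chkpt_name", str(diffusion_checkpoint),
        "--num_samples", str(num_samples),
        "--dataset", dataset,
        "--temperature", str(temperature),
        "--top_p", str(top_p),
        "--diff_steps", str(preset["diff_steps"]),
        "--run-type", "sample",
        "--out-prefix", "Crysllmgen_sample",
        "--batch_size", str(batch_size),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a shell string avoids quoting problems in checkpoint paths.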
After sampling, evaluate the generated structures using:
For Perov-5
python -W ignore compute_metrics.py \
--root_path <PT_FILE_PATH> \
--tasks gen \
--eval_model_name perovskite \
--gt_file data/perov_5/test.csv
For MP-20
python -W ignore compute_metrics.py \
--root_path <PT_FILE_PATH> \
--tasks gen \
--eval_model_name mp20 \
--gt_file data/mp_20/test.csv
<PT_FILE_PATH> should be the directory containing:
Crysllmgen_sample_mp_10000.pt
Crysllmgen_sample_perov_10000.pt
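A common slip is to pass a directory that does not contain the expected sample file, or to pair one dataset's samples with the other's ground-truth file. The sketch below checks for the file first and then assembles the matching evaluation command. The names `EVAL_PRESETS` and `build_eval_command` are hypothetical, and the file names assume the default prefix and 10000 samples used in this README.

```python
import sys
from pathlib import Path

# Per-dataset evaluation settings as quoted in this README.
EVAL_PRESETS = {
    "perov": {
        "eval_model_name": "perovskite",
        "gt_file": "data/perov_5/test.csv",
        "sample_file": "Crysllmgen_sample_perov_10000.pt",
    },
    "mp": {
        "eval_model_name": "mp20",
        "gt_file": "data/mp_20/test.csv",
        "sample_file": "Crysllmgen_sample_mp_10000.pt",
    },
}


def build_eval_command(dataset, root_path):
    """Check that root_path holds the sample file, then return the compute_metrics.py command."""
    preset = EVAL_PRESETS[dataset]
    sample = Path(root_path) / preset["sample_file"]
    if not sample.is_file():
        raise FileNotFoundError(f"expected sample file not found: {sample}")
    return [
        sys.executable, "-W", "ignore", "compute_metrics.py",
        "--root_path", str(root_path),
        "--tasks", "gen",
        "--eval_model_name", preset["eval_model_name"],
        "--gt_file", preset["gt_file"],
    ]
```

As with sampling, the returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root.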
For any questions, please contact Kishalay Das (kishalaydas@kgpian.iitkgp.ac.in).
If you use CrysLLMGen or our textual dataset, please cite:
@article{khastagir2025llm,
title={LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation},
author={Khastagir, Subhojyoti and Das, Kishalay and Goyal, Pawan and Lee, Seung-Cheol and Bhattacharjee, Satadeep and Ganguly, Niloy},
journal={arXiv preprint arXiv:2510.23040},
year={2025}
}
