This repository provides a comprehensive workflow for fine-tuning Sentence Transformer models using PyTorch, Hugging Face datasets, and transformers. It is designed for multi-GPU training using torchrun and supports features like dataset pre-tokenization, Weights & Biases logging, and custom learning rate schedulers.
These workflows produce INDUS-SDE-ST, the semantic-discovery model from our paper:
INDUS-SDE: A Language Model for Scientific Content Curation and Discovery. Pantha et al., KDD 2026, AI for Sciences Track. DOI: 10.1145/3770855.3818847
INDUS-SDE is a domain-adapted encoder pretrained with Weighted Dynamic Masking on NASA's Science Discovery Engine (SDE) corpus (pretraining code: NASA-IMPACT/mlm-fine-tuning). INDUS-SDE-ST — built here — fine-tunes that encoder into a sentence transformer for semantic retrieval over heterogeneous, web-sourced scientific content.
| Artifact | Hugging Face |
|---|---|
| Sentence transformer | nasa-impact/indus-sde-st-v0.2 |
| Binary (EQAT) embeddings | nasa-impact/indus-sde-st-equat-v0.1 |
| Base encoder (INDUS-SDE) | nasa-impact/indus-sde-v0.2 |
| NASA SDE IR benchmark | nasa-impact/nasa-sde-IR-benchmark-20251024-v5 |
| Pretraining (MLM / WDM) code | NASA-IMPACT/mlm-fine-tuning |
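
A minimal usage sketch for the released sentence transformer, assuming the checkpoint loads through the standard sentence-transformers `SentenceTransformer` class (this README does not show inference code, so treat the loading path as an assumption). The helper names `rank_by_cosine` and `search` are hypothetical and exist only for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm = math.sqrt(sum(float(x) ** 2 for x in a)) * math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_emb, doc_embs):
    """Return (indices, scores) sorted from most to least similar to the query."""
    scores = [cosine(query_emb, d) for d in doc_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, [scores[i] for i in order]


def search(query, docs, model_name="nasa-impact/indus-sde-st-v0.2"):
    """Embed a query and documents with the released model, then rank the documents.

    Assumption: the checkpoint is loadable with sentence-transformers.
    The import is local so the ranking helper works without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    doc_embs = model.encode(docs).tolist()
    query_emb = model.encode([query])[0].tolist()
    order, scores = rank_by_cosine(query_emb, doc_embs)
    return [(docs[i], s) for i, s in zip(order, scores)]
```

Calling `search(query, docs)` with a query string and a list of document strings downloads the model from the Hugging Face Hub on first use and returns the documents paired with their similarity scores, best match first.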
- Stage-2 synthetic pair generation (Instructor schema + system prompt) → gen_data_stage2/ · module README
- SDE content-relevancy filtering (Pydantic AI) → gen_data_stage2/filter2_llm_based/
- Stage-2 data prep / CMR–PDS pairs → data_prep/
- Benchmark & per-checkpoint eval dumps → eval/results_json/
On the in-domain NASA SDE IR benchmark (nasa-impact/nasa-sde-IR-benchmark-20251024-v5), INDUS-SDE-ST is the strongest retriever among released models (the comparison reported in the paper):
| Model | MRR@1 | MRR@5 | NDCG@1 | NDCG@5 |
|---|---|---|---|---|
| INDUS-ST | 0.1616 | 0.2131 | 0.1616 | 0.2351 |
| ModernBERT-ST | 0.1765 | 0.2137 | 0.1765 | 0.2288 |
| INDUS-SDE-ST | 0.2343 | 0.3122 | 0.2343 | 0.3445 |
| OpenAI text-embedding-3-small | 0.2034 | 0.2619 | 0.2034 | 0.2870 |
Note: INDUS-SDE-ST is the best in-domain retriever among released models (above). The Stage-2 ablation below compares internal training checkpoints. One unreleased run (dutiful-thunder-110) scores slightly higher on NASA SDE IR by pushing in-domain harder, but at the cost of lower NanoBEIR and NASA SMD IR scores, so the balanced checkpoint (INDUS-SDE-ST) was the one released.
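
For readers unfamiliar with the metrics in the table, the sketch below shows the standard per-query definitions of MRR@k and NDCG@k with binary relevance; benchmark numbers are averages of these values over all queries. It is a generic illustration, not the evaluation code used for the paper, and the function names and toy ranking are hypothetical. With binary relevance, both metrics at k = 1 equal 1 when the top result is relevant and 0 otherwise, which is consistent with the identical MRR@1 and NDCG@1 columns above.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking places every relevant item at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the single relevant document "d3" is retrieved at rank 2.
ranked = ["d7", "d3", "d1", "d9", "d4"]
relevant = {"d3"}
print(mrr_at_k(ranked, relevant, 1), ndcg_at_k(ranked, relevant, 1))
print(mrr_at_k(ranked, relevant, 5), round(ndcg_at_k(ranked, relevant, 5), 4))
```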
Full run-by-run Stage-2 evaluation of internal checkpoints across all three benchmarks.
| SN | Checkpoint | Training approach | NanoBEIR MRR@1 | NanoBEIR MRR@5 | NanoBEIR NDCG@1 | NanoBEIR NDCG@5 | NASA SMD IR MRR@1 | NASA SMD IR MRR@5 | NASA SMD IR NDCG@1 | NASA SMD IR NDCG@5 | NASA SDE IR MRR@1 | NASA SDE IR MRR@5 | NASA SDE IR NDCG@1 | NASA SDE IR NDCG@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | INDUS-ST | Baseline | 0.50 | 0.61 | 0.50 | 0.54 | 0.53 | 0.58 | 0.53 | 0.61 | 0.16 | 0.21 | 0.16 | 0.23 |
| 2 | indus-sde-st-v0.1 | Stage 1 ST | 0.52 | 0.61 | 0.52 | 0.55 | 0.47 | 0.53 | 0.47 | 0.55 | 0.19 | 0.24 | 0.19 | 0.26 |
| Stage 2 Training Experiments (FP32) | ||||||||||||||
| 3 | whole-moon-14 | Trained on full Stage 2 data | 0.40 | 0.51 | 0.40 | 0.45 | 0.38 | 0.45 | 0.38 | 0.47 | 0.17 | 0.23 | 0.17 | 0.26 |
| 4 | atomic-plasma-15 | Stage 2 subset (NASA-SDE + ADS) | 0.54 | 0.63 | 0.54 | 0.56 | 0.48 | 0.54 | 0.48 | 0.57 | 0.19 | 0.25 | 0.19 | 0.28 |
| 5 | peach-night-57 | Stage 2 (No Cite) + weighted sources | 0.46 | 0.57 | 0.46 | 0.51 | 0.45 | 0.51 | 0.45 | 0.54 | 0.23 | 0.30 | 0.23 | 0.34 |
| 6 | INDUS-SDE-ST | Best Stage 2: peach-night-57 + Stage 1 (anti-forgetting) | 0.47 | 0.58 | 0.47 | 0.51 | 0.48 | 0.54 | 0.48 | 0.56 | 0.23 | 0.31 | 0.23 | 0.34 |
| 7 | dutiful-thunder-110 | Stage 2 (No Cite) + S1; NASA-SDE 0.75x, Tanh | 0.43 | 0.55 | 0.43 | 0.49 | 0.43 | 0.50 | 0.43 | 0.52 | 0.26 | 0.34 | 0.26 | 0.37 |
| Binary (1-bit) & EQAT Variants | ||||||||||||||
| 8 | INDUS-SDE-ST-PB | 1-bit post-training binarization of INDUS-SDE-ST | 0.43 | 0.55 | 0.43 | 0.49 | 0.42 | 0.49 | 0.42 | 0.51 | 0.21 | 0.28 | 0.21 | 0.31 |
| 9 | azure-eon-73 | EQAT start: 1-bit, Stage 2 (No Cite), weighted | 0.38 | 0.47 | 0.38 | 0.41 | 0.25 | 0.31 | 0.25 | 0.33 | 0.21 | 0.27 | 0.21 | 0.30 |
| 10 | absurd-snowflake-85 | INDUS-SDE-ST config + 1-bit EQAT | 0.41 | 0.51 | 0.41 | 0.44 | 0.28 | 0.34 | 0.28 | 0.36 | 0.21 | 0.28 | 0.21 | 0.30 |
| 11 | INDUS-SDE-ST-EQAT | Best EQAT: Stage 2 (No Cite) + S1; NASA-SDE 0.75x | 0.41 | 0.51 | 0.41 | 0.44 | 0.32 | 0.38 | 0.32 | 0.40 | 0.22 | 0.29 | 0.22 | 0.31 |
| 12 | eternal-energy-94 | ST-EQAT config, NASA-SDE removed | 0.40 | 0.51 | 0.40 | 0.44 | 0.30 | 0.37 | 0.30 | 0.39 | 0.10 | 0.13 | 0.10 | 0.14 |
| 13 | wandering-snowflake-98 | ST-EQAT config, weight_decay 0.1 | 0.41 | 0.51 | 0.41 | 0.45 | 0.29 | 0.36 | 0.29 | 0.38 | 0.21 | 0.28 | 0.21 | 0.31 |
| 14 | dutiful-thunder-PB | 1-bit post-binarization of thunder-110 (Tanh) | 0.41 | 0.51 | 0.41 | 0.45 | 0.37 | 0.42 | 0.37 | 0.44 | 0.23 | 0.31 | 0.23 | 0.34 |
INDUS-SDE-ST is the best Stage-2 FP32 model; INDUS-SDE-ST-EQAT is the deployed 1-bit variant (10–16× storage reduction).
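
To illustrate what 1-bit embeddings involve, the following sketch shows plain post-training sign binarization with Hamming-distance comparison. It is a generic illustration, not the repository's EQAT or post-binarization code; thresholding at zero, the helper names, and the toy vectors are all assumptions. The storage ratio a deployment actually sees depends on the baseline precision and on what else is stored alongside the vectors, so this sketch does not attempt to reproduce the 10–16× figure.

```python
def binarize(vec):
    """Pack the sign pattern of a float vector into one Python int (1 bit per dimension)."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


# Toy 4-dimensional embeddings (hypothetical values).
query = [0.8, -0.1, 0.3, -0.7]
docs = {
    "close": [0.5, -0.2, 0.1, -0.9],
    "far": [-0.6, 0.4, -0.3, 0.2],
}

q_code = binarize(query)
# Smaller Hamming distance means more similar sign patterns.
ranking = sorted(docs, key=lambda name: hamming(q_code, binarize(docs[name])))
print(ranking)
```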
Before you begin, ensure you have the following installed and configured:
- Git: To clone the repository.
- Python 3.8+: The programming language used.
- uv: A fast Python package installer and resolver, used for setting up the project environment. You can install it with pip install uv.
- Kaggle Account & API Token: Required for downloading the arxiv dataset. Make sure you have your kaggle.json file set up.
- Hugging Face Account & Token: Required for accessing models and datasets from the Hugging Face Hub, especially private ones.
- Weights & Biases Account & API Key: For logging training metrics and model checkpoints.
Follow these steps to set up your project environment.
First, clone the project from GitHub:
git clone https://github.com/NASA-IMPACT/st-training-workflow
cd st-training-workflow

This project uses uv to manage dependencies. To create the virtual environment and install the required packages from requirements.txt, run the following commands from the root of the repository:

uv venv
uv pip install -r requirements.txt

Activate the environment with:

source .venv/bin/activate

Most datasets are downloaded automatically from the Hugging Face Hub during the training process. However, one of the datasets used in Stage 2 training (arxiv) must be downloaded manually from Kaggle.
From the root of the repository, create the necessary directories for the raw data. The training script expects the data to be in a data_prep directory located outside the model_exploration folder.
mkdir -p data_prep/raw/

Use the Kaggle API to download the dataset and unzip it into the data_prep/raw/ directory.
# Note: Ensure your Kaggle API token is configured correctly
curl -L -o data_prep/raw/arxiv.zip https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv
# Unzip the contents into the raw data directory
unzip data_prep/raw/arxiv.zip -d data_prep/raw/

This will place the arxiv-metadata-oai-snapshot.json file where the script model_exploration/utils.py expects to find it.
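
As a quick sanity check after unzipping, the snapshot can be inspected without loading it fully into memory. The sketch below assumes the file is in JSON Lines form (one JSON record per line), which is how the Kaggle arXiv metadata snapshot is distributed. The helper name `peek_records` is hypothetical and is not part of model_exploration/utils.py.

```python
import json
from itertools import islice
from pathlib import Path


def peek_records(path, n=3):
    """Parse up to the first n lines of a JSON Lines file and return the records."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in islice(handle, n):
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Path relative to the repository root, matching the download step above.
snapshot = Path("data_prep/raw/arxiv-metadata-oai-snapshot.json")
if snapshot.exists():
    for record in peek_records(snapshot):
        # .get() avoids a KeyError if a field is absent from a record.
        print(record.get("id"), "|", record.get("title"))
else:
    print(f"Not found: {snapshot} (run the download step from the repository root)")
```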
To manage secrets and important configuration, create a .env file inside the model_exploration directory. This is where you will store your API keys and other environment variables.
cd model_exploration

Create a file named .env and add the following variables.
# .env file in the 'model_exploration' directory
# W&B: Set to "end" to upload the final model checkpoint as a W&B artifact.
export WANDB_LOG_MODEL="end"
# W&B: Your Weights & Biases API key for logging.
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
# Hugging Face: Your token for accessing models/datasets from the HF Hub.
export HUGGINGFACE_TOKEN="YOUR_HUGGINGFACE_TOKEN"

Replace "YOUR_WANDB_API_KEY" and "YOUR_HUGGINGFACE_TOKEN" with your actual credentials.
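
Because the file uses `export KEY="value"` lines, it can be loaded into a shell session with `source .env`. How st_trainer_ddp_hf.py itself reads these variables is not shown here, so the following is only a sketch of a pre-flight check: it parses the same `export` syntax and reports which of the three variables are missing or still hold a placeholder. The function names are hypothetical.

```python
import re

REQUIRED = ("WANDB_LOG_MODEL", "WANDB_API_KEY", "HUGGINGFACE_TOKEN")
_LINE = re.compile(r'^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$')


def parse_env(text):
    """Parse simple KEY=value or export KEY="value" lines; skip comments and blanks."""
    values = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            key, raw = match.groups()
            values[key] = raw.strip('"').strip("'")
    return values


def missing_or_placeholder(values, required=REQUIRED):
    """Return required keys that are absent, empty, or still set to a YOUR_... placeholder."""
    return [k for k in required if not values.get(k) or values[k].startswith("YOUR_")]


example = '''
# .env file in the 'model_exploration' directory
export WANDB_LOG_MODEL="end"
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
'''
print(missing_or_placeholder(parse_env(example)))
```

To check a real file, pass its contents instead of the example string, for instance `parse_env(open(".env", encoding="utf-8").read())`.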
The training is initiated using torchrun for distributed data parallel (DDP) training across multiple GPUs.
Ensure you are in the model_exploration directory before running the command.
# Example training command
torchrun --nproc_per_node=auto st_trainer_ddp_hf.py \
--model_name "nasa-impact/indus-sde-st-v0.1" \
--num_train_epochs 1 \
--batch_size 32 \
--gradient_accumulation_steps 4 \
--lr 2e-5 \
--eval_and_save_steps 1000 \
--output_base "../training_output"

You can customize the training run using various command-line arguments. Here are some of the key options available in st_trainer_ddp_hf.py:
| Argument | Type | Default | Description |
|---|---|---|---|
| --nrows | int | None | Number of rows to use from each dataset (for quick testing). |
| --n_data_src | int | None | Limit the number of data sources for training. |
| --model_name | str | nasa-impact/indus-sde-st-v0.1 | The base Sentence Transformer model to fine-tune from the Hugging Face Hub. |
| --output_base | str | tmp_models | The base directory where training outputs and checkpoints will be saved. |
| --num_train_epochs | int | 1 | The total number of training epochs to perform. |
| --batch_size | int | 64 | The batch size per device (GPU) for training. |
| --lr | float | 2e-5 | The initial learning rate for the AdamW optimizer. |
| --warmup_ratio | float | 0.1 | The proportion of training steps for the learning rate warm-up. |
| --eval_and_save_steps | int | 1000 | The frequency (in steps) to run evaluation and save a model checkpoint. |
| --gradient_accumulation_steps | int | 8 | Number of steps to accumulate gradients before performing an optimizer step. |
| --pretokenize | flag | False | Enable dataset pre-tokenization to speed up training by caching tokenized data. |
| --resume_checkpoint_path | str | None | Path to a checkpoint to resume training from. |
| --resume_run_id | str | None | The Weights & Biases run ID to resume logging to. |
| --custom_lr_scheduler | flag | True | Use the custom learning rate scheduler with cosine decay. |
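
Two pieces of arithmetic follow from these options. The effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs, and --warmup_ratio sets the share of optimizer steps spent ramping up the learning rate. The sketch below works through both, using linear warm-up followed by cosine decay to zero as one common schedule shape. The exact formula of the repository's custom scheduler is not documented here, so the schedule function, its name, the GPU count, and the step total in the example are assumptions.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus):
    """Samples consumed per optimizer step under data-parallel training."""
    return batch_size * gradient_accumulation_steps * num_gpus


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warm-up to base_lr, then cosine decay to 0 (assumed schedule shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Table defaults (batch_size=64, accumulation=8) on a hypothetical 4-GPU node.
print(effective_batch_size(64, 8, 4))
# Values from the example command (batch_size=32, accumulation=4), same 4 GPUs.
print(effective_batch_size(32, 4, 4))

total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

When changing the GPU count, adjusting --batch_size or --gradient_accumulation_steps so that the product stays constant keeps runs comparable.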
- st_trainer_ddp_hf.py: The main entry point for the training script. It handles argument parsing, DDP setup, data loading, trainer initialization, and evaluation.
- utils.py: Contains helper functions for building dataset configurations (build_dataset_configs_s1, build_dataset_configs_s2), loading and caching datasets (load_and_cache_datasets), and preparing evaluation suites (prepare_evaluators).
- distributed.py: Provides utilities for setting up and managing the distributed training environment.
- pretokenize.py: Contains the logic for pre-tokenizing the datasets and caching them to disk to accelerate subsequent training runs.
- requirements.txt: A list of Python packages required to run the code.
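
For orientation, torchrun communicates the process layout to each worker through environment variables (RANK, LOCAL_RANK, and WORLD_SIZE, among others). The sketch below reads them with single-process fallbacks. It illustrates the general pattern only and is not the contents of distributed.py; the dataclass and function names are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistInfo:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (typically the GPU index)
    world_size: int  # total number of processes

    @property
    def is_main_process(self):
        return self.rank == 0


def read_dist_info(env=None):
    """Read the torchrun-provided layout, defaulting to a single process."""
    env = os.environ if env is None else env
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


info = read_dist_info()
if info.is_main_process:
    # Logging and checkpoint writes are commonly restricted to rank 0.
    print(f"world_size={info.world_size}, local_rank={info.local_rank}")
```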
If you use INDUS-SDE, INDUS-SDE-ST, or these workflows in your research, please cite:
@inproceedings{pantha2026indussde,
author = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
title = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
year = {2026},
isbn = {979-8-4007-2259-2},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3770855.3818847},
booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD 2026)},
location = {Jeju Island, Republic of Korea},
series = {KDD '26}
}