Sentence Transformer Training Workflow

This repository provides a workflow for fine-tuning Sentence Transformer models with PyTorch and the Hugging Face datasets and transformers libraries. It targets multi-GPU training with torchrun and supports dataset pre-tokenization, Weights & Biases logging, and a custom learning rate scheduler.

INDUS-SDE-ST — Sentence Transformer for Scientific Content Discovery

These workflows produce INDUS-SDE-ST, the semantic-discovery model from our paper:

INDUS-SDE: A Language Model for Scientific Content Curation and Discovery. Pantha et al., KDD 2026, AI for Sciences Track. DOI: 10.1145/3770855.3818847

INDUS-SDE is a domain-adapted encoder pretrained with Weighted Dynamic Masking on NASA's Science Discovery Engine (SDE) corpus (pretraining code: NASA-IMPACT/mlm-fine-tuning). INDUS-SDE-ST — built here — fine-tunes that encoder into a sentence transformer for semantic retrieval over heterogeneous, web-sourced scientific content.

Models & data

Artifact	Hugging Face
Sentence transformer	`nasa-impact/indus-sde-st-v0.2`
Binary (EQAT) embeddings	`nasa-impact/indus-sde-st-equat-v0.1`
Base encoder (INDUS-SDE)	`nasa-impact/indus-sde-v0.2`
NASA SDE IR benchmark	`nasa-impact/nasa-sde-IR-benchmark-20251024-v5`
Pretraining (MLM / WDM) code	`NASA-IMPACT/mlm-fine-tuning`
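
As a rough illustration of how the released sentence transformer might be used for retrieval, the sketch below embeds a query and a few documents and ranks the documents by cosine similarity. It assumes the sentence-transformers package is installed and that the model ID from the table loads through `SentenceTransformer`; the helper names (`embed`, `cosine`, `rank`) are hypothetical and not part of this repository.

```python
import math
from typing import List, Sequence, Tuple


def embed(texts: Sequence[str],
          model_name: str = "nasa-impact/indus-sde-st-v0.2") -> List[List[float]]:
    """Encode texts into dense vectors (downloads the model on first use)."""
    # Imported lazily so the ranking helpers below work without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts)).tolist()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A typical call would be `vectors = embed([query] + documents)` followed by `rank(vectors[0], vectors[1:])`. For large collections a vector index would normally replace the pure-Python loop.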

Where the paper's data pipeline lives

Stage-2 synthetic pair generation (Instructor schema + system prompt) → gen_data_stage2/ · module README
SDE content-relevancy filtering (Pydantic AI) → gen_data_stage2/filter2_llm_based/
Stage-2 data prep / CMR–PDS pairs → data_prep/
Benchmark & per-checkpoint eval dumps → eval/results_json/

NASA SDE IR — headline result (vs. released models)

On the in-domain NASA SDE IR benchmark (nasa-impact/nasa-sde-IR-benchmark-20251024-v5), INDUS-SDE-ST is the strongest retriever among released models (the comparison reported in the paper):

Model	MRR@1	MRR@5	NDCG@1	NDCG@5
INDUS-ST	0.1616	0.2131	0.1616	0.2351
ModernBERT-ST	0.1765	0.2137	0.1765	0.2288
INDUS-SDE-ST	0.2343	0.3122	0.2343	0.3445
OpenAI text-embedding-3-small	0.2034	0.2619	0.2034	0.2870

Note: The comparison above covers released models only. The Stage-2 ablation below compares internal training checkpoints. One unreleased run (dutiful-thunder-110) scores higher on NASA SDE IR (MRR@1 0.26 vs. 0.23) but lower on NanoBEIR (0.43 vs. 0.47) and NASA SMD IR (0.43 vs. 0.48), so the more balanced checkpoint, INDUS-SDE-ST, was the one released.
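
For readers unfamiliar with the metrics in these tables, the following is a minimal sketch of per-query MRR@k and NDCG@k, assuming binary relevance labels. It is not the evaluation code used for the paper, and the function names are hypothetical. Reported scores would be these per-query values averaged over all queries.

```python
import math
from typing import List, Set


def mrr_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Under binary relevance the two metrics coincide at k = 1 (both are 1 when the top result is relevant and 0 otherwise), which is consistent with the identical MRR@1 and NDCG@1 columns in the tables.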

Stage-2 checkpoint ablation (extended results)

Full run-by-run Stage-2 evaluation of internal checkpoints across all three benchmarks.

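SN	Checkpoint	Training approach	NanoBEIR MRR@1	NanoBEIR MRR@5	NanoBEIR NDCG@1	NanoBEIR NDCG@5	NASA SMD IR MRR@1	NASA SMD IR MRR@5	NASA SMD IR NDCG@1	NASA SMD IR NDCG@5	NASA SDE IR MRR@1	NASA SDE IR MRR@5	NASA SDE IR NDCG@1	NASA SDE IR NDCG@5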
1	INDUS-ST	Baseline	0.50	0.61	0.50	0.54	0.53	0.58	0.53	0.61	0.16	0.21	0.16	0.23
2	indus-sde-st-v0.1	Stage 1 ST	0.52	0.61	0.52	0.55	0.47	0.53	0.47	0.55	0.19	0.24	0.19	0.26
Stage 2 Training Experiments (FP32)
3	whole-moon-14	Trained on full Stage 2 data	0.40	0.51	0.40	0.45	0.38	0.45	0.38	0.47	0.17	0.23	0.17	0.26
4	atomic-plasma-15	Stage 2 subset (NASA-SDE + ADS)	0.54	0.63	0.54	0.56	0.48	0.54	0.48	0.57	0.19	0.25	0.19	0.28
5	peach-night-57	Stage 2 (No Cite) + weighted sources	0.46	0.57	0.46	0.51	0.45	0.51	0.45	0.54	0.23	0.30	0.23	0.34
6	INDUS-SDE-ST	Best Stage 2: peach-night-57 + Stage 1 (anti-forgetting)	0.47	0.58	0.47	0.51	0.48	0.54	0.48	0.56	0.23	0.31	0.23	0.34
7	dutiful-thunder-110	Stage 2 (No Cite) + S1; NASA-SDE 0.75x, Tanh	0.43	0.55	0.43	0.49	0.43	0.50	0.43	0.52	0.26	0.34	0.26	0.37
Binary (1-bit) & EQAT Variants
8	INDUS-SDE-ST-PB	1-bit post-training binarization of INDUS-SDE-ST	0.43	0.55	0.43	0.49	0.42	0.49	0.42	0.51	0.21	0.28	0.21	0.31
9	azure-eon-73	EQAT start: 1-bit, Stage 2 (No Cite), weighted	0.38	0.47	0.38	0.41	0.25	0.31	0.25	0.33	0.21	0.27	0.21	0.30
10	absurd-snowflake-85	INDUS-SDE-ST config + 1-bit EQAT	0.41	0.51	0.41	0.44	0.28	0.34	0.28	0.36	0.21	0.28	0.21	0.30
11	INDUS-SDE-ST-EQAT	Best EQAT: Stage 2 (No Cite) + S1; NASA-SDE 0.75x	0.41	0.51	0.41	0.44	0.32	0.38	0.32	0.40	0.22	0.29	0.22	0.31
12	eternal-energy-94	ST-EQAT config, NASA-SDE removed	0.40	0.51	0.40	0.44	0.30	0.37	0.30	0.39	0.10	0.13	0.10	0.14
13	wandering-snowflake-98	ST-EQAT config, weight_decay 0.1	0.41	0.51	0.41	0.45	0.29	0.36	0.29	0.38	0.21	0.28	0.21	0.31
14	dutiful-thunder-PB	1-bit post-binarization of thunder-110 (Tanh)	0.41	0.51	0.41	0.45	0.37	0.42	0.37	0.44	0.23	0.31	0.23	0.34

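INDUS-SDE-ST is the Stage-2 FP32 checkpoint selected as the best overall trade-off across the three benchmarks (dutiful-thunder-110 scores higher on NASA SDE IR alone); INDUS-SDE-ST-EQAT is the deployed 1-bit variant (10–16× storage reduction).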
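
To make the 1-bit rows more concrete, here is a minimal sketch of post-training binarization, assuming each embedding component is thresholded at zero and similarity is measured by Hamming distance. The exact binarization and EQAT procedures used for the models above are not described in this README, so the scheme and the function names here are assumptions for illustration only.

```python
from typing import List


def binarize(embedding: List[float]) -> int:
    """Pack an embedding into an int: bit i is 1 when component i is positive."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count differing bits; a smaller distance means more similar codes."""
    return bin(a ^ b).count("1")


def packed_size_bytes(dim: int) -> int:
    """Bytes needed to store one code of `dim` bits."""
    return (dim + 7) // 8


query = binarize([0.2, -0.1, 0.7, -0.4])
doc_a = binarize([0.3, -0.5, 0.1, -0.2])   # same sign pattern as the query
doc_b = binarize([-0.3, 0.5, 0.1, 0.2])    # differs from the query in three positions
print(hamming_distance(query, doc_a), hamming_distance(query, doc_b))
```

Ranking by ascending Hamming distance then plays the role that descending cosine similarity plays for the FP32 embeddings.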

1. Prerequisites

Before you begin, ensure you have the following installed and configured:

Git: To clone the repository.
Python 3.8+: The programming language used.
uv: A fast Python package installer and resolver, used for setting up the project environment. You can install it with pip install uv.
Kaggle Account & API Token: Required for downloading the arXiv dataset. Make sure your kaggle.json file is set up.
Hugging Face Account & Token: Required for accessing models and datasets from the Hugging Face Hub, especially private ones.
Weights & Biases Account & API Key: For logging training metrics and model checkpoints.
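
The checklist above can be partly automated. The sketch below is a hypothetical preflight script (not part of this repository) that checks the Python version, looks for git and uv on the PATH, and looks for kaggle.json in the Kaggle CLI's default location, `~/.kaggle/kaggle.json`. It does not validate any credentials.

```python
import shutil
import sys
from pathlib import Path
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Report which of the listed prerequisites appear to be available."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "git": shutil.which("git") is not None,
        "uv": shutil.which("uv") is not None,
        # Default Kaggle CLI location; adjust if your token lives elsewhere.
        "kaggle.json": (Path.home() / ".kaggle" / "kaggle.json").is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8s}{name}")
```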

2. Setup and Installation

Follow these steps to set up your project environment.

Step 2.1: Clone the Repository

First, clone the project from GitHub:

git clone https://github.com/NASA-IMPACT/st-training-workflow
cd st-training-workflow

Step 2.2: Set Up the Python Environment

This project uses uv to manage dependencies. To create the virtual environment and install the required packages from requirements.txt, run the following commands from the root of the repository:

uv venv
uv pip install -r requirements.txt

Activate the environment with:

source .venv/bin/activate

3. Data Preparation

Most datasets are downloaded automatically from the Hugging Face Hub during the training process. However, one of the datasets used in Stage 2 training (arxiv) must be downloaded manually from Kaggle.

Step 3.1: Create Data Directories

From the root of the repository, create the directory for the raw data. The training script expects the data in a data_prep directory at the repository root, next to (not inside) the model_exploration folder.

mkdir -p data_prep/raw/

Step 3.2: Download and Extract the ArXiv Dataset

Use the Kaggle API to download the dataset and unzip it into the data_prep/raw/ directory.

# Note: Ensure your Kaggle API token is configured correctly
curl -L -o data_prep/raw/arxiv.zip https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv

# Unzip the contents into the raw data directory
unzip data_prep/raw/arxiv.zip -d data_prep/raw/

This will place the arxiv-metadata-oai-snapshot.json file where the script model_exploration/utils.py expects to find it.
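
Before starting a long training run, it can be worth confirming that the snapshot is where it should be and is readable. The sketch below assumes the file is in JSON Lines form (one JSON object per line), which is how the Kaggle arXiv snapshot is distributed, and reads it lazily so the whole file is never loaded into memory. The helper name `iter_records` is hypothetical.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Dict, Iterator

# Path from the steps above, relative to the repository root.
SNAPSHOT = Path("data_prep/raw/arxiv-metadata-oai-snapshot.json")


def iter_records(path: Path) -> Iterator[Dict]:
    """Yield one record per non-empty line without loading the whole file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    if SNAPSHOT.is_file():
        # Show the field names of the first few records.
        for record in islice(iter_records(SNAPSHOT), 3):
            print(sorted(record.keys()))
    else:
        print(f"Not found: {SNAPSHOT} (run this from the repository root)")
```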

4. Environment Configuration

To manage secrets and important configuration, create a .env file inside the model_exploration directory. This is where you will store your API keys and other environment variables.

Step 4.1: Navigate to the `model_exploration` Directory

cd model_exploration

Step 4.2: Create the `.env` File

Create a file named .env and add the following variables.

# .env file in the 'model_exploration' directory

# W&B: Set to "end" to upload the final model checkpoint as a W&B artifact.
export WANDB_LOG_MODEL="end"

# W&B: Your Weights & Biases API key for logging.
export WANDB_API_KEY="YOUR_WANDB_API_KEY"

# Hugging Face: Your token for accessing models/datasets from the HF Hub.
export HUGGINGFACE_TOKEN="YOUR_HUGGINGFACE_TOKEN"

Replace "YOUR_WANDB_API_KEY" and "YOUR_HUGGINGFACE_TOKEN" with your actual credentials.
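
This README does not state how the training script reads the .env file; it may be sourced in the shell or loaded by a library. As a sketch of what such loading involves, the hypothetical helpers below parse lines of the form shown above (tolerating the leading `export` and the surrounding quotes) and copy them into the process environment without overwriting variables that are already set.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse KEY=value lines, skipping comments and blank lines."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore it
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: Path) -> None:
    """Copy parsed values into os.environ, keeping any existing values."""
    for key, value in parse_env_file(path).items():
        os.environ.setdefault(key, value)
```

Keeping existing values means a key exported in the shell takes precedence over the file, which is convenient on shared machines.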

5. Running the Training

Training is launched with torchrun, which runs distributed data parallel (DDP) training across multiple GPUs.

Step 5.1: Start the Training Script

Ensure you are in the model_exploration directory before running the command.

# Example training command
torchrun --nproc_per_node=auto st_trainer_ddp_hf.py \
    --model_name "nasa-impact/indus-sde-st-v0.1" \
    --num_train_epochs 1 \
    --batch_size 32 \
    --gradient_accumulation_steps 4 \
    --lr 2e-5 \
    --eval_and_save_steps 1000 \
    --output_base "../training_output"
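
With DDP, the number of examples contributing to each optimizer step depends on the per-device batch size, the gradient accumulation steps, and the number of GPUs that `--nproc_per_node=auto` resolves to. The arithmetic can be sketched as follows; the GPU count and dataset size in the example are hypothetical, and the step count is approximate because it ignores how the trainer handles a final partial batch.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Examples contributing to one optimizer step across all GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(num_examples: int, effective_batch: int) -> int:
    """Approximate optimizer steps needed to see every example once."""
    return math.ceil(num_examples / effective_batch)


# Values from the example command, on a hypothetical 4-GPU node.
batch = effective_batch_size(per_device_batch=32, grad_accum_steps=4, num_gpus=4)
print(batch)                                        # 512
print(optimizer_steps_per_epoch(1_000_000, batch))  # 1954
```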

Step 5.2: Command-Line Arguments

You can customize the training run using various command-line arguments. Here are some of the key options available in st_trainer_ddp_hf.py:

Argument	Type	Default	Description
`--nrows`	int	`None`	Number of rows to use from each dataset (for quick testing).
`--n_data_src`	int	`None`	Limit the number of data sources for training.
`--model_name`	str	`nasa-impact/indus-sde-st-v0.1`	The base Sentence Transformer model to fine-tune from the Hugging Face Hub.
`--output_base`	str	`tmp_models`	The base directory where training outputs and checkpoints will be saved.
`--num_train_epochs`	int	`1`	The total number of training epochs to perform.
`--batch_size`	int	`64`	The batch size per device (GPU) for training.
`--lr`	float	`2e-5`	The initial learning rate for the AdamW optimizer.
`--warmup_ratio`	float	`0.1`	The proportion of training steps for the learning rate warm-up.
`--eval_and_save_steps`	int	`1000`	The frequency (in steps) to run evaluation and save a model checkpoint.
`--gradient_accumulation_steps`	int	`8`	Number of steps to accumulate gradients before performing an optimizer step.
`--pretokenize`	flag	`False`	Enable dataset pre-tokenization to speed up training by caching tokenized data.
`--resume_checkpoint_path`	str	`None`	Path to a checkpoint to resume training from.
`--resume_run_id`	str	`None`	The Weights & Biases run ID to resume logging to.
`--custom_lr_scheduler`	flag	`True`	Use the custom learning rate scheduler with cosine decay.
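
To show how `--lr` and `--warmup_ratio` interact with a cosine decay, here is a minimal sketch of a linear warm-up followed by cosine decay to zero. This is the common textbook form, not necessarily the repository's custom scheduler, which may differ (for example in its minimum learning rate); the function name is hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warm-up, then cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr over the warm-up steps.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps and the defaults above, the rate rises for the first 100 steps, peaks at 2e-5 at step 100, passes 1e-5 at step 550, and reaches 0 at step 1,000.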

6. Codebase Overview

st_trainer_ddp_hf.py: The main entry point for the training script. It handles argument parsing, DDP setup, data loading, trainer initialization, and evaluation.
utils.py: Contains helper functions for building dataset configurations (build_dataset_configs_s1, build_dataset_configs_s2), loading and caching datasets (load_and_cache_datasets), and preparing evaluation suites (prepare_evaluators).
distributed.py: Provides utilities for setting up and managing the distributed training environment.
pretokenize.py: Contains the logic for pre-tokenizing the datasets and caching them to disk to accelerate subsequent training runs.
requirements.txt: A list of Python packages required to run the code.
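
For orientation on the distributed setup: torchrun passes each worker its position in the job through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows one way to read them, with single-process fallbacks; it is a hypothetical illustration and not the contents of distributed.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistInfo:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node (usually the GPU index)
    world_size: int  # total number of processes in the job

    @property
    def is_main_process(self) -> bool:
        """True for the single process that should log and save checkpoints."""
        return self.rank == 0


def read_dist_info(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Read torchrun's environment variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return DistInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

Gating logging and checkpoint writes on `is_main_process` is a common way to avoid duplicate output when several workers run the same script.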

Citation

If you use INDUS-SDE, INDUS-SDE-ST, or these workflows in your research, please cite:

@inproceedings{pantha2026indussde,
  author    = {Pantha, Nishan and Awale, Sajil and Kuruvanthodi, Vishnudev and KC, Simran and Ramasubramanian, Muthukumaran and Davis, Carson and Praveen, Bishwas and Foshee, Emily and Bhattacharjee, Bishwaranjan and Bugbee, Kaylin and Ramachandran, Rahul},
  title     = {{INDUS-SDE}: A Language Model for Scientific Content Curation and Discovery},
  year      = {2026},
  isbn      = {979-8-4007-2259-2},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3770855.3818847},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD 2026)},
  location  = {Jeju Island, Republic of Korea},
  series    = {KDD '26}
}
