This repo provides a complete workflow for hosting large AI models on HPC clusters using vLLM and SLURM.
```bash
# Enter repo
cd /data/horse/ws/s1787956-host_vllm/code/host_vllm
# Install dependencies
uv sync
# Make run script executable
chmod +x run.sh
```

Edit config.yaml and set active_model:

```yaml
# Options: gpt-oss-20b, gpt-oss-120b, glm-4.7-flash, glm-4.7, glm-4.6v, glm-4.6v-fp16, qwen3-vl-235b
active_model: glm-4.6v # Default: 8-bit vision model on 2 GPUs
```

Launch the server:

```bash
./run.sh # Use default config.yaml
./run.sh -c my_config.yaml # Use custom config file
./run.sh -h # Show help
# This will:
# - Read the config file
# - Submit the appropriate SLURM job
# - Show the job ID and node assignment
```
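For orientation, the launcher's job boils down to reading the config, submitting the matching SLURM job script, and reporting the job. Below is a minimal, hypothetical sketch of that flow, assuming PyYAML is available and the config layout described later in this README; the real run.sh may differ:

```bash
# Hypothetical sketch of what a launcher like run.sh does; not the actual script.
CONFIG="${1:-config.yaml}"

# Look up the active model's job script in the config (assumes PyYAML is installed)
JOB_SCRIPT=$(python - "$CONFIG" <<'EOF'
import sys, yaml
cfg = yaml.safe_load(open(sys.argv[1]))
print(cfg["models"][cfg["active_model"]]["job_script"])
EOF
)

# Submit the job and report its ID and node assignment
JOB_ID=$(sbatch --parsable "$JOB_SCRIPT")
echo "Submitted job $JOB_ID"
squeue -h -j "$JOB_ID" -o "%N"
```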
Once the job is running:

```bash
# Text-only query
python main.py --model glm-4.6v --prompt "Describe the solar system"
# With image (for vision models)
python main.py --model glm-4.6v --prompt "What's in this image?" --image-url "https://..."
# With local image
python main.py --model glm-4.6v --prompt "Analyze this" --image /path/to/photo.jpg
# List all available models
python main.py --list-models
```
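Under the hood, the default query script (scripts/query_openai_compatible.py) talks to the OpenAI-compatible HTTP API that vLLM serves, so you can also query the server directly. A minimal sketch using the openai Python package; host, port, and the served model name are placeholders to adapt to your job:

```python
# Minimal sketch: query the vLLM server's OpenAI-compatible endpoint directly.
# Host, port, and model name are placeholders; adjust to your job's node and config.
from openai import OpenAI

client = OpenAI(
    base_url="http://hpc-node-01:8000/v1",  # node and port where the vLLM server runs
    api_key="EMPTY",                        # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V-FP8",  # HuggingFace ID served by the job (see the model table below)
    messages=[{"role": "user", "content": "Describe the solar system"}],
)
print(response.choices[0].message.content)
```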
Project structure:

```
.
├── config.yaml                      # Main configuration (default)
├── run.sh                           # Launcher script
├── main.py                          # Example client
├── pyproject.toml                   # Dependencies
├── jobs/                            # SLURM job scripts
│   ├── gpt_oss_20b_single_h100.sh
│   ├── gpt_oss_120b_single_h100.sh
│   ├── glm4_7_flash_2gpus.sh
│   ├── glm4_7_2nodes_4gpus.sh
│   ├── glm4_6v_2gpus_fp8.sh         # GLM-4.6V (8-bit, 2 GPUs)
│   ├── glm4_6v_4gpus_fp16.sh        # GLM-4.6V (16-bit, 4 GPUs)
│   └── qwen3_vl_235b_4nodes.sh
├── scripts/                         # Query scripts
│   ├── query_openai_compatible.py   # Default query script
│   └── README.md
└── logs/                            # Job output logs
```

| Model | HuggingFace ID | GPUs | Vision | Quantization |
|---|---|---|---|---|
| gpt-oss-20b | openai/gpt-oss-20b | 1 | ❌ | MXFP4 |
| gpt-oss-120b | openai/gpt-oss-120b | 1 | ❌ | MXFP4 |
| glm-4.7-flash | zai-org/GLM-4.7-Flash | 2 | ❌ | FP8 |
| glm-4.7 | zai-org/GLM-4.7 | 8 (2 nodes) | ❌ | FP8 |
| glm-4.6v | zai-org/GLM-4.6V-FP8 | 2 | ✅ | FP8 |
| glm-4.6v-fp16 | zai-org/GLM-4.6V | 4 | ✅ | BF16 |
| qwen3-vl-235b | Qwen/Qwen3-VL-235B-A22B-Thinking | 16 (4 nodes) | ✅ | FP8 |

8-bit precision (FP8/MXFP4) is used by default for efficiency. For 16-bit precision, use double the GPUs:

```yaml
# 8-bit (default) - 2 GPUs
active_model: glm-4.6v
# 16-bit - 4 GPUs
active_model: glm-4.6v-fp16
```

A complete example, end to end:

```bash
# 1. Setup
uv sync
# 2. Configure for vision model
# Edit config.yaml: active_model: glm-4.6v
# 3. Launch
./run.sh
# Output: Job 12345 submitted, running on node hpc-node-01
# 4. Query
export VLLM_HOST=hpc-node-01
python main.py --model glm-4.6v --prompt "What do you see?" --image-url "https://example.com/image.jpg"
# 5. Monitor
squeue -u $USER
tail -f logs/glm4_6v_2gpus_fp8_*.out
```
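If a query fails, a quick way to confirm the server is actually up is to list the models it serves; /v1/models is part of the OpenAI-compatible API that vLLM exposes (host and port are placeholders, reusing the VLLM_HOST variable from step 4):

```bash
# Quick health check: should return a JSON list containing the served model
curl http://$VLLM_HOST:8000/v1/models
```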
You can use multiple configuration files and select them at runtime:

```bash
# Create multiple configs
cp config.yaml production.yaml
cp config.yaml development.yaml
# Edit production.yaml to use different model or SLURM settings
# Run with custom config
./run.sh -c production.yaml
./run.sh -c development.yaml
```
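For instance, after copying, development.yaml might differ from production.yaml in only a couple of lines; the values below are purely illustrative:

```yaml
# development.yaml (illustrative changes only)
active_model: gpt-oss-20b     # smallest model: 1 GPU, MXFP4
slurm:
  time_default: "02:00:00"    # shorter walltime for quick experiments
```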
Each model in config.yaml supports:

```yaml
models:
  my-model:
    name: "Display Name"
    huggingface_id: "org/model-id"
    job_script: "jobs/my_model.sh"       # SLURM job script
    query_script: "scripts/query.py"     # Optional: custom query script
    port: 8000                           # vLLM server port
    gpus: 1                              # GPUs required
    nodes: 1                             # Nodes required
    quantization: "fp8"                  # Quantization mode
    precision: "8bit"                    # Precision (8bit/16bit)
    vision: false                        # Vision model support
    # Optional: Override global SLURM settings per model
    time: "24:00:00"                     # Job time limit (default from slurm.time_default)
    partition: "long"                    # SLURM partition (default from slurm.partition)
    description: "Description for display"
```
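As an illustration of how these fields fit together, here is a minimal sketch (not part of this repo, PyYAML assumed) that resolves the active model and applies per-model overrides on top of the global SLURM defaults shown below:

```python
# Minimal sketch: resolve the active model's entry from config.yaml.
# Assumes PyYAML is installed; error handling is omitted for brevity.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

model_key = cfg["active_model"]
model = cfg["models"][model_key]

# Per-model values fall back to the global SLURM defaults.
slurm = cfg.get("slurm", {})
time_limit = model.get("time", slurm.get("time_default"))
partition = model.get("partition", slurm.get("partition"))

print(f"{model_key}: {model['huggingface_id']} on {model['gpus']} GPU(s), "
      f"partition={partition}, time={time_limit}")
```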
partition: "capella" # SLURM partition
account: "" # SLURM account (optional)
time_default: "12:00:00" # Default walltime
cpus_per_task: 16 # CPUs per taskModels can define custom query_script in config. Default is scripts/query_openai_compatible.py.
To create a custom query script:
- Create a script in the scripts/ directory
- Make it executable: chmod +x scripts/my_script.py
- Reference it in your model config
See scripts/README.md for more details.
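As a starting point, a custom query script can hit the same OpenAI-compatible endpoint the default script uses. A minimal, hypothetical sketch (argument names and defaults are placeholders, not necessarily the interface this repo expects):

```python
#!/usr/bin/env python
# Hypothetical custom query script against the vLLM OpenAI-compatible API.
# Host, port, and model name are placeholders; adapt argument handling to your needs.
import argparse
from openai import OpenAI

parser = argparse.ArgumentParser()
parser.add_argument("--host", default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--model", required=True)
parser.add_argument("--prompt", required=True)
args = parser.parse_args()

client = OpenAI(base_url=f"http://{args.host}:{args.port}/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model=args.model,
    messages=[{"role": "user", "content": args.prompt}],
)
print(response.choices[0].message.content)
```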