Paper: https://arxiv.org/abs/2505.22919 | Dataset: PhysioNet | Code: GitHub
ER-Reason is a benchmark for evaluating large language models (LLMs) on clinical reasoning across key stages of the emergency room (ER) workflow. Unlike benchmarks based on medical licensing exams, ER-Reason evaluates not just what decisions models make, but how their reasoning evolves as clinical evidence accumulates.
ER-Reason consists of two components:
- Longitudinal clinical records from 3,984 hospital encounters, comprising 25,174 de-identified clinical notes spanning discharge summaries, progress notes, H&Ps, consult notes, imaging reports, and ER provider notes — supporting evaluation across triage intake, disposition planning, and final diagnosis.
- SCT reasoning evaluation comprising 194 physician-authored patient cases annotated by two ER physicians (2,555 total annotations), with three metrics — DxUpdate, DxTrajectory, and FinalDx — that measure sequential belief updating against physician consensus.
The ER-Reason dataset is hosted on PhysioNet and requires credentialed access:
- Register at physionet.org and complete CITI training
- Request access at: https://physionet.org/content/er-reason/1.0.0/
- Once approved, download the dataset files
Note: The dataset contains de-identified patient data and is governed by a PhysioNet data use agreement. Do not share or redistribute.
| File | Description | Key columns |
|---|---|---|
| `er_reason.csv` | Main dataset — one row per encounter, includes all clinical note text, demographic fields, acuity level, disposition, and primary ED diagnosis | patientdurablekey, encounterkey, primarychiefcomplaintname, primaryeddiagnosisname, acuitylevel, eddisposition, ED_Provider_Notes_Text, Discharge_Summary_Text, One_Sentence_Extracted |
| `icd_10_codes.csv` | Ground-truth ICD-10 codes per encounter — one row per code, multiple rows per encounter | patientdurablekey, encounterkey, value (ICD-10 code), displaystring (diagnosis name) |
| `annotated_sct.csv` | SCT evaluation cases derived from er_reason.csv — one row per encounter, includes the one-sentence patient summary and up to 5 differential/evidence pairs per case | encounterkey, One_Sentence_Extracted, differential_1–differential_5, evidence_1–evidence_5 |
| `sct-annotations.csv` | Master list of the 194 SCT case encounterkeys — used to filter er_reason.csv to SCT encounters | encounterkey |
| `sct_cleaned_annotations.csv` | Physician rationales for each differential/evidence step — one row per (encounter, differential) pair | encounterkey, differential, rationale |
| `gt_clean.csv` | Ground-truth physician consensus scores for SCT evaluation — one row per (encounter, differential) pair | encounterkey, differential, dxupdate (ordinal score −2 to +2), dxtrajectory (ranked dict of all differentials) |
The CCSR reference file used for diagnosis evaluation must be downloaded separately from AHRQ: https://hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp
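All of the files above key on `encounterkey` (and, where present, `patientdurablekey`). A minimal loading sketch, assuming the CSVs have been downloaded into a local `data/` directory (the directory name is illustrative, not part of the repository):

```python
import pandas as pd

# Assumed local paths -- adjust to wherever you placed the PhysioNet download.
encounters = pd.read_csv("data/er_reason.csv")        # one row per encounter
icd_codes  = pd.read_csv("data/icd_10_codes.csv")     # one row per ICD-10 code
sct_keys   = pd.read_csv("data/sct-annotations.csv")  # the 194 SCT encounterkeys

# Attach ground-truth ICD-10 codes (many codes per encounter -> left join).
encounters_with_codes = encounters.merge(
    icd_codes[["encounterkey", "value", "displaystring"]],
    on="encounterkey",
    how="left",
)

# Restrict the main table to the 194 SCT encounters.
sct_encounters = encounters[encounters["encounterkey"].isin(sct_keys["encounterkey"])]
print(len(sct_encounters))  # expected: 194
```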
ER-Reason/
├── Experiments/
│ ├── Standard Clinical Tasks/
│ │ ├── acuity.py # Acuity prediction (zero-shot + step-back)
│ │ ├── disposition.py # Disposition prediction (zero-shot + step-back)
│ │ ├── final_diagnosis.py # Final diagnosis prediction (zero-shot + step-back)
│ │ ├── diag_evaluation.py # ICD-10 exact match + CCSR accuracy
│ │ └── cross_stage_analysis.py # Cross-stage workflow accuracy (Table 5)
│ └── SCT Reasoning/
│ ├── clinical_knowledge.py # Clinical knowledge baseline (Table 3 CK column)
│ ├── sct.py # SCT evaluation — baseline, single oracle, full oracle
│ └── sct_eval.py # DxUpdate, DxTrajectory, FinalDx, coherence, Figure 3
├── ER-Reason-V1-Archive/ # Original codebase (archived)
├── ER-Reason_Column_Descriptions.md
├── README.md
└── requirements.txt
1. Clone the repository
git clone https://github.com/AlaaLab/ER-Reason.git
cd ER-Reason
2. Install dependencies
pip install -r requirements.txt
3. Set your OpenRouter API key
All experiment scripts use OpenRouter so you can swap in any supported model with a single line change.
export OPENROUTER_API_KEY="sk-or-..."
All scripts also enable Zero Data Retention ("provider": {"zdr": True}) by default, routing requests only to ZDR-compliant providers.
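For reference, a minimal sketch of such a request through OpenRouter's OpenAI-compatible endpoint; the model string and prompt below are placeholders, and the experiment scripts wrap this pattern with their own prompting logic:

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; the key is read from the environment.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5",  # placeholder -- set via MODEL_NAME in each script
    messages=[{"role": "user", "content": "..."}],
    # Zero Data Retention routing, as enabled by default in the experiment scripts.
    extra_body={"provider": {"zdr": True}},
)
print(response.choices[0].message.content)
```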
Each script has a MODEL_NAME variable at the top. Replace it with any OpenRouter model string:
MODEL_NAME = "openai/gpt-5.2-20251211"
# MODEL_NAME = "openai/o4-mini" # remove temperature parameter for this model
# MODEL_NAME = "deepseek/deepseek-r1"
# MODEL_NAME = "google/gemini-2.5-flash"
# MODEL_NAME = "anthropic/claude-sonnet-4-5"
# MODEL_NAME = "microsoft/phi-4"To enable Claude thinking mode, add to the API call:
extra_body={"thinking": {"type": "enabled", "budget_tokens": 10000}}
All three scripts run zero-shot and step-back conditions in a single pass, saving results to a CSV with a condition column.
Acuity prediction
python Experiments/Standard\ Clinical\ Tasks/acuity.py
# Output: acuity_results.csv
Disposition prediction
python Experiments/Standard\ Clinical\ Tasks/disposition.py
# Output: disposition_results.csv
Final diagnosis prediction
python Experiments/Standard\ Clinical\ Tasks/final_diagnosis.py
# Output: diagnosis_results.csv
Diagnosis evaluation (ICD-10 exact match + CCSR accuracy)
python Experiments/Standard\ Clinical\ Tasks/diag_evaluation.py
# Reads: diagnosis_results.csv, icd_10_codes.csv, DXCCSR_v2025-1.CSV
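One way to read "CCSR accuracy" is that a prediction counts as correct when the predicted and ground-truth ICD-10 codes map to the same CCSR category. A rough, illustrative sketch under that assumption, with the `code_to_ccsr` mapping assumed to have been built from DXCCSR_v2025-1.CSV (this is a simplification, not the script's actual implementation):

```python
from typing import Dict, Set

def exact_and_ccsr_match(
    predicted_codes: Set[str],
    true_codes: Set[str],
    code_to_ccsr: Dict[str, str],  # ICD-10 code -> CCSR category (from DXCCSR_v2025-1.CSV)
) -> Dict[str, bool]:
    """Illustrative only: compare one encounter's prediction at both granularities."""
    # Exact match: any predicted ICD-10 code equals a ground-truth code.
    exact = bool(predicted_codes & true_codes)
    # CCSR match: any predicted code shares a CCSR category with a ground-truth code.
    pred_categories = {code_to_ccsr[c] for c in predicted_codes if c in code_to_ccsr}
    true_categories = {code_to_ccsr[c] for c in true_codes if c in code_to_ccsr}
    ccsr = bool(pred_categories & true_categories)
    return {"exact_match": exact, "ccsr_match": ccsr}
```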
Cross-stage workflow accuracy (Table 5)
python Experiments/Standard\ Clinical\ Tasks/cross_stage_analysis.py
# Reads: acuity_results.csv, disposition_results.csv, diagnosis_results.csv,
#        icd_10_codes.csv, DXCCSR_v2025-1.CSV
Clinical knowledge baseline
python Experiments/SCT\ Reasoning/clinical_knowledge.py
# Output: clinical_knowledge_results.csv
SCT evaluation — runs all three conditions (baseline, single oracle, full oracle)
python Experiments/SCT\ Reasoning/sct.py
# Output: sct_results.csv
# sct_baseline_checkpoint.csv
# sct_single_oracle_checkpoint.csv
#         sct_full_oracle_checkpoint.csv
Each condition checkpoints independently — if interrupted, re-running will resume from where it left off.
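The resume behavior is the usual skip-what-is-already-done checkpoint pattern. An illustrative sketch, with function and column names assumed for the example rather than taken from `sct.py`:

```python
import os
import pandas as pd

def run_condition(cases: pd.DataFrame, checkpoint_path: str) -> pd.DataFrame:
    """Illustrative resume-from-checkpoint loop: skip encounters already written out."""
    done = set()
    if os.path.exists(checkpoint_path):
        done = set(pd.read_csv(checkpoint_path)["encounterkey"])

    for _, case in cases.iterrows():
        if case["encounterkey"] in done:
            continue  # completed in a previous run
        result = {"encounterkey": case["encounterkey"], "response": "..."}  # model call goes here
        # Append each result immediately so progress survives interruption.
        pd.DataFrame([result]).to_csv(
            checkpoint_path,
            mode="a",
            header=not os.path.exists(checkpoint_path),
            index=False,
        )
    return pd.read_csv(checkpoint_path)
```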
SCT metrics and figures (Tables 2, 3, 4 and Figure 3)
python Experiments/SCT\ Reasoning/sct_eval.py
# Reads: sct_results.csv, gt_clean.csv
# Output: top1_by_timestep.pdf, top1_by_timestep.png
If you use ER-Reason in your research, please cite:
@article{mehandru2025er,
title={{ER-Reason}: A Benchmark Dataset for {LLM}-Based Clinical Reasoning in the Emergency Room},
author={Mehandru, Nikita and Golchini, Niloufar and Bamman, David and Zack, Travis and Molina, Melanie F and Alaa, Ahmed},
journal={arXiv preprint arXiv:2505.22919},
year={2025}
}
The code in this repository is released under the MIT License. The dataset is governed by the PhysioNet Data Use Agreement and may not be redistributed.