This repository contains the code accompanying our paper:
**An ECG biomarker for sudden cardiac death discovered via deep learning**
Ziad Obermeyer, Alexander Schubert, James Ross, Sendhil Mullainathan, Markus Lingman
The codebase implements three interconnected components:
- Data preprocessing — Cohort construction and waveform preparation from 12-lead ECG recordings.
- SCD risk model — A 1D residual network (ResNet) trained to predict sudden cardiac death across multiple time horizons from 10-second ECG traces.
- Morphing analysis — A generative model pipeline for identifying the waveform morphology features that drive model risk predictions.
The full pipeline is configured to run on the publicly accessible NTUH (National Taiwan University Hospital) dataset, one of the external validation cohorts from the paper, available through the Nightingale Open Science platform. The training cohort is a proprietary Swedish ECG dataset that cannot be released under applicable data-use agreements, but all code, model architecture, and training procedures are provided here.
We recommend Python 3.12 in a conda environment:

```shell
conda create -n ecg_scd_env python=3.12
conda activate ecg_scd_env
pip install -r requirements.txt
pip install .
```

Use the Nightingale Open Science documentation for access and schema:
- Access process: https://docs.ngsci.org/access-data
- Dataset page: https://docs.ngsci.org/datasets/arrest-ntuh-ecg/
- Data dictionary: https://docs.ngsci.org/datasets/arrest-ntuh-ecg/data-dictionary.html
This repository expects the arrest-ntuh-ecg/v1/... directory structure described in the documentation.
Scripts in `00_Data_Preprocessing/`:

| Script | Description |
|---|---|
| `x01_build_ntuh_df.py` | Reads NTUH case and control metadata, merges ECG parameters, derives SCD outcome labels at 3 mo / 6 mo / 1 yr / 2 yr, and writes the primary cohort table |
| `x02_generate_dummy_columns.py` | Adds pipeline-required columns not present in the NTUH schema (column renames, scaling factor, patient ID, interval SCD labels, modelling inclusion flag); also fills in covariates used in training but absent from NTUH with realistic synthetic placeholder values, for pipeline compatibility |
| `x03_process_10sec_ecg.py` | Extracts per-ECG waveform arrays and corrects a gain inconsistency in the NTUH-derived limb leads (III, aVR, aVL, aVF) |
| `x04_segment_beats.py` | Detects R-peaks, segments individual beats, applies quality filters, and saves clean beat arrays for the beat-level model |
| `x05_create_morphing_beats.py` | Applies stricter ECG-level and beat-level filtering for the morphing pipeline and produces a filtering waterfall table |
| `x06_create_aVL_feature.py` | Parameterizes the aVL ECG biomarker discovered via morphing (RS-slope features) and appends the features to the cohort table |
| `x07_run_ntuh_style_regression.py` | Runs the NTUH association analysis, replicating the regression reported in the paper supplement that examines the relationship between the aVL biomarker and SCD outcomes |
| `x08_plot_ecg.py` | Utility for visualising individual ECG waveforms |
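Conceptually, the beat segmentation performed by `x04_segment_beats.py` amounts to R-peak detection followed by fixed-window extraction. The sketch below illustrates that idea with a deliberately naive threshold-based detector; it is not the repository's implementation, and the sampling rate, window length, and refractory period are assumptions:

```python
import numpy as np

def detect_r_peaks(sig, fs=500, refractory_s=0.25):
    """Naive R-peak detector: local maxima above a threshold, with a
    refractory period so each beat is counted once. Illustrative only;
    production pipelines use far more robust detectors."""
    thresh = sig.mean() + 2 * sig.std()
    refractory = int(refractory_s * fs)
    peaks, last = [], -refractory
    for i in range(1, len(sig) - 1):
        if sig[i] > thresh and sig[i] >= sig[i - 1] and sig[i] > sig[i + 1]:
            if i - last >= refractory:
                peaks.append(i)
                last = i
    return peaks

def segment_beats(sig, peaks, fs=500, window_ms=600):
    """Cut a fixed window centred on each R-peak, dropping beats whose
    window would run off either end of the recording."""
    half = int(fs * window_ms / 1000) // 2
    return np.array([sig[p - half:p + half] for p in peaks
                     if p - half >= 0 and p + half <= len(sig)])
```

The quality filters mentioned above (and the stricter ones in `x05_create_morphing_beats.py`) would then discard segmented beats that fail morphology or noise checks.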
Scripts in `01_Predictive_Model/`:

| Script | Description |
|---|---|
| `x01_train_10s_ecg.py` | Trains a ResNet on 10-second 12-lead ECGs with cumulative SCD probability prediction heads |
| `x02_train_beats.py` | Trains a ResNet on segmented single-beat ECGs |
| `x03_predict.py` | Runs inference with a trained model and saves predictions as a Feather file |
| `prediction_commands.md` | Example prediction commands for the models produced by the training scripts |
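The "cumulative SCD probability" heads predict risk at nested horizons (3 mo, 6 mo, 1 yr, 2 yr), so the predicted probabilities should be non-decreasing across horizons. One common way to enforce that constraint, shown below purely as an illustration (this is not necessarily the parameterization used in `x01_train_10s_ecg.py`), is to predict non-negative per-interval hazard increments and accumulate them, as in a discrete-time survival model:

```python
import numpy as np

def cumulative_risk(logits):
    """Map unconstrained per-horizon logits to monotone cumulative risks.

    Each logit is squashed to a non-negative hazard increment via softplus,
    the increments are cumulatively summed, and the running hazard total is
    mapped into (0, 1) with 1 - exp(-H).
    """
    increments = np.log1p(np.exp(logits))      # softplus -> non-negative
    cumulative_hazard = np.cumsum(increments)  # non-decreasing by construction
    return 1.0 - np.exp(-cumulative_hazard)    # monotone probabilities
```

With four logits in horizon order, the returned probabilities for 3 mo / 6 mo / 1 yr / 2 yr can never decrease, which matches the semantics of a cumulative outcome.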
The `02_Morphing/` directory implements the generative-model pipeline used to interpret model predictions. It requires a trained convolutional backbone from `01_Predictive_Model/` (the generator is conditioned on the risk model's latent representations).
| Script | Description |
|---|---|
| `s08_train_generator.py` | Trains the VAE-based generative model on clean beat segments from `ecg_beats_morphing/` |
| `s10_morph_ecgs.py` | Generates high-risk waveform morphs via latent-space perturbation |
| `s12_morph_stats.py` | Summarizes morphing outputs into interpretable statistics |
Supporting modules (s01–s07, s09, s11) implement the generative model components, data utilities, and post-processing steps.
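At a high level, latent-space morphing encodes a beat, moves its latent code in a direction that increases predicted risk, and decodes the result at each step. The toy sketch below shows only this control flow; `encode`, `decode`, and `risk_grad` are hypothetical stand-ins for the trained VAE encoder, decoder, and the gradient of the risk model with respect to the latent code:

```python
import numpy as np

def morph_toward_high_risk(beat, encode, decode, risk_grad,
                           step=0.1, n_steps=10):
    """Gradient-ascent morphing sketch: perturb the latent code along the
    risk gradient and decode a waveform at every step."""
    z = encode(beat)
    morphs = []
    for _ in range(n_steps):
        z = z + step * risk_grad(z)   # move toward higher predicted risk
        morphs.append(decode(z))      # waveform after this perturbation
    return morphs
```

Comparing the decoded waveforms along the trajectory against the original beat is what surfaces the morphology features (here, the aVL RS-slope pattern) that drive the model's predictions.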
All commands should be run from the repository root.
```shell
# Build the cohort table from NTUH source files
python 00_Data_Preprocessing/x01_build_ntuh_df.py
# Output: covariate_df.feather

# Add pipeline-required columns and derive SCD interval labels
python 00_Data_Preprocessing/x02_generate_dummy_columns.py
# Output: covariate_df.feather (updated)

# Extract and gain-correct individual ECG waveform arrays
python 00_Data_Preprocessing/x03_process_10sec_ecg.py
# Output: 10_sec_ecgs/<studyId>.npy

# Segment and quality-filter individual beats for the beat-level model
python 00_Data_Preprocessing/x04_segment_beats.py
# Output: ecg_beats/<studyId>_<beat>.npy

# Apply stricter beat filtering for morphing analysis
python 00_Data_Preprocessing/x05_create_morphing_beats.py
# Output: ecg_beats_morphing/, x05_waterfall_table_results.csv

# Compute aVL RS features from waveform arrays
python 00_Data_Preprocessing/x06_create_aVL_feature.py
# Output: covariate_df.feather (updated with aVL_rs_diff, aVL_rs_diff_2)
```
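The waveform outputs above are plain NumPy arrays on disk, so they are easy to sanity-check before training. A minimal helper (the 12-lead shape shown in the usage note is an assumption; the sample count depends on the sampling rate):

```python
import numpy as np

def check_ecg_array(path):
    """Load a saved ECG array and report its shape, dtype, and value range."""
    ecg = np.load(path)
    print(f"{path}: shape={ecg.shape}, dtype={ecg.dtype}, "
          f"range=[{ecg.min():.3f}, {ecg.max():.3f}]")
    return ecg
```

For a 10-second 12-lead recording you would expect something like `shape=(12, n_samples)`; a wildly different shape or value range usually points at a preprocessing problem.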
```shell
# Run the NTUH association analysis (replicates the paper supplement regression)
python 00_Data_Preprocessing/x07_run_ntuh_style_regression.py
# Default: case-control subset
# To run on all available ECG records:
python 00_Data_Preprocessing/x07_run_ntuh_style_regression.py --filter-mode all
```

```shell
# Train the 10-second 12-lead ECG model
python 01_Predictive_Model/x01_train_10s_ecg.py
# Output: modelfits_ecg/ntuh_scd_model_demo/

# Train the beat-level model
python 01_Predictive_Model/x02_train_beats.py
# Output: modelfits_beat/ntuh_scd_beat_model_demo/
```

```shell
# 10-second ECG model
python 01_Predictive_Model/x03_predict.py \
    --model_name ntuh_scd_model_demo \
    --covariate_df_path covariate_df.feather \
    --ecg_dir 10_sec_ecgs
# Output: predictions/ntuh_scd_model_demo_1/predictions.feather

# Beat-level model
python 01_Predictive_Model/x03_predict.py \
    --model_name ntuh_scd_beat_model_demo \
    --covariate_df_path covariate_df.feather \
    --ecg_dir ecg_beats \
    --beat
# Output: predictions/ntuh_scd_beat_model_demo_1/predictions.feather
```

For additional prediction options, see 01_Predictive_Model/prediction_commands.md.
The morphing pipeline uses the trained convolutional backbone from Step 3.
```shell
# Train the generative model
python 02_Morphing/s08_train_generator.py

# Generate high-risk morphs
python 02_Morphing/s10_morph_ecgs.py

# Summarize morph outputs
python 02_Morphing/s12_morph_stats.py
# Output: morphing_outputs/
```

| Artifact | Description |
|---|---|
| `covariate_df.feather` | Main cohort table used throughout the pipeline (patient metadata, ECG parameters, outcomes, and derived features) |
| `10_sec_ecgs/` | Gain-corrected 10-second ECG arrays, one `.npy` file per study |
| `ecg_beats/` | Quality-filtered individual beat arrays for beat-level modelling |
| `ecg_beats_morphing/` | Stricter-filtered beat directory for morphing analysis |
| `modelfits_ecg/` | Trained model weights and hyperparameter logs for 10-second ECG models |
| `modelfits_beat/` | Trained model weights and hyperparameter logs for beat-level models |
| `predictions/` | Inference outputs; each subdirectory contains a `predictions.feather` file |
| `morphing_outputs/` | Generated waveform morphs and summary statistics |