This repository contains the code accompanying our paper:
**An ECG biomarker for sudden cardiac death discovered via deep learning**
Ziad Obermeyer, Alexander Schubert, James Ross, Sendhil Mullainathan, Markus Lingman
The codebase implements three interconnected components:
- Data preprocessing — Cohort construction and waveform preparation from 12-lead ECG recordings.
- SCD risk model — A 1D residual network (ResNet) trained to predict sudden cardiac death across multiple time horizons from 10-second ECG traces.
- Morphing analysis — A generative model pipeline for identifying the waveform morphology features that drive model risk predictions.
The full pipeline is configured to run on the publicly accessible NTUH (National Taiwan University Hospital) dataset, one of the external validation cohorts from the paper, available through the Nightingale Open Science platform. The training cohort is a proprietary Swedish ECG dataset that cannot be released under applicable data-use agreements, but all code, model architecture, and training procedures are provided here.
We recommend Python 3.12 in a conda environment:

```shell
conda create -n ecg_scd_env python=3.12
conda activate ecg_scd_env
pip install -r requirements.txt
pip install .
```

Use the Nightingale Open Science documentation for access and schema:
- Access process: https://docs.ngsci.org/access-data
- Dataset page: https://docs.ngsci.org/datasets/arrest-ntuh-ecg/
- Data dictionary: https://docs.ngsci.org/datasets/arrest-ntuh-ecg/data-dictionary.html
This repository expects the arrest-ntuh-ecg/v1/... directory structure described in the documentation.
Scripts in `00_Data_Preprocessing/`:

| Script | Description |
|---|---|
| `x01_build_ntuh_df.py` | Reads NTUH case and control metadata, merges ECG parameters, derives SCD outcome labels at 3 mo / 6 mo / 1 yr / 2 yr, and writes the primary cohort table |
| `x02_generate_dummy_columns.py` | Adds pipeline-required columns not present in the NTUH schema (column renames, scaling factor, patient ID, interval SCD labels, modelling inclusion flag); also fills in covariates used in training but absent from NTUH with realistic synthetic placeholder values, for pipeline compatibility |
| `x03_process_10sec_ecg.py` | Extracts per-ECG waveform arrays and corrects a gain inconsistency in the NTUH-derived limb leads (III, aVR, aVL, aVF) |
| `x04_segment_beats.py` | Detects R-peaks, segments individual beats, applies quality filters, and saves clean beat arrays for the beat-level model |
| `x05_create_morphing_beats.py` | Applies stricter ECG-level and beat-level filtering for the morphing pipeline and produces a filtering waterfall table |
| `x06_create_aVL_feature.py` | Parameterizes the aVL ECG biomarker discovered via morphing (RS-slope features) and appends the features to the cohort table |
| `x07_run_ntuh_style_regression.py` | Runs the NTUH association analysis, replicating the regression reported in the paper supplement that examines the relationship between the aVL biomarker and SCD outcomes |
| `x08_plot_ecg.py` | Utility for visualising individual ECG waveforms |
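Conceptually, the beat segmentation performed by `x04_segment_beats.py` amounts to R-peak detection followed by fixed-window extraction. The sketch below illustrates that idea with a deliberately naive threshold-based detector; it is not the repository's implementation, and the sampling rate, window length, and refractory period are assumptions:

```python
import numpy as np

def detect_r_peaks(sig, fs=500, refractory_s=0.25):
    """Naive R-peak detector: local maxima above a threshold, with a
    refractory period so each beat is counted once. Illustrative only;
    production pipelines use far more robust detectors."""
    thresh = sig.mean() + 2 * sig.std()
    refractory = int(refractory_s * fs)
    peaks, last = [], -refractory
    for i in range(1, len(sig) - 1):
        if sig[i] > thresh and sig[i] >= sig[i - 1] and sig[i] > sig[i + 1]:
            if i - last >= refractory:
                peaks.append(i)
                last = i
    return peaks

def segment_beats(sig, peaks, fs=500, window_ms=600):
    """Cut a fixed window centred on each R-peak, dropping beats whose
    window would run off either end of the recording."""
    half = int(fs * window_ms / 1000) // 2
    return np.array([sig[p - half:p + half] for p in peaks
                     if p - half >= 0 and p + half <= len(sig)])
```

The quality filters mentioned above (and the stricter ones in `x05_create_morphing_beats.py`) would then discard segmented beats that fail morphology or noise checks.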
Scripts in `01_Predictive_Model/`:

| Script | Description |
|---|---|
| `x01_train_10s_ecg.py` | Trains a ResNet on 10-second 12-lead ECGs with cumulative SCD probability prediction heads |
| `x02_train_beats.py` | Trains a ResNet on segmented single-beat ECGs |
| `x03_predict.py` | Runs inference with a trained model and saves predictions as a Feather file |
| `prediction_commands.md` | Example prediction commands for the models produced by the training scripts |
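The "cumulative SCD probability" heads predict risk at nested horizons (3 mo, 6 mo, 1 yr, 2 yr), so the predicted probabilities should be non-decreasing across horizons. One common way to enforce that constraint, shown below purely as an illustration (this is not necessarily the parameterization used in `x01_train_10s_ecg.py`), is to predict non-negative per-interval hazard increments and accumulate them, as in a discrete-time survival model:

```python
import numpy as np

def cumulative_risk(logits):
    """Map unconstrained per-horizon logits to monotone cumulative risks.

    Each logit is squashed to a non-negative hazard increment via softplus,
    the increments are cumulatively summed, and the running hazard total is
    mapped into (0, 1) with 1 - exp(-H).
    """
    increments = np.log1p(np.exp(logits))      # softplus -> non-negative
    cumulative_hazard = np.cumsum(increments)  # non-decreasing by construction
    return 1.0 - np.exp(-cumulative_hazard)    # monotone probabilities
```

With four logits in horizon order, the returned probabilities for 3 mo / 6 mo / 1 yr / 2 yr can never decrease, which matches the semantics of a cumulative outcome.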
The `02_Morphing/` directory implements the generative-model pipeline used to interpret model predictions. It requires a trained convolutional backbone from `01_Predictive_Model/` (the generator is conditioned on the risk model's latent representations).
| Script | Description |
|---|---|
| `s08_train_generator.py` | Trains the VAE-based generative model on clean beat segments from `ecg_beats_morphing/` |
| `s10_morph_ecgs.py` | Generates high-risk waveform morphs via latent-space perturbation |
| `s12_morph_stats.py` | Summarizes morphing outputs into interpretable statistics |
Supporting modules (s01–s07, s09, s11) implement the generative model components, data utilities, and post-processing steps.
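At a high level, latent-space morphing encodes a beat, moves its latent code in a direction that increases predicted risk, and decodes the result at each step. The toy sketch below shows only this control flow; `encode`, `decode`, and `risk_grad` are hypothetical stand-ins for the trained VAE encoder, decoder, and the gradient of the risk model with respect to the latent code:

```python
import numpy as np

def morph_toward_high_risk(beat, encode, decode, risk_grad,
                           step=0.1, n_steps=10):
    """Gradient-ascent morphing sketch: perturb the latent code along the
    risk gradient and decode a waveform at every step."""
    z = encode(beat)
    morphs = []
    for _ in range(n_steps):
        z = z + step * risk_grad(z)   # move toward higher predicted risk
        morphs.append(decode(z))      # waveform after this perturbation
    return morphs
```

Comparing the decoded waveforms along the trajectory against the original beat is what surfaces the morphology features (here, the aVL RS-slope pattern) that drive the model's predictions.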
All commands should be run from the repository root.
```shell
# Build the cohort table from NTUH source files
python 00_Data_Preprocessing/x01_build_ntuh_df.py
# Output: covariate_df.feather

# Add pipeline-required columns and derive SCD interval labels
python 00_Data_Preprocessing/x02_generate_dummy_columns.py
# Output: covariate_df.feather (updated)

# Extract and gain-correct individual ECG waveform arrays
python 00_Data_Preprocessing/x03_process_10sec_ecg.py
# Output: 10_sec_ecgs/<studyId>.npy

# Segment and quality-filter individual beats for the beat-level model
python 00_Data_Preprocessing/x04_segment_beats.py
# Output: ecg_beats/<studyId>_<beat>.npy

# Apply stricter beat filtering for morphing analysis
python 00_Data_Preprocessing/x05_create_morphing_beats.py
# Output: ecg_beats_morphing/, x05_waterfall_table_results.csv

# Compute aVL RS features from waveform arrays
python 00_Data_Preprocessing/x06_create_aVL_feature.py
# Output: covariate_df.feather (updated with aVL_rs_diff, aVL_rs_diff_2)
```
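The waveform outputs above are plain NumPy arrays on disk, so they are easy to sanity-check before training. A minimal helper (the 12-lead shape shown in the usage note is an assumption; the sample count depends on the sampling rate):

```python
import numpy as np

def check_ecg_array(path):
    """Load a saved ECG array and report its shape, dtype, and value range."""
    ecg = np.load(path)
    print(f"{path}: shape={ecg.shape}, dtype={ecg.dtype}, "
          f"range=[{ecg.min():.3f}, {ecg.max():.3f}]")
    return ecg
```

For a 10-second 12-lead recording you would expect something like `shape=(12, n_samples)`; a wildly different shape or value range usually points at a preprocessing problem.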
```shell
# Run the NTUH association analysis (replicates the paper supplement regression)
python 00_Data_Preprocessing/x07_run_ntuh_style_regression.py
# Default: case-control subset
# To run on all available ECG records:
python 00_Data_Preprocessing/x07_run_ntuh_style_regression.py --filter-mode all
```

```shell
# Train the 10-second 12-lead ECG model
python 01_Predictive_Model/x01_train_10s_ecg.py
# Output: modelfits_ecg/ntuh_scd_model_demo/

# Train the beat-level model
python 01_Predictive_Model/x02_train_beats.py
# Output: modelfits_beat/ntuh_scd_beat_model_demo/
```

```shell
# 10-second ECG model
python 01_Predictive_Model/x03_predict.py \
    --model_name ntuh_scd_model_demo \
    --covariate_df_path covariate_df.feather \
    --ecg_dir 10_sec_ecgs
# Output: predictions/ntuh_scd_model_demo_1/predictions.feather

# Beat-level model
python 01_Predictive_Model/x03_predict.py \
    --model_name ntuh_scd_beat_model_demo \
    --covariate_df_path covariate_df.feather \
    --ecg_dir ecg_beats \
    --beat
# Output: predictions/ntuh_scd_beat_model_demo_1/predictions.feather
```

For additional prediction options, see 01_Predictive_Model/prediction_commands.md.
The morphing pipeline uses the trained convolutional backbone from Step 3.
```shell
# Train the generative model
python 02_Morphing/s08_train_generator.py

# Generate high-risk morphs
python 02_Morphing/s10_morph_ecgs.py

# Summarize morph outputs
python 02_Morphing/s12_morph_stats.py
# Output: morphing_outputs/
```

| Artifact | Description |
|---|---|
| `covariate_df.feather` | Main cohort table used throughout the pipeline (patient metadata, ECG parameters, outcomes, and derived features) |
| `10_sec_ecgs/` | Gain-corrected 10-second ECG arrays, one `.npy` file per study |
| `ecg_beats/` | Quality-filtered individual beat arrays for beat-level modelling |
| `ecg_beats_morphing/` | Stricter-filtered beat directory for morphing analysis |
| `modelfits_ecg/` | Trained model weights and hyperparameter logs for 10-second ECG models |
| `modelfits_beat/` | Trained model weights and hyperparameter logs for beat-level models |
| `predictions/` | Inference outputs; each subdirectory contains a `predictions.feather` file |
| `morphing_outputs/` | Generated waveform morphs and summary statistics |