alexmschubert/ECG-SCD
An ECG Biomarker for Sudden Cardiac Death

This repository contains the code accompanying our paper:

An ECG biomarker for sudden cardiac death discovered via deep learning
Ziad Obermeyer, Alexander Schubert, James Ross, Sendhil Mullainathan, Markus Lingman

The codebase implements three interconnected components:

  1. Data preprocessing — Cohort construction and waveform preparation from 12-lead ECG recordings.
  2. SCD risk model — A 1D residual network (ResNet) trained to predict sudden cardiac death across multiple time horizons from 10-second ECG traces.
  3. Morphing analysis — A generative model pipeline for identifying the waveform morphology features that drive model risk predictions.

The full pipeline is configured to run on the publicly accessible NTUH (National Taiwan University Hospital) dataset, one of the external validation cohorts from the paper, available through the Nightingale Open Science platform. The training cohort is a proprietary Swedish ECG dataset that cannot be released under applicable data-use agreements, but all code, model architecture, and training procedures are provided here.


Setup

We recommend Python 3.12 in a conda environment:

conda create -n ecg_scd_env python=3.12
conda activate ecg_scd_env
pip install -r requirements.txt
pip install .

Dataset access (NTUH)

Follow the Nightingale Open Science documentation for access instructions and the dataset schema.

This repository expects the arrest-ntuh-ecg/v1/... directory structure described in that documentation.


Repository structure

00_Data_Preprocessing/

x01_build_ntuh_df.py: Reads NTUH case and control metadata, merges ECG parameters, derives SCD outcome labels at 3 mo / 6 mo / 1 yr / 2 yr, and writes the primary cohort table.
x02_generate_dummy_columns.py: Adds pipeline-required columns not present in the NTUH schema (column renames, scaling factor, patient ID, interval SCD labels, modelling inclusion flag); also fills covariates used in training but absent from NTUH with realistic synthetic placeholder values for pipeline compatibility.
x03_process_10sec_ecg.py: Extracts per-ECG waveform arrays and corrects a gain inconsistency in the NTUH-derived limb leads (III, aVR, aVL, aVF).
x04_segment_beats.py: Detects R-peaks, segments individual beats, applies quality filters, and saves clean beat arrays for the beat-level model.
x05_create_morphing_beats.py: Applies stricter ECG-level and beat-level filtering for the morphing pipeline and produces a filtering waterfall table.
x06_create_aVL_feature.py: Parameterizes the aVL ECG biomarker discovered via morphing (RS-slope features) and appends the features to the cohort table.
x07_run_ntuh_style_regression.py: Replicates the regression reported in the paper supplement, examining the association between the aVL biomarker and SCD outcomes in NTUH.
x08_plot_ecg.py: Utility for visualising individual ECG waveforms.
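The beat segmentation step (x04) boils down to R-peak detection followed by fixed-width windowing. The sketch below illustrates the idea with SciPy's find_peaks on a synthetic trace; the sampling rate, window width, peak thresholds, and quality filters here are assumptions for illustration, not the values used in x04_segment_beats.py.

```python
import numpy as np
from scipy.signal import find_peaks

def segment_beats(lead, fs=500, window_ms=600):
    """Detect R-peaks on one lead and cut fixed-width beat windows around them.

    Rough sketch only: x04_segment_beats.py applies additional quality
    filters not shown here.
    """
    # R-peaks: prominent maxima at least 300 ms apart (caps rate near 200 bpm)
    peaks, _ = find_peaks(lead, height=np.percentile(lead, 90),
                          distance=int(0.3 * fs))
    half = int(window_ms / 1000 * fs) // 2
    beats = [lead[p - half:p + half] for p in peaks
             if p - half >= 0 and p + half <= len(lead)]
    return np.stack(beats) if beats else np.empty((0, 2 * half))

# Synthetic 10-second trace with one idealized R-peak spike per second
fs = 500
ecg = np.zeros(10 * fs)
for sec in range(10):
    ecg[sec * fs + fs // 2] = 1.0
beats = segment_beats(ecg, fs=fs)
```

Each row of `beats` is one beat window, centered on its detected R-peak.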

01_Predictive_Model/

x01_train_10s_ecg.py: Trains a ResNet on 10-second 12-lead ECGs with cumulative SCD probability prediction heads.
x02_train_beats.py: Trains a ResNet on segmented single-beat ECGs.
x03_predict.py: Runs inference with a trained model and saves predictions as a Feather file.
prediction_commands.md: Example prediction commands for the models produced by the training scripts.
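The "cumulative prediction heads" idea is that the model emits one logit per horizon (3 mo / 6 mo / 1 yr / 2 yr), and cumulative risk cannot decrease as the horizon lengthens. The NumPy sketch below shows one way to enforce that constraint at inference time with a running maximum; the actual head design and any monotonicity handling live in x01_train_10s_ecg.py and may differ.

```python
import numpy as np

def cumulative_risks(horizon_logits):
    """Turn per-horizon logits into monotone cumulative SCD probabilities.

    Illustration only: applies a sigmoid per head, then a running maximum
    across horizons so that risk at 2 yr is never below risk at 3 mo.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(horizon_logits, dtype=float)))
    return np.maximum.accumulate(probs, axis=-1)

# One ECG, four horizons (3 mo, 6 mo, 1 yr, 2 yr); values are made up
logits = np.array([[-3.0, -2.5, -2.8, -1.0]])
risks = cumulative_risks(logits)
```

Note the third head's raw probability is below the second's, but the accumulated output stays flat rather than dipping.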

02_Morphing/

Implements the generative model pipeline used to interpret model predictions. Requires a trained convolutional backbone from 01_Predictive_Model/ (the generator is conditioned on the risk model's latent representations).

s08_train_generator.py: Trains the VAE-based generative model on clean beat segments from ecg_beats_morphing/.
s10_morph_ecgs.py: Generates high-risk waveform morphs via latent-space perturbation.
s12_morph_stats.py: Summarizes morphing outputs into interpretable statistics.

Supporting modules (s01-s07, s09, s11) implement the generative model components, data utilities, and post-processing steps.
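The core move in latent-space morphing is to nudge a beat's latent code in the direction that increases the risk model's score, then decode. The toy below makes that concrete with a hypothetical linear "decoder" and linear risk score standing in for the trained VAE and ResNet; it is not the repository's model, only the perturbation mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained components: a linear decoder
# mapping an 8-d latent to a 300-sample beat, and a linear risk score.
decoder_W = rng.normal(size=(300, 8))
risk_w = rng.normal(size=300)

def decode(z):
    return decoder_W @ z

def risk(beat):
    return float(risk_w @ beat)

# For this linear toy, the gradient of risk w.r.t. the latent is W^T w.
grad_z = decoder_W.T @ risk_w
direction = grad_z / np.linalg.norm(grad_z)

z0 = rng.normal(size=8)                      # latent code of some beat
base = risk(decode(z0))
morphed = risk(decode(z0 + 0.5 * direction))  # push latent toward higher risk
```

Stepping along the risk gradient in latent space raises the decoded beat's predicted risk; comparing the decoded waveforms before and after the step is what exposes the morphology features driving the prediction.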


Typical workflow

All commands should be run from the repository root.

Step 1 — Data preprocessing

# Build the cohort table from NTUH source files
python 00_Data_Preprocessing/x01_build_ntuh_df.py
# Output: covariate_df.feather

# Add pipeline-required columns and derive SCD interval labels
python 00_Data_Preprocessing/x02_generate_dummy_columns.py
# Output: covariate_df.feather (updated)

# Extract and gain-correct individual ECG waveform arrays
python 00_Data_Preprocessing/x03_process_10sec_ecg.py
# Output: 10_sec_ecgs/<studyId>.npy

# Segment and quality-filter individual beats for the beat-level model
python 00_Data_Preprocessing/x04_segment_beats.py
# Output: ecg_beats/<studyId>_<beat>.npy

# Apply stricter beat filtering for morphing analysis
python 00_Data_Preprocessing/x05_create_morphing_beats.py
# Output: ecg_beats_morphing/, x05_waterfall_table_results.csv
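The gain correction applied by x03 amounts to rescaling the four derived limb leads of each (12, n_samples) waveform array. The sketch below assumes a standard 12-lead ordering and uses a placeholder scale factor of 2.0; the actual factor and lead layout are determined by x03_process_10sec_ecg.py from the NTUH source data.

```python
import numpy as np

def correct_limb_lead_gain(ecg, limb_scale=2.0):
    """Rescale the derived limb leads of a (12, n_samples) ECG array.

    The lead ordering and the 2.0 scale factor are illustrative
    assumptions, not values taken from the repository.
    """
    lead_order = ["I", "II", "III", "aVR", "aVL", "aVF",
                  "V1", "V2", "V3", "V4", "V5", "V6"]
    out = ecg.astype(float).copy()
    for name in ("III", "aVR", "aVL", "aVF"):
        out[lead_order.index(name)] *= limb_scale
    return out

ecg = np.ones((12, 5000))          # dummy 10 s recording at 500 Hz
fixed = correct_limb_lead_gain(ecg)
```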

Step 2 — Parameterize the discovered ECG biomarker

# Compute aVL RS features from waveform arrays
python 00_Data_Preprocessing/x06_create_aVL_feature.py
# Output: covariate_df.feather (updated with aVL_rs_diff, aVL_rs_diff_2)

# Run the NTUH association analysis (replicates the paper supplement regression)
python 00_Data_Preprocessing/x07_run_ntuh_style_regression.py
# Default: case-control subset

# To run on all available ECG records:
python 00_Data_Preprocessing/x07_run_ntuh_style_regression.py --filter-mode all
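One natural parameterization of an RS-slope biomarker is the downslope from the R-peak to the following S-trough. The sketch below shows that idea on a synthetic beat; the features x06_create_aVL_feature.py actually writes (aVL_rs_diff, aVL_rs_diff_2) may be defined differently, so treat this as an illustration of the concept only.

```python
import numpy as np

def rs_slope(beat, fs=500):
    """Slope (amplitude units per second) from the R-peak to the S-trough.

    Minimal sketch of one RS-slope parameterization, not the exact
    feature definition used by x06_create_aVL_feature.py.
    """
    r = int(np.argmax(beat))              # R-peak: global maximum
    s = r + int(np.argmin(beat[r:]))      # S-trough: minimum after the R-peak
    if s == r:
        return 0.0
    return float((beat[s] - beat[r]) / ((s - r) / fs))

# Synthetic beat: sharp R upstroke followed by an S dip 20 ms later
beat = np.zeros(300)
beat[150] = 1.0    # R-peak
beat[160] = -0.4   # S-trough
slope = rs_slope(beat, fs=500)
```

A steep negative value indicates a fast R-to-S descent; here the drop of 1.4 units over 20 ms gives a slope of -70 units/s.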

Step 3 — Train the predictive models

# Train the 10-second 12-lead ECG model
python 01_Predictive_Model/x01_train_10s_ecg.py
# Output: modelfits_ecg/ntuh_scd_model_demo/

# Train the beat-level model
python 01_Predictive_Model/x02_train_beats.py
# Output: modelfits_beat/ntuh_scd_beat_model_demo/

Step 4 — Generate predictions

# 10-second ECG model
python 01_Predictive_Model/x03_predict.py \
    --model_name ntuh_scd_model_demo \
    --covariate_df_path covariate_df.feather \
    --ecg_dir 10_sec_ecgs
# Output: predictions/ntuh_scd_model_demo_1/predictions.feather

# Beat-level model
python 01_Predictive_Model/x03_predict.py \
    --model_name ntuh_scd_beat_model_demo \
    --covariate_df_path covariate_df.feather \
    --ecg_dir ecg_beats \
    --beat
# Output: predictions/ntuh_scd_beat_model_demo_1/predictions.feather

For additional prediction options, see 01_Predictive_Model/prediction_commands.md.
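Downstream analyses typically start by loading predictions.feather into pandas and ranking ECGs by predicted risk. The column names below are hypothetical (inspect the file for the actual schema), and the DataFrame is built in memory here to keep the sketch self-contained; in practice you would load the real file with pd.read_feather.

```python
import pandas as pd

# In practice:
# preds = pd.read_feather("predictions/ntuh_scd_model_demo_1/predictions.feather")
# The columns below are hypothetical placeholders for illustration.
preds = pd.DataFrame({
    "studyId": ["a", "b", "c", "d"],
    "risk_1yr": [0.02, 0.30, 0.05, 0.11],
})

# Rank ECGs by predicted 1-year SCD risk, highest first
ranked = preds.sort_values("risk_1yr", ascending=False).reset_index(drop=True)
```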

Step 5 — Morphing analysis

The morphing pipeline uses the trained convolutional backbone from Step 3.

# Train the generative model
python 02_Morphing/s08_train_generator.py

# Generate high-risk morphs
python 02_Morphing/s10_morph_ecgs.py

# Summarize morph outputs
python 02_Morphing/s12_morph_stats.py
# Output: morphing_outputs/
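An "interpretable statistic" over a morph pair can be as simple as where in the beat the morph changed the waveform most, and by how much. The sketch below computes that for one original/morphed pair; it is a hypothetical summary for illustration, not the statistics s12_morph_stats.py produces.

```python
import numpy as np

def morph_delta_stats(original, morphed, fs=500):
    """Summarize an original/morphed beat pair: peak absolute change and
    its location in seconds from the start of the beat.

    Hypothetical summary for illustration only.
    """
    delta = morphed - original
    i = int(np.argmax(np.abs(delta)))
    return {"max_abs_delta": float(np.abs(delta[i])),
            "time_of_max_s": i / fs}

orig = np.zeros(300)
morph = np.zeros(300)
morph[200] = -0.25   # morph deepens a trough late in the beat
stats = morph_delta_stats(orig, morph, fs=500)
```

Aggregating such per-pair summaries across many morphs is what localizes the morphology changes that drive risk.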

Key artifacts

covariate_df.feather: Main cohort table used throughout the pipeline (patient metadata, ECG parameters, outcomes, and derived features).
10_sec_ecgs/: Gain-corrected 10-second ECG arrays, one .npy file per study.
ecg_beats/: Quality-filtered individual beat arrays for beat-level modelling.
ecg_beats_morphing/: Stricter-filtered beat directory for morphing analysis.
modelfits_ecg/: Trained model weights and hyperparameter logs for 10-second ECG models.
modelfits_beat/: Trained model weights and hyperparameter logs for beat-level models.
predictions/: Inference outputs; each subdirectory contains a predictions.feather file.
morphing_outputs/: Generated waveform morphs and summary statistics.
