
Promptehr pr integration #878

Draft
jalengg wants to merge 40 commits into sunlabuiuc:master from jalengg:promptehr-pr-integration

Conversation

jalengg commented Mar 2, 2026

No description provided.

jalengg added 30 commits March 1, 2026 01:41
Also: accept BartConfig object as bart_config_name for tiny test models.
Guard drive.mount() with os.path.isdir('/content/drive/MyDrive') check
so re-running the cell does not raise ValueError: Mountpoint must not
already contain files.
…scade

Wrap cardiology_detect (scipy), EEG_abnormal/events (mne), sleep_staging
variants (mne), and temple_university_EEG_tasks (mne) in try/except so
that pyhealth.tasks import does not fail in Colab where numpy 2.x breaks
scipy._lib._util. Mirrors the identical fix in halo-pr-528.
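The guard pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the module name in the `import` is deliberately fake to stand in for the scipy/mne-backed task modules:

```python
# Optional-dependency import guard: the package import must not fail
# just because one backend (scipy, mne, ...) is broken or absent.
try:
    import mne_like_missing_dep  # stand-in for a task module needing mne/scipy
    cardiology_detect = mne_like_missing_dep
except ImportError:
    # Leave the name defined but None so `from pyhealth.tasks import ...`
    # succeeds at package-import time; callers check for None.
    cardiology_detect = None
```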
- When all 3 files exist in DATA_DIR (Drive-backed), print sizes and
  skip upload entirely — mirrors HALO notebook UX
- Normalize uploaded filenames via shutil.copy so Colab's duplicate
  rename (e.g. ADMISSIONS (1).csv) maps to canonical name in Drive
- Keep idempotent drive.mount() guard from previous fix
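The filename normalization can be sketched like this; the regex and helper name are illustrative (the actual notebook cell may differ), but the behavior matches the commit message — mapping Colab's duplicate-rename back to the canonical name before copying into Drive:

```python
import re

def canonical_name(uploaded_name: str) -> str:
    # Colab renames duplicate uploads, e.g. "ADMISSIONS (1).csv".
    # Strip the " (N)" suffix just before the extension.
    return re.sub(r" \(\d+\)(?=\.[^.]+$)", "", uploaded_name)

# The notebook then copies to the canonical path (paths hypothetical):
# shutil.copy(uploaded_name, os.path.join(DATA_DIR, canonical_name(uploaded_name)))
```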
Wrap ChestXray14Dataset, COVID19CXRDataset (PIL/torchvision), SleepEDFDataset,
TUABDataset, TUEVDataset (mne) in try/except so datasets/__init__ does not
fail when optional deps are absent. TUABDataset was the immediate cause:
tuab.py imports EEGAbnormalTUAB from pyhealth.tasks, which is now silently
absent when mne is unavailable. Mirrors identical guards in halo-pr-528.
- Add --force-reinstall to pip install so Colab never loads a stale
  cached build that lacks the try/except import guards
- Switch to subprocess.run with returncode check (mirrors HALO pattern)
- Update preamble: last_modified 2026-03-03, commit 394e128
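The `subprocess.run`-with-returncode-check install step looks roughly like this (a sketch of the pattern, not the notebook's exact cell; the package argument is commented out as a placeholder):

```python
import subprocess
import sys

def run_pip(*args: str) -> None:
    # HALO-style install step: run pip as a subprocess and fail loudly
    # on a nonzero return code instead of silently continuing.
    result = subprocess.run(
        [sys.executable, "-m", "pip", *args],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"pip {' '.join(args)} failed:\n{result.stderr}")

# run_pip("install", "--force-reinstall", "pyhealth")  # as described above
```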
Wrap biot (einops), cnn/graph_torchvision/torchvision/vision_embedding
(PIL/torchvision), grasp (sklearn→scipy cascade), molerec/safedrug (rdkit),
tfm_tokenizer (einops), transformers_model/text_embedding/sdoh (transformers)
in try/except — mirrors halo-pr-528. Also removes duplicate medlink import.
- tuab.py, tuev.py: wrap task imports in try/except (= None fallback)
  so TUABDataset/TUEVDataset load cleanly when mne is unavailable.
  Mirrors halo-pr-528 commit b1470ad.
- Notebook preamble: restructured to match HALO layout (What You'll
  Need / How It Works / Important Notes / References); removed
  'Why PromptEHR is different from HALO' section per user request.
- Timestamp: 2026-03-04 08:37:50 UTC
…clobber

--force-reinstall reinstalls all transitive deps, which could downgrade
scipy back to the old Colab binary. Installing scipy>=1.14 in a second
pip call after PyHealth ensures it is the final version on disk when
s4-dataset later triggers the transformers→sklearn→scipy import chain.
PIL._typing._Ink moved between Pillow versions; --force-reinstall can
leave the package in an inconsistent state. Pinning Pillow>=10.4.0 in
the post-PyHealth upgrade step ensures consistent PIL internals.
…ded each session

Root cause: s2-config called os.makedirs(DATA_DIR) before Drive was mounted,
creating a local /content/drive/MyDrive directory. The s3-upload guard then
saw isdir('/content/drive/MyDrive') == True and skipped drive.mount(), so
all file checks ran against an empty local path.

Fix:
- s2-config: skip makedirs in Colab (Drive not yet mounted)
- s3-upload: use os.path.ismount('/content/drive') guard (checks actual
  filesystem mount, not directory existence); makedirs after mount
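A sketch of the corrected guard, under the assumptions stated in the fix (the `data_dir` name is hypothetical; the notebook's real path may differ):

```python
import os

def ensure_drive(drive_root: str = "/content/drive", in_colab: bool = True) -> str:
    # os.path.ismount checks the actual FUSE mount, not mere directory
    # existence, so a stray local /content/drive/MyDrive can't fool it.
    if in_colab and not os.path.ismount(drive_root):
        from google.colab import drive  # only importable inside Colab
        drive.mount(drive_root)
    # makedirs only after the mount is confirmed (or when outside Colab)
    data_dir = os.path.join(drive_root, "MyDrive", "promptehr_data")  # hypothetical
    os.makedirs(data_dir, exist_ok=True)
    return data_dir
```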
… state

PyHealth --force-reinstall can leave numpy/scipy in a mixed state where
Python files and compiled .so extensions are from different versions,
causing 'cannot import name _center from numpy._core.umath'.

Fix: add --force-reinstall and explicit numpy~=2.2.0 to the post-PyHealth
pip upgrade step, guaranteeing all numpy/scipy files are from consistent
versions that support each other and numpy 2.x.
- Add FeatureProcessor import (was in __all__ but never imported)
- Remove LabelProcessor from __all__ (class does not exist)
- Guard ImageProcessor/TimeImageProcessor with (ImportError, RuntimeError)
  to catch broken Pillow installs that raise RuntimeError, not ImportError
- Build __all__ dynamically so guarded processors are only listed when
  their imports succeed
- Change numpy~=2.2.0 → numpy>=2.0.0 in notebook post-install to avoid
  hard ceiling at <2.3 that would downgrade as Colab numpy advances
- pyproject.toml: numpy~=2.2.0 → numpy>=2.0.0 (removes <2.3 ceiling;
  prevents downgrade when Colab has numpy 2.3.x, which was the root
  cause of the recurring _center ImportError)
- s1-setup: remove --force-reinstall and numpy from post-install step;
  use --upgrade instead (force-reinstall of scipy force-reinstalls numpy
  transitively, creating mixed-version compiled/Python state)
- s3-upload: drive.mount(..., force_remount=True) to handle stale FUSE
  mount state that raised "Mountpoint must not already contain files"
… install

Same pattern as HALO's scipy fix (b80f837): PyHealth install may
partially upgrade Pillow (via torch→torchvision→Pillow cascade),
leaving mixed .py/.so files. Force-reinstall only Pillow (--no-deps)
before it gets imported so all files come from one version.
jalengg force-pushed the promptehr-pr-integration branch from 9b5cfa5 to 732d207 on March 4, 2026 19:22
jalengg added 9 commits March 4, 2026 13:37
transformers 4.53+ eagerly imports loss_utils → image_utils →
torchvision → PIL, even for non-vision models like BART. In Colab,
Pillow is in a mixed-version state that can't be fixed by pip
(system-managed files). Fix: temporarily remove torchvision from
sys.modules during the BART import so transformers skips the vision
chain entirely. PromptEHR only needs BART, not vision functionality.
No PromptEHR task uses icustays, and most users don't have the file.
HuggingFace Trainer moves bart_model to GPU but doesn't move the
parent PromptEHR module. self.device (from _dummy_param) stays CPU
while bart_model is on GPU, causing RuntimeError during generation.
transformers defaults to beam search, which fails under batched
inference on our single-token encoder input. PromptEHR uses
nucleus/greedy sampling, not beam search.
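The fix amounts to overriding the generation defaults; a hedged sketch, with illustrative values (the exact kwargs in the PR may differ):

```python
# Force sampling instead of transformers' beam-search default.
gen_kwargs = {
    "do_sample": True,   # nucleus/greedy sampling, as PromptEHR expects
    "num_beams": 1,      # disable beam search entirely
    "top_p": 0.95,       # hypothetical nucleus threshold
}
# outputs = bart_model.generate(input_ids, **gen_kwargs)
```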
BART generate() always starts output with decoder_start_token_id
(BOS=1). The ported code treated BOS as a stop token (break),
causing decode_tokens to return empty visits for every patient.

Original pehr_scratch/generate.py::parse_sequence_to_visits uses
continue to skip BOS — this was a porting bug. Fix from
promptehr-port branch commit 97f6a7b.
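The corrected decode loop looks roughly like this (token ids and the function shape are illustrative, modeled on the `parse_sequence_to_visits` behavior described above):

```python
def parse_sequence_to_visits(token_ids, bos_id=1, eos_id=2, sep_id=3):
    # BART's generate() always emits decoder_start_token_id (BOS) first;
    # it must be skipped, not treated as a stop token.
    visits, current = [], []
    for tok in token_ids:
        if tok == bos_id:
            continue  # the porting bug was `break` here -> empty visits
        if tok == eos_id:
            break
        if tok == sep_id:
            visits.append(current)  # visit boundary
            current = []
        else:
            current.append(tok)
    if current:
        visits.append(current)
    return visits
```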