envision-eye-actionable

Turns raw downloaded eye-imaging archives into HuggingFace-loadable training data. Uses a local LLM (Gemma 4 E4B via llama.cpp, runs on CPU) to analyze each dataset's file layout and produce structured directory trees with per-modality subdirectories, MeSH/NCIT ontology tags, and validated file layouts compatible with datasets.load_dataset(...) or any ML training loop.

Metadata follows the AI-READI dataset description schema (v0.1.0) — each conformed dataset includes a dataset_structure_description.json describing its on-disk layout.

Part of the EyeACT (Eye Aging, Cognition, and Imaging) project by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI2).

EyeACT aims to make eye imaging datasets across the scientific literature discoverable, classifiable, and directly usable for ML/AI research. Discovered datasets are registered on the Envision Portal.

Where this sits in the pipeline

envision-discovery handles the metadata side:

  • scrapes 7 repositories (Zenodo, Figshare, Dryad, OSF, DataCite, Kaggle, NEI)
  • classifies records as eye-imaging or not
  • downloads raw files for EYE_IMAGING records
  • exports the dataset metadata to the AI-READI schema (dataset_description.json) — describing what the dataset claims to contain

envision-eye-actionable (this repo) handles the data side:

  • unpacks every archive in each download (zip, tar, 7z, rar, multi-part zip)
  • classifies every file by format and modality (fundus / OCT / OCTA / DICOM / ...)
  • uses a local LLM agent to analyze the dataset layout and propose a placement recipe
  • hardlinks files into a structured on-disk tree
  • emits a post-materialization dataset_structure_description.json — what we actually laid down on disk, distinct from what envision-discovery's exporter produced
  • validates the tree by sample-opening one file per modality with the right reader (pydicom, nibabel, Pillow, oct-converter, ...)
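The format-classification step can be approximated by checking well-known file signatures ("magic bytes"). The sketch below is illustrative only: the `sniff_format` function and the `MAGIC_SIGNATURES` table are hypothetical names, cover only a handful of formats, and are not taken from this repository's code, which may classify files differently.

```python
# (offset, signature, label) triples for a few common formats.
# This table is an illustrative assumption, not the tool's real format matrix.
MAGIC_SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"II*\x00", "tiff"),            # little-endian TIFF
    (0, b"MM\x00*", "tiff"),            # big-endian TIFF
    (128, b"DICM", "dicom"),            # DICOM files carry a 128-byte preamble
    (0, b"\x89HDF\r\n\x1a\n", "hdf5"),
    (0, b"PK\x03\x04", "zip"),
    (0, b"7z\xbc\xaf\x27\x1c", "7z"),
    (0, b"Rar!\x1a\x07", "rar"),
]


def sniff_format(path):
    """Return a format label based on the file's leading bytes, or 'unknown'."""
    with open(path, "rb") as fh:
        head = fh.read(132)  # enough to cover the DICOM marker at offset 128
    for offset, signature, label in MAGIC_SIGNATURES:
        if head[offset:offset + len(signature)] == signature:
            return label
    return "unknown"
```

Signature checks identify the container format only; deciding the modality (fundus, OCT, OCTA, ...) needs more context, such as file names, README text, or metadata.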

The output trees are structured so HuggingFace loaders work out of the box: retinal_photography/<class_name>/*.jpg is an ImageFolder; paired image/+mask/ subdirectories are a segmentation dataset; etc.
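As a rough illustration of the ImageFolder case, the sketch below enumerates (file, class) pairs from a `<class_name>/<image>` tree using only the standard library, then shows how the same directory could be passed to the HuggingFace `datasets` library through its `imagefolder` builder. The helper names `list_imagefolder` and `load_with_huggingface` and the suffix set are assumptions made for this example, not part of this repository.

```python
from pathlib import Path

# Image suffixes considered by this sketch (an assumption, not the tool's list).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def list_imagefolder(root):
    """Return sorted (relative_path, class_name) pairs for root/<class_name>/<image>."""
    root = Path(root)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for f in sorted(class_dir.iterdir()):
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES:
                pairs.append((f.relative_to(root).as_posix(), class_dir.name))
    return pairs


def load_with_huggingface(root):
    """Load the same tree with the `datasets` library (imported lazily)."""
    from datasets import load_dataset  # requires `pip install datasets`

    # The "imagefolder" builder infers labels from the subdirectory names.
    return load_dataset("imagefolder", data_dir=str(root))
```

For example, `load_with_huggingface("data/actionable/zenodo/4521044/retinal_photography")` would be one way to load a conformed record, assuming that record was laid out as a classification dataset.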

Installation

git clone https://github.com/EyeACT/envision-eye-actionable.git
cd envision-eye-actionable
pip install -e '.[agent]'

Requires Python >= 3.10. The install command above pulls in the open-source readers needed for conform runs: pydicom, nibabel, h5py, mat73, tifffile, pynrrd, SimpleITK, pypdf, pdfplumber, python-docx, Pillow, oct-converter, py7zr, rarfile, and llama-cpp-python (for the Gemma 4 agent). See docs/readers.md for the full format matrix.

You also need a Gemma 4 E4B GGUF model file:

# Recommended: Q4_K_M quantization (~5 GB, 5-15 tok/s on CPU)
# Download from https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF

Usage

# Conform every downloaded zenodo record
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf --source zenodo

# One specific record
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo --source-id 4521044

# Every source directory under ./data/downloads/
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf --all-sources

# Re-run only records that failed
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo --rerun-status failed

# Custom input/output paths (if not running next to an envision-discovery checkout)
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo \
    --downloads-dir /mnt/bigdisk/envision-downloads \
    --output-dir    /mnt/bigdisk/envision-actionable

Or use the module entry point:

python -m envision_eye_actionable --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo

How it works

downloaded record
      |
      v
   unpack    extract every .zip/.tar/.7z/.rar/multi-part zip (recursive)
      |
      v
  inventory  walk tree; tag every file by format/modality/readability;
             collect README / PDF / DOCX / label-hint files
      |
      v
   agent     Gemma 4 E4B analyzes inventory + README + metadata
             and proposes a Recipe(task_type, placements[])
      |
      v
 materialize hardlink (default) or copy source files into target tree
      |
      v
  validate   check dataset_structure_description.json, sample-open one
             file per modality with the right reader

Full details in docs/conforming.md.
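To make the agent and materialize steps concrete, here is a minimal sketch of what a recipe and its materialization could look like. Only the shape `Recipe(task_type, placements[])` and the "hardlink (default) or copy" behavior come from the diagram above. The `Placement` fields, the `materialize` signature, and the fallback to copying when hardlinking fails are assumptions for illustration; the real implementation is described in docs/conforming.md.

```python
import os
import shutil
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Placement:
    source: str  # path relative to the unpacked download (hypothetical field)
    target: str  # path relative to the conformed tree (hypothetical field)


@dataclass
class Recipe:
    task_type: str
    placements: list = field(default_factory=list)


def materialize(recipe, src_root, dst_root, copy=False):
    """Place each source file at its target path; return the created paths."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    placed = []
    for placement in recipe.placements:
        src = src_root / placement.source
        dst = dst_root / placement.target
        dst.parent.mkdir(parents=True, exist_ok=True)
        if dst.exists():
            dst.unlink()  # keep re-runs idempotent
        if copy:
            shutil.copy2(src, dst)
        else:
            try:
                os.link(src, dst)  # hardlink: no extra disk usage
            except OSError:
                # Assumed fallback, e.g. when src and dst are on different filesystems.
                shutil.copy2(src, dst)
        placed.append(dst)
    return placed
```

Keeping the recipe as plain data separates the LLM's proposal from the file operations, so a proposed layout can be inspected or rejected before anything touches the disk.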

On-disk layout (output)

data/actionable/
+-- zenodo/
    +-- _conform_log.json                           # run summary (status per record)
    +-- 4521044/
    |   +-- dataset_structure_description.json      # AI-READI schema descriptor
    |   +-- conform_report.json                     # placements + inventory + stats
    |   +-- _unplaced/                              # orphan files the recipe didn't place
    |   +-- retinal_photography/
    |       +-- 1_healthy_young_raw_good_quality/
    |       |   +-- <files hardlinked from data/downloads/>
    |       +-- 2_healthy_young_segmented/
    |       +-- ...
    +-- 16744782/
        +-- dataset_structure_description.json
        +-- conform_report.json
        +-- retinal_photography/
            +-- ...

Hardlinks mean the conformed tree does not double disk usage — the downloaded file and the placed file share an inode.
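The shared-inode claim is easy to check with the standard library. The sketch below uses throwaway file names in a temporary directory (not real dataset files) to create a hardlink and confirm that both paths resolve to the same device and inode.

```python
import os
import tempfile


def shares_inode(a, b):
    """True if both paths refer to the same file on the same device."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    downloaded = os.path.join(tmp, "downloaded.jpg")
    placed = os.path.join(tmp, "placed.jpg")
    with open(downloaded, "wb") as fh:
        fh.write(b"example bytes")
    os.link(downloaded, placed)  # second directory entry for the same data
    same = shares_inode(downloaded, placed)
    link_count = os.stat(downloaded).st_nlink  # number of names pointing at the inode
```

Hardlinks cannot span filesystems, so the saving only applies when the downloads directory and the output directory sit on the same volume.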

Relationship to envision-discovery

By default, this tool expects envision-discovery's layout:

| Directory | What we read from it |
| --- | --- |
| ./data/downloads/{source}/{id}/ | Downloaded files + per-record manifest.json |
| ./results/{source}_eye_imaging.json | Classification metadata (optional; enriches agent context) |
| ./data/actionable/{source}/{id}/ | Output: conformed trees |

If you're not running this alongside an envision-discovery checkout, use --downloads-dir, --results-dir, and --output-dir to point elsewhere.

What the conformer does NOT do (yet)

  • Decode vendor OCT. We only verify that oct-converter imports; extraction to DICOM / numpy is a follow-up once the canonical on-disk form is decided.
  • Extract labels from PDFs / DOCX automatically. Readers are installed, but label extraction isn't wired into the agent prompt yet.
  • Handle splits (train/val/test). Most datasets provide splits explicitly as subdirectories; the conformer passes them through but doesn't tag them.
  • Cross-source deduplication — that lives upstream in envision-discovery.

Results (Zenodo, first run)

First run on 83 downloaded Zenodo eye-imaging datasets (644 GB, 421 files):

| Status | Count | % |
| --- | --- | --- |
| ok | 82 | 99% |
| failed | 1 | 1% |

The Gemma 4 E4B agent (running locally on CPU via llama.cpp, Q4_K_M quantization) produced validated directory trees for 82 of 83 records.

Citation

If you use this tool in your research, please cite the EyeACT project:

FAIR Data Innovations Hub, California Medical Innovations Institute (CalMI2). Envision: Eye imaging dataset discovery and curation pipeline. EyeACT Study, eyeactstudy.org.

License

MIT — see LICENSE. Individual dataset licenses vary; check each dataset before use.
