envision-eye-actionable

Turns raw downloaded eye-imaging archives into HuggingFace-loadable training data. The tool uses a local LLM (Gemma 4 E4B via llama.cpp, running on CPU) to analyze each dataset's file layout, then produces a structured directory tree with per-modality subdirectories and MeSH/NCIT ontology tags. The resulting layout is validated and is compatible with datasets.load_dataset(...) or any ML training loop.

Metadata follows the AI-READI dataset description schema (v0.1.0) — each conformed dataset includes a dataset_structure_description.json describing its on-disk layout.

Part of the EyeACT (Eye Aging, Cognition, and Imaging) project by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI2).

EyeACT aims to make eye imaging datasets across the scientific literature discoverable, classifiable, and directly usable for ML/AI research. Discovered datasets are registered on the Envision Portal.

Where this sits in the pipeline

envision-discovery handles the metadata side:

scrapes 7 repositories (Zenodo, Figshare, Dryad, OSF, DataCite, Kaggle, NEI)
classifies records as eye-imaging or not
downloads raw files for EYE_IMAGING records
exports the dataset metadata to the AI-READI schema (dataset_description.json) — describing what the dataset claims to contain

envision-eye-actionable (this repo) handles the data side:

unpacks every archive in each download (zip, tar, 7z, rar, multi-part zip)
classifies every file by format and modality (fundus / OCT / OCTA / DICOM / ...)
uses a local LLM agent to analyze the dataset layout and propose a placement recipe
hardlinks files into a structured on-disk tree
emits a post-materialization dataset_structure_description.json — what we actually laid down on disk, distinct from what envision-discovery's exporter produced
validates the tree by sample-opening one file per modality with the right reader (pydicom, nibabel, Pillow, oct-converter, ...)
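
As an illustration of the format-classification step, here is a minimal sketch that tags a file by its leading bytes. It is not the repository's implementation: the function name sniff_format and the format labels are assumptions, and only a handful of well-known signatures are covered. Modality (fundus vs. OCT, for example) cannot be inferred this way and would need other evidence.

```python
# Hypothetical sketch: format detection by file signature ("magic bytes").
# (offset, signature, label); the labels are illustrative, not the tool's own.
_SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"II*\x00", "tiff"),           # little-endian TIFF
    (0, b"MM\x00*", "tiff"),           # big-endian TIFF
    (128, b"DICM", "dicom"),           # 128-byte preamble, then "DICM"
    (0, b"\x89HDF\r\n\x1a\n", "hdf5"),  # HDF5 signature at offset 0
]


def sniff_format(path):
    """Return a format label based on the file's first bytes, or 'unknown'."""
    with open(path, "rb") as fh:
        head = fh.read(132)  # enough to cover the DICOM marker at 128..131
    for offset, magic, label in _SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return label
    return "unknown"
```

A signature check like this is cheap enough to run on every file, which is why it pairs well with a later sample-open using a real reader.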

The output trees are structured so HuggingFace loaders work out of the box: retinal_photography/<class_name>/*.jpg is an ImageFolder; paired image/+mask/ subdirectories are a segmentation dataset; etc.
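
A short sketch of what loading such a tree might look like. The helper names are made up for illustration; load_dataset("imagefolder", data_dir=...) is the standard HuggingFace datasets call for a <class_name>/<image> directory layout, and it is imported lazily so the counting helper works without the datasets package installed.

```python
from pathlib import Path


def count_images_per_class(modality_dir):
    """Map each class subdirectory name to the number of files it contains."""
    root = Path(modality_dir)
    return {
        sub.name: sum(1 for f in sub.iterdir() if f.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def load_as_imagefolder(modality_dir):
    """Load <modality_dir>/<class_name>/* with the ImageFolder builder."""
    # Lazy import: requires `pip install datasets`.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_dir=str(modality_dir))
```

For example, load_as_imagefolder("data/actionable/zenodo/4521044/retinal_photography") should yield a dataset whose labels are taken from the class directory names, assuming the files in that tree are images the builder can read.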

Installation

git clone https://github.com/EyeACT/envision-eye-actionable.git
cd envision-eye-actionable
pip install -e '.[agent]'

Requires Python >= 3.10. Installs all open-source readers needed for conform runs: pydicom, nibabel, h5py, mat73, tifffile, pynrrd, SimpleITK, pypdf, pdfplumber, python-docx, Pillow, oct-converter, py7zr, rarfile, and llama-cpp-python (for the Gemma 4 agent). See docs/readers.md for the full format matrix.
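
To confirm that the readers actually resolved after installation, something like the following sketch can help. The function name missing_readers is hypothetical; the mapping reflects the usual import names of the packages listed above, several of which differ from their pip names.

```python
import importlib.util

# pip distribution name -> importable module name
READER_MODULES = {
    "pydicom": "pydicom",
    "nibabel": "nibabel",
    "h5py": "h5py",
    "mat73": "mat73",
    "tifffile": "tifffile",
    "pynrrd": "nrrd",
    "SimpleITK": "SimpleITK",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "Pillow": "PIL",
    "oct-converter": "oct_converter",
    "py7zr": "py7zr",
    "rarfile": "rarfile",
    "llama-cpp-python": "llama_cpp",
}


def missing_readers(modules=READER_MODULES):
    """Return the distribution names whose module cannot be located."""
    return sorted(
        dist for dist, mod in modules.items()
        if importlib.util.find_spec(mod) is None
    )


if __name__ == "__main__":
    print("missing:", missing_readers() or "none")
```

Note that find_spec only checks that a module can be located; it does not prove the reader can open a given file.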

You also need a Gemma 4 E4B GGUF model file:

# Recommended: Q4_K_M quantization (~5 GB, 5-15 tok/s on CPU)
# Download from https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF

Usage

# Conform every downloaded zenodo record
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf --source zenodo

# One specific record
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo --source-id 4521044

# Every source directory under ./data/downloads/
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf --all-sources

# Re-run only records that failed
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo --rerun-status failed

# Custom input/output paths (if not running next to an envision-discovery checkout)
envision-conform --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo \
    --downloads-dir /mnt/bigdisk/envision-downloads \
    --output-dir    /mnt/bigdisk/envision-actionable

Or use the module entry point:

python -m envision_eye_actionable --agent-model ~/models/gemma-4-e4b-it-q4.gguf \
    --source zenodo
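
For scripted runs, the invocations above can also be assembled from Python. This is a sketch only: the helper names are hypothetical, and it uses just the flags shown in this section without assuming anything about how the CLI validates their combinations.

```python
import os
import subprocess


def build_conform_command(agent_model, source=None, source_id=None,
                          all_sources=False, rerun_status=None,
                          downloads_dir=None, output_dir=None):
    """Assemble an envision-conform argument list from the documented flags."""
    # No shell is involved, so "~" must be expanded explicitly.
    cmd = ["envision-conform", "--agent-model", os.path.expanduser(str(agent_model))]
    if all_sources:
        cmd.append("--all-sources")
    optional = [
        ("--source", source),
        ("--source-id", source_id),
        ("--rerun-status", rerun_status),
        ("--downloads-dir", downloads_dir),
        ("--output-dir", output_dir),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_conform(**kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_conform_command(**kwargs), check=True)
```

Keeping command construction separate from execution makes the argument list easy to log or inspect before a long run.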

How it works

downloaded record
      |
      v
   unpack    extract every .zip/.tar/.7z/.rar/multi-part zip (recursive)
      |
      v
  inventory  walk tree; tag every file by format/modality/readability;
             collect README / PDF / DOCX / label-hint files
      |
      v
   agent     Gemma 4 E4B analyzes inventory + README + metadata
             and proposes a Recipe(task_type, placements[])
      |
      v
 materialize hardlink (default) or copy source files into target tree
      |
      v
  validate   check dataset_structure_description.json, sample-open one
             file per modality with the right reader

Full details in docs/conforming.md.
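
To make the agent-to-materialize handoff concrete, here is one way a Recipe(task_type, placements[]) could be modeled and checked before any file is touched. Only the names Recipe, task_type, and placements come from the diagram above; the Placement fields, the plan function, and the safety checks are assumptions for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Placement:
    source: str  # hypothetical: path of the file inside the unpacked download
    target: str  # hypothetical: path relative to the record's output directory


@dataclass
class Recipe:
    task_type: str
    placements: list = field(default_factory=list)


def plan(recipe, inventory):
    """Split inventory paths into placed {source: target} and unplaced leftovers.

    Rejects targets that are absolute, contain "..", or collide with each other.
    """
    placed, seen = {}, set()
    for p in recipe.placements:
        target = PurePosixPath(p.target)
        if target.is_absolute() or ".." in target.parts:
            raise ValueError(f"unsafe target: {p.target}")
        key = str(target)  # normalized, so "a//b" and "a/b" collide
        if key in seen:
            raise ValueError(f"duplicate target: {p.target}")
        seen.add(key)
        placed[p.source] = key
    unplaced = [path for path in inventory if path not in placed]
    return placed, unplaced
```

Because the recipe comes from an LLM, validating it as plain data before materializing is a cheap guard; the leftovers returned here correspond to what would end up under _unplaced/.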

On-disk layout (output)

data/actionable/
+-- zenodo/
    +-- _conform_log.json                           # run summary (status per record)
    +-- 4521044/
    |   +-- dataset_structure_description.json      # AI-READI schema descriptor
    |   +-- conform_report.json                     # placements + inventory + stats
    |   +-- _unplaced/                              # orphan files the recipe didn't place
    |   +-- retinal_photography/
    |       +-- 1_healthy_young_raw_good_quality/
    |       |   +-- <files hardlinked from data/downloads/>
    |       +-- 2_healthy_young_segmented/
    |       +-- ...
    +-- 16744782/
        +-- dataset_structure_description.json
        +-- conform_report.json
        +-- retinal_photography/
            +-- ...

Hardlinks mean the conformed tree does not double disk usage — the downloaded file and the placed file share an inode.
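
The inode sharing can be demonstrated with the standard library alone. This is a sketch, not the tool's code: the function names are hypothetical, and the automatic fallback to copying is an assumption (the pipeline description only says hardlink is the default and copy is an alternative).

```python
import os
import shutil
from pathlib import Path


def place_file(src, dst):
    """Hardlink src to dst, falling back to a copy when linking fails.

    Returns "hardlink" or "copy" so the caller can record what happened.
    """
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(src, dst)
        return "hardlink"
    except FileExistsError:
        raise  # never overwrite an existing target silently
    except OSError:
        # e.g. src and dst are on different filesystems, or links are unsupported
        shutil.copy2(src, dst)
        return "copy"


def shares_inode(a, b):
    """True when both paths refer to the same underlying file."""
    return os.path.samefile(a, b)
```

Hardlinks only work within a single filesystem, so placing --downloads-dir and --output-dir on different volumes would force a copy-based approach and double the disk usage.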

Relationship to envision-discovery

By default, this tool expects envision-discovery's layout:

Directory	What we read from it
./data/downloads/{source}/{id}/	Downloaded files + per-record manifest.json
./results/{source}_eye_imaging.json	Classification metadata (optional; enriches agent context)
./data/actionable/{source}/{id}/	Output (conformed trees)

If you're not running this alongside an envision-discovery checkout, use --downloads-dir, --results-dir, and --output-dir to point elsewhere.
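
The default layout can be expressed as a small path helper. The function name record_paths and its dictionary keys are hypothetical; the directory defaults mirror the table above, and the three keyword arguments correspond to what --downloads-dir, --results-dir, and --output-dir would override.

```python
from pathlib import Path


def record_paths(source, record_id, downloads_dir="data/downloads",
                 results_dir="results", output_dir="data/actionable"):
    """Return the per-record paths implied by the envision-discovery layout."""
    download = Path(downloads_dir) / source / str(record_id)
    return {
        "download": download,
        "manifest": download / "manifest.json",
        # Optional file; it may be absent when running without envision-discovery.
        "classification": Path(results_dir) / f"{source}_eye_imaging.json",
        "output": Path(output_dir) / source / str(record_id),
    }
```

For instance, record_paths("zenodo", 4521044)["output"] points at data/actionable/zenodo/4521044.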

What the conformer does NOT do (yet)

Decode vendor OCT. We only verify that oct-converter imports; extraction to DICOM / numpy is a follow-up once the canonical on-disk form is decided.
Extract labels from PDFs / DOCX automatically. Readers are installed, but label extraction isn't wired into the agent prompt yet.
Handle splits (train/val/test). Most datasets provide splits explicitly as subdirectories; the conformer passes them through but doesn't tag them.
Cross-source deduplication — that lives upstream in envision-discovery.

Results (Zenodo, first run)

Applied to 83 downloaded Zenodo eye imaging datasets (644 GB, 421 files):

Status	Count	%
ok	82	99%
failed	1	1%

The Gemma 4 E4B agent (running locally on CPU via llama.cpp, Q4_K_M quantization) produced validated directory trees for 82 of 83 records.
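
The percentages in the table are rounded to whole numbers. A small tally helper shows the arithmetic; it takes a plain list of status strings because the exact format of _conform_log.json is not described here, and the name summarize is hypothetical.

```python
from collections import Counter


def summarize(statuses):
    """Return [(status, count, rounded percent)], most frequent first."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return [(s, n, round(100 * n / total)) for s, n in counts.most_common()]


# Reproduces the table above from its counts: 82 ok, 1 failed.
print(summarize(["ok"] * 82 + ["failed"]))
```

Here 82/83 is about 98.8% and 1/83 is about 1.2%, which round to the 99% and 1% reported.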

Related

envision-discovery: dataset scraping, classification, and download pipeline
envision-classifier: the SetFit eye-imaging classifier
Model weights on HuggingFace
Envision Portal: searchable catalog of discovered eye imaging datasets
AI-READI dataset description schema
EyeACT Study: Eye Aging, Cognition, and Imaging study
FAIR Data Innovations Hub

Citation

If you use this tool in your research, please cite the EyeACT project:

FAIR Data Innovations Hub, California Medical Innovations Institute (CalMI2). Envision: Eye imaging dataset discovery and curation pipeline. EyeACT Study, eyeactstudy.org.

License

MIT — see LICENSE. Individual dataset licenses vary; check each dataset before use.
