-
-
Notifications
You must be signed in to change notification settings - Fork 15
Data Extraction
This page describes the automated data extraction pipeline for processing PowerPoint anatomy presentations and converting them into structured JSON data for the Digital Bone Box web application.
The extraction process involves multiple Python scripts that work together to extract different types of data from PowerPoint XML files:
- Images (bone illustrations in various views)
- Text annotations (bone names, labels, descriptions)
- Colored regions (anatomical region overlays with precise path coordinates)
- Metadata (bone relationships, hierarchies, and hyperlinks)
- Purpose: Extract bone images from PowerPoint slides
-
Script:
extract_bone_images.py - Output: Image files taken from the slides.
- Purpose: Extract boneset/bone/subbone relationships
-
Script:
xml_boneset_reader.py -
Output: JSON files defining bone hierarchies (e.g.,
bonesets.json,bones.json,subbones.json)
- Purpose: Extract descriptive text and bullet points from slides
-
Script:
Extract_Bone_Descriptions.py - Output: JSON file with bone names and anatomical descriptions
- Purpose: Extract precise path data for anatomical region overlays
-
Script:
ColoredRegionsExtractor.py - Output: JSON files with path coordinates for region overlays
- Purpose: Extract annotation labels with their pointer lines and target regions
-
Script:
extract_text_labels.py - Output: JSON with text positions, pointer line paths, and connection endpoints
- Purpose: Extract image rotation, flip, and side-by-side layout information
-
Script:
bony_pelvis_rotation.py - Output: Template JSON for consistent image positioning across slides
- Purpose: Apply offset corrections to align extracted regions with images
-
Script:
calibrate_colored_regions.py - Output: Calibrated region JSON files with adjusted coordinates
-
Rename the PowerPoint file extension from
.pptxto.zip. This will convert it into a zipped folder that you can then extract to a folder. -
In each file necessary, rewrite the directory structures included to be the folder that the PowerPoint data was extracted to.
-
Extract all data by running the scripts in this order:
a.
extract_bone_images.py: Extract raw images from slidesb.
xml_boneset_reader.py: Generate bone hierarchy lookup filesc.
Extract_Bone_Descriptions.py: Extract descriptive text for each boned.
ColoredRegionsExtractor.py: Extract colored regionse.
extract_text_labels.py: Extract annotation labels and pointer linesf.
bony_pelvis_rotation.py: Extract data for adjusting rotation of images to match how they are rotated in the PowerPoints. This file is probably not necessary to use. It appears to be built very inflexibly and specifically for certain images in the pelvis boneset. Guidance on how to generate data about how images should be rotated will be coming soon, but for now you can skip this step. From what I can see in the PowerPoints, it seems like few images need to be rotated anyway, so we can get away with not having this data for now.g.
calibrate_colored_regions.py: Apply offset corrections to region coordinatesEach of the extraction files takes several arguments to customize the input and output files. You can first run the files with
--helpto see the arguments taken. -
Clean up data by manually editing while referencing the PowerPoint to see what information should be associated with each boneset, bone, or subbone according to the slides. Some of the extraction scripts may give incomplete data:
a.
extract_bone_images.pyDoes not assign final image names. Images should be given a name of the format
<bone_id>_image.jpgif it is the only image of that boneset/bone/subbone, or<bone_id>_image<number>.jpgif it is one of several images of that ID, wherebone_idis the ID of the boneset, bone, or subbone being described. See theimages/directory in the database for an example of how names should be assigned.Duplicate images can be removed entirely. If more than one bone or subbone needs to reference an image, that can be done in the description JSON; see the instructions for cleaning the output of
Extract_Bone_Descriptions.py.b.
xml_boneset_reader.pyMay not extract all the hierarchical data. Try running on a different slide number to see if more data is extracted (usually no data will be extracted from the title slide). For a reference of how the JSON files should be structured, see the
boneset/,bones/, andsubbones/directories in the database. Each boneset/bone/subbone should have a name, a unique ID (which should be the name in lowercase with spaces replaced by underscores—e.g. a bone with the name "Bone Name" has ID "bone_name"), and a list of its child bone IDs (if applicable). They should be split into separate JSON files with the title<bone_id>.jsonin their respective directories, wherebone_idis the ID of the boneset, bone, or subbone being described.c.
Extract_Bone_Descriptions.pyThe file may make some general errors in parsing out the descriptions. You should keep the PowerPoint open and verify that each bone object has the appropriate boneset/bone/subbone name, ID (which should be the name in lowercase with spaces replaced by underscores—e.g. a bone with the name "Bone Name" has ID "bone_name"), and description, which should be a list of strings of each line of text in the description. That list should strictly only include text in the description of the boneset, bone, or subbone, not any extra information.
The file also does not assign final file names. Files should be given a name of the format
<bone_id>_description.jsonwherebone_idis the ID of the boneset, bone, or subbone being described.Inside the description should then be optionally added an "images" field that is a list of the filenames of the images associated with that bone as extracted by the image extraction script. The "images" field is necessary if a boneset, bone, or subbone has one or more images associated with it.
See the
descriptions/directory in the database for an example of how description JSON files should look.d.
ColoredRegionsExtractor.pyDoes not assign final image names. Files should be given a name of the format
<bone_id>_colored_regions.jsonwherebone_idis the ID of the boneset, bone, or subbone being described. See theannotations/ColoredRegions/directory in the database for an example of how names should be assigned.e.
extract_text_labels.pyDoes not assign final file names. Files should be given a name of the format
<bone_id>_text_annotations.jsonwherebone_idis the ID of the boneset, bone, or subbone being described. See theannotations/text_label_annotations/directory in the database for an example of how names should be assigned.