File under review: `documents/presentations/2026-eu-interchange-define-xml-automation.md` (248 additions, 0 deletions)

# Automating Define-XML Generation in the CDISC 360i Program
**Comment (Collaborator):**

I think we can use this opportunity to strike a more strategic tone.

1. Frame the Shift as a Strategic Transformation, Not Just a Process Fix
Instead of solely focusing on "we are replacing spreadsheets to reduce manual copy-paste errors," frame it as moving from a static, document-based past to a dynamic, machine-consumable future. This helps conference attendees understand that we are not just building a faster spreadsheet. We are creating a fundamentally new metadata backbone that unlocks the true value of their standards investments.

2. Create a Clear "Current vs. Future" Contrast
We have inefficiencies and future states in separate sections. Bringing them together into a direct comparison makes the future direction immediately obvious and easy to digest for non-technical attendees.

| The Current Standard (Spreadsheets) | The Future Direction (Metadata-Driven) |
| --- | --- |
| Static & Study-Specific: Templates must be manually adapted for every new trial. | Dynamic & Reusable: A structured model that scales systematically across projects. |
| Laborious: Requires manual interpretation, copy-pasting, and extensive QC. | Machine-Consumable: Enables automated validation and generation directly from the source. |
| Siloed Intermediary: Breaks the chain between upstream concepts and downstream artifacts. | Connected Backbone: Directly links the study design and biomedical concepts to the final Define-XML. |

3. Strengthen the "Why This Matters"
The point "Realizing the benefits of your investment in standards via metadata-driven automation" is currently buried as the fifth bullet point in its section. This should be the headline. The core message should be: Spreadsheets trap your standards in unreadable formats; the Data Definition Engine activates them.

**Reply (@swhume, Apr 24, 2026):**

I agree that we want to highlight a new way of working driven by study design. Using existing metadata sources, like spreadsheets, just creates a better bridge to that future until USDM is business as usual.


---

## Scope of the 360i Data Definition Engine Project
1. Define-XML generation
2. ODM-based CRF generation
3. Dataset Shell generation
4. Trial Design datasets generation
5. Generating SDTM datasets from Lab DTAs
6. Test a new draft model: Data Definition Specification (DDS)

This presentation focuses on SDTM Define-XML generation.

Speaker Notes:
While this presentation focuses on SDTM Define-XML generation, most of the principles covered here apply to our other
deliverables. Instead of giving an overview of the 360i program or even an overview of our team's work, I will go into
a bit more details on the SDTM Define-XML generation deliverable since we're further along on this one and it happens
to be the feature that I've worked the most on.

---

## What Are We Trying to Accomplish?
1. Create a solution that maximizes automation and minimizes manually created metadata to generate Define-XML
2. Support generating Define-XML from the study design; this uses USDM -> Biomedical Concepts -> SDTM Dataset Specializations
**Comment (Collaborator):**

Let's add CRF Specializations. We already shared some details about it in the BC webinar in March.

**Reply (Author):**

I think we could mention that when we cover the Define-XML generation using USDM + BCs + DSSs. We can say we plan to follow a similar approach to generating the CRFs using CRF Specializations.

**Reply (Author):**

Since the focus of the presentation is on SDTM Define-XML generation, I added the CRF Specializations to the speaker's notes as something to mention.

3. Support using multiple sources of metadata to generate the Define-XML, such as existing metadata spreadsheets or MDRs
4. Create a new Data Definition Specification model to support metadata-driven automation
5. Identify and resolve metadata/standards gaps impeding automation

Speaker Notes:
This slide focuses on SDTM Define-XML generation, as noted previously. Our main focus is on creating a new way of
automating Define-XML generation using USDM + BCs + DSSs. We'll use this same approach for ODM CRF generation,
but will use CRF Specializations instead of SDTM DSSs.

This is an innovative project. We're working towards making a leap forward, and to do this we've had to learn several
new standards and models without any formal training. The work has, at times, been exploratory. We're doing new
things, and this calls for a pioneering spirit and a willingness to deal with content and processes that are rough
and incomplete.

Even supporting today's state-of-the-art metadata sources, such as existing metadata spreadsheets and MDRs, requires
innovation because we are loading this content into the new DDS model.

Identifying gaps in the standards and pioneering new ways of working are what this project is all about.
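To make the intended USDM -> BC -> DSS flow concrete, here is a minimal sketch. All function names, the USDM JSON shape, and the `library_client` object are illustrative assumptions, not the actual DDE code or the CDISC Library API.

```python
# Illustrative sketch only: the USDM JSON shape, function names, and the
# injected library client are assumptions, not the actual DDE or CDISC
# Library API.

def extract_biomedical_concepts(usdm_study: dict) -> list[str]:
    """Collect the Biomedical Concept IDs referenced by the scheduled activities."""
    bc_ids: set[str] = set()
    for activity in usdm_study.get("scheduledActivities", []):
        bc_ids.update(activity.get("biomedicalConceptIds", []))
    return sorted(bc_ids)

def lookup_dataset_specializations(bc_ids: list[str], library_client) -> dict:
    """Map each BC to its SDTM Dataset Specializations via the injected client."""
    return {bc_id: library_client.get_specializations(bc_id) for bc_id in bc_ids}
```

With a real CDISC Library client injected in place of `library_client`, the resulting BC-to-DSS map is the kind of input a Define-XML generator could consume.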

---

## Why This Matters
1. Bridging future methods of generating Define-XML with current ways of working
2. An open-source solution with ongoing development and maintenance by a team of experts
3. Higher quality Define-XMLs generated more efficiently
4. Realizing the benefits of your investment in standards via metadata-driven automation

Speaker Notes:
We are demonstrating new ways of generating study artifacts, like Define-XML, starting from the study design metadata.
We seek to use this method to maximize automation. This makes it easier to generate a Define-XML at the very beginning
of a study so that it can be used as a specification. It also makes it easier to re-generate the Define-XML
specification when the study design is amended.

---

## Inefficiencies in Today's Process
1. Manual and Inefficient Workflow
- Current Define-XML workflows rely heavily on spreadsheets, local conventions, and manual editing, causing inefficiencies.
**Comment (@dostiep, Apr 24, 2026):**

See my point about spreadsheets, we even mention here that using those is somehow a liability.

I think we also need to mention that spreadsheets are not "single source of truth". Or at least we need somewhere in this presentation to make it clear the intention is also to avoid duplicate information in different sources which could lead to inconsistencies.

**Reply (Author):**

I view spreadsheets mostly as a way to help the many folks who use them today get started with the DDE application. As they begin implementing USDM, they can support those new studies, with the goal of moving the industry off spreadsheets in the long run. I know many sponsors who use metadata spreadsheets (or a mix of an MDR and spreadsheets), and some have expressed interest in open-source tools for generating Define-XML.

2. Error-Prone Processes
**Comment (@chowsanthony, Apr 23, 2026):**

Add "Every 1.32 studies go through protocol amendments. Manual process is not only error-prone, but unsustainable and expensive."

Ref: Getz K, Smith Z, Botto E, Murphy E, Dauchy A. New Benchmarks on Protocol Amendment Practices, Trends and their Impact on Clinical Trial Performance. Ther Innov Regul Sci. 2024 May;58(3):539-548. doi: 10.1007/s43441-024-00622-9. Epub 2024 Mar 4. PMID: 38438658.

**Reply (Author):**

Automation driven by the Study Design metadata available in USDM is a key point here.

- Manual copy-paste and ad hoc edits lead to mismatches and inconsistencies in metadata and datasets, increasing QC churn.
3. Limited Maintainability and Reuse
- Spreadsheet templates are study-specific and not machine-interpretable, limiting standardization and systematic artifact generation.
4. Automation Opportunity
- Generating Define-XML from a consistent metadata backbone reduces errors, streamlines updates, and scales across projects.

Speaker Notes:
I think most of us would agree that the current process is inefficient and error-prone. It's too manual. Define-XML
generation often occurs at the end of the study and is not available as a specification. With the availability of new
standards and models, we believe we can increase the automation and quality of the Define-XML generation process.
Roughly one in every 1.32 studies goes through protocol amendments (Getz et al., 2024). Our existing manual processes
are not only error-prone, but unsustainable and expensive. Driving Define-XML generation from the study design should
provide major quality and efficiency benefits.

---

## The Project: The Data Definition Engine (DDE)
![dde_architecture_slide.png](dde_architecture_slide.png)
**Comment (Collaborator):**

In the image, I think the DC/ADaM Loader box and the Additional box are mixed up.

**Reply (Author):**

I'll generate an updated version.


**Comment (Collaborator):**

This slide is probably the most important one for the audience to grasp what this project is all about. I will probably need to stay on this slide for a couple of minutes. I will need "presenter's notes" to be added so I can talk while the crowd is looking at the slide. I'll think of something, but feel free to add content.

**Reply (Author):**

The Solution Overview document provides a description. I haven't pushed the latest as I need to review it. I will get that pushed soon. Then we can pull some speaker's notes from there.

1. Metadata Sources
**Comment (Collaborator):**

As mentioned in a previous comment, shouldn't we replace the spreadsheet with the sponsor's MDR?

**Reply (Author):**

How about I update the slide to have MDR/Metadata Spreadsheet? I like the MDR interface, but I'm not sure how feasible it is for us to implement. If you know of an MDR API or something similar we can test, please add it to the backlog. We may need MDR vendor support to do that. Spreadsheets are widely used and something we can develop. I showed this diagram to a sponsor, and they were pretty happy with what we are doing. They use spreadsheets and are also implementing an MDR (maybe their 2nd or 3rd try with different MDR vendors).

**Comment (Collaborator):**

I feel the sponsor's pain; our first attempt at an MDR failed miserably! I now have my standards in YAML and I'm using a Python GUI for extraction. At least the YAML format is vendor-neutral and can later be converted for integration into any MDR system.

2. Loaders
3. The Data Definition Specification (DDS) model
4. Generators
5. Study Artifacts
6. A Refinement Pipeline

Speaker Notes:
This slide represents the key ideas I want to highlight in this presentation. It shows the Data Definition Engine
(DDE) architecture and highlights the components needed to achieve our goals.

TODO: Expand on each part of the architecture to understand the role each component plays in the process.

---

## DDE: Generating Define-XML from the Study Design
![dde_define_xml_slide.png](dde_define_xml_slide.png)

1. Metadata Sources
- USDM + BCs + DSSs
- CDISC Library
2. Loaders
- USDM + BCs + DSSs
3. DDS
- JSON model
4. Generators
- Define-XML
5. A Refinement Pipeline
- Define-XML with PLACEHOLDERS
- Study level refinement of Define-XML

Speaker Notes:
This slide focuses on one of our main 360i deliverables: generating Define-XML from a USDM-based study design. The
generated Define-XML can be used as an initial specification for the study. Since it's generated from the study design,
it can be generated at the very beginning of the study and updated as the study design is amended. These are some
process benefits that come along with the improvements in automation.

This approach retrieves the Biomedical Concepts from the USDM schedule of activities. It then uses the CDISC Library API
to look up the DSSs for each BC. Using this information, we are able to populate much of the Define-XML metadata,
including bits that are considered more challenging, like Value Level Metadata. However, there are quite a few gaps
in the metadata needed to fully generate a conformant Define-XML. These tend to be basic, study-level content like
KeySequence (which variables are keys), Length, and whether a variable is mandatory. During the initial
Define-XML generation, we use placeholders for these metadata items. Then, in the refinement pipeline, those placeholders
are replaced with study-level metadata. The final step in that pipeline is performed at the end of the study to make any
changes or additions needed to make the Define-XML submission-ready.
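The placeholder-refinement step described above can be sketched as follows. The placeholder token, the metadata shape, and the override keys are all hypothetical; they only illustrate the idea of generating first with placeholders and substituting study-level values later.

```python
# Hypothetical sketch of the refinement step: placeholder values in the
# generated variable metadata are swapped for study-level values. The token
# and dict shapes are assumptions, not the DDE's actual representation.
PLACEHOLDER = "__TBD__"

def refine(variable_metadata: list[dict], overrides: dict) -> list[dict]:
    """Return a copy of the metadata with placeholders replaced where an
    override exists; unresolved placeholders survive for a later pass."""
    refined = []
    for var in variable_metadata:
        updated = dict(var)  # do not mutate the input
        for attr, value in var.items():
            if value == PLACEHOLDER:
                updated[attr] = overrides.get((var["Name"], attr), PLACEHOLDER)
        refined.append(updated)
    return refined
```

Run repeatedly, each pass of the pipeline can resolve more placeholders until the Define-XML is submission-ready.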

---

## Plug-In Architecture
1. Add new loaders to address different sources of metadata, such as MDRs or other proprietary sources
2. Add new generators to create new study artifacts or variations of the supported artifacts
3. New approaches to the refinement pipeline
**Comment (Collaborator):**

Need more details here as I don't get the idea.

**Reply (Author):**

Without getting into the technical details (I will write something up on this topic), the idea is that we provide a way for others to write their own loader code that can be used by this application. MDRs would be a great example. We, the project, may not be able to build an MDR loader (without vendor help), but we could set this up so someone could write one that could be added. Basically, the architecture should allow others to extend it with their own loaders that pull metadata from whatever sources they have to populate the DDS, and then can take advantage of the generators to create the outputs.

**Comment (Collaborator):**

Interesting, I can even give it a shot with my YAML standard files but they are for ADaM standards, not SDTM.


Speaker Notes:
The plug-in architecture allows implementers to add new loaders and generators to support new metadata sources and new
study artifacts. For example, if your organization uses an MDR with an API, a loader could be created to
extract metadata to load into the DDS. The generators work the same regardless of how the content is loaded into the
DDS. Similarly, implementers can add new generators or extend the existing ones to create new study artifacts or to
add user-specific variations to existing ones. This adds the flexibility to extend the DDE solution to support new
metadata sources and new study artifacts.
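The plug-in idea could look roughly like the following. The `Loader`/`Generator` protocol names and the stub classes are assumptions for illustration, not the DDE's actual extension API.

```python
# Hypothetical plug-in interfaces: any loader that can populate the DDS can be
# paired with any generator, which is the extensibility point described above.
from typing import Protocol

class Loader(Protocol):
    def load(self, source: str) -> dict:
        """Read metadata from a source (spreadsheet, MDR API, ...) into a DDS document."""
        ...

class Generator(Protocol):
    def generate(self, dds: dict) -> str:
        """Produce a study artifact (e.g. Define-XML text) from a populated DDS."""
        ...

class SpreadsheetLoader:
    def load(self, source: str) -> dict:
        # A real loader would parse the workbook; this stub returns a minimal DDS.
        return {"source": source, "datasets": []}

def run_pipeline(loader: Loader, generator: Generator, source: str) -> str:
    """Generators behave the same regardless of which loader populated the DDS."""
    return generator.generate(loader.load(source))
```

An organization with an MDR would implement its own `Loader` against the MDR's API and reuse the existing generators unchanged.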

---

## Timeframes for the Solution Targets
1. Current State: current ways of generating Define-XML, e.g., metadata spreadsheets
2. New State: generating Define-XML using USDM + BCs + DSSs
3. Future State: future standards, like a new JSON-based version of Define-XML, DTAs, etc.

Speaker Notes:
As noted previously, our main target is the New State: using USDM + BCs + DSSs to generate Define-XML and other study
artifacts. However, most sponsors have not yet adopted new standards like USDM, and they have other sources of study
metadata that they currently use to generate their Define-XMLs. We would like to support as many common sources of
existing metadata as possible.

For example, we would like to support commonly used metadata spreadsheets. This allows organizations to start using the
DDE right away, and they can move to the new standards when they are ready. As they are implementing major new changes,
such as USDM, they will also likely continue to have the need to support their current processes until they've
completed the transition. It's also true that there are a lot of existing metadata spreadsheets, and loading this
content helps us test the DDS model as well as the loaders.

---

## Metadata Gaps Identified
1. Numerous gaps in the metadata needed to automate Define-XML generation were identified
2. Examples include KeySequence, Length, ...
**Comment (Collaborator):**

Should I be more specific during the presentation, for example mentioning that those are not available in the CDISC Library and thus need user input? Or should I just mention the few examples and move on?

**Reply (Author):**

We could move our list of gaps identified in Phase 1 into this repo and let the audience know where to find more details. Then, during the presentation, you can highlight some examples that get the point across. You can state that not all the metadata needed to generate a Define-XML is available via the USDM + BC + DSS content.

3. To address the missing metadata, we used placeholders in the first Define-XML generated
4. Gaps may drive updates to standards
**Comment (Collaborator):**

Or, in this case, necessitates a new standard.

**Comment (Collaborator):**

@chowsanthony Do you have something in mind in terms of a new standard?


Speaker Notes:
Before beginning the project, we understood that implementing end-to-end automation using new or existing standards
would identify gaps or misalignments in the available standards metadata. In this context, gaps and misalignments
aren't bugs; they're features. We expected to find gaps, and we've found and documented many of them.

TODO: add a better list of metadata gap examples.

---

## Data Definition Specification (DDS)
**Comment (Collaborator):**

Globally, this is the slide I'm going to struggle with as I'm not the author of the model. I know the benefits it adds for Define-XML creation, but I'm having a hard time figuring out everything else it might support. A few notes would be appreciated, mostly for me to fully understand the message we're trying to pass.

**Reply (Author):**

I'll add some notes on this. I don't think we need to get into everything it might do as I think that's not 100% determined yet. That said, we can highlight how it will help with CRF generation, LabV2 to SDTM transformations, etc.

1. Role of Specification Model
- The model links upstream conceptual models to downstream artifacts, replacing unreliable spreadsheet intermediaries.
- The DDS is a new draft model that we will publish as a standard after we complete our 360i work.
2. Key Characteristics
- Defines consistent metadata structure to support standards-driven generation and maintainability through metadata updates.
- Targets automation in a way that Define-XML was not designed to support.
3. Alignment and Automation Benefits
- Structured metadata enables automated validation, controlled terminology checks, and reliable value-level metadata building.
- Provides the metadata to support many different automation tasks, beyond generating define.xml and ODM-based CRFs.
**Comment (Collaborator):**

Such as? An example or two would help I think, this could be added to the presenter's notes, not the slides.

**Reply (Author):**

I'll add a note or two on this.

4. Extensibility and Feedback Loop
- Designed for extensibility and interoperability, the model reveals standards gaps, fostering continuous improvement.

Speaker Notes:
The Data Definition Specification (DDS) is a new draft model that we will publish as a standard after we complete our
360i work. We are using the DDS because a new model is needed to support end-to-end automation. Define-XML was never
intended to drive end-to-end automation, though it has been used to do so at times.

For example, the DDS allows us to define both the data supply and demand, sometimes referred to as the source and
target datasets. It allows us to define derivation methods in a more complete manner to support automation
and not just provide documentation of the code used. It also does a better job of representing semantics and
relationships. DDS and Define-XML were defined with different primary goals in mind.

In this presentation, I do not have time to get into the details of the DDS, but we are using it in our project and will
be working towards publishing it as a standard, so there will be opportunities to learn more about it or even to get
involved in its development.
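To give a feel for the "supply and demand" idea mentioned above, here is a purely illustrative fragment of what a DDS entry might capture. The DDS is still a draft model, so every key name and value here is an assumption, not the published schema.

```python
# Purely illustrative, assumed DDS-style entry: it pairs a target variable
# (the "demand") with its source data (the "supply") and an executable
# derivation expression, rather than documentation-only prose.
dds_entry = {
    "target": {"dataset": "VS", "variable": "VSSTRESN"},
    "sources": [{"dataset": "VS_RAW", "variable": "RESULT"}],
    "derivation": {
        "description": "Convert the collected result to standard units",
        "expression": "RESULT * 1.0",  # machine-executable, not just a comment
    },
}
```

The point of the shape is that a generator (or a transformation engine) can act on `expression` directly, which Define-XML's documentation-oriented methods were not designed for.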

---

## Future Work
1. Refine and enhance the current work-in-progress to create a usable solution
2. Add support for ADaM Define-XML
3. DTA to SDTM transformations
4. Plug-In Architecture
5. Initial release targeted by EOY
**Comment (Collaborator):**

In what form will a release look like?

**Comment (Collaborator):**

Maybe re-use the graphical image and highlight the boxes that represent what we want to deliver by EOY?

**Reply (Author):**

I hope we have an initial release candidate ready that covers, at the very least, the Define-XML generation. I believe we will also have the CRF generation ready as well, but less visibility on that one at the moment. We can generate an image that conveys what we expect to be ready for EOY.

**Reply (Author):**

I added a brief description in the speaker notes.


Speaker Notes:
We are working towards refining and expanding the current work-in-progress to create a usable solution. To that end, we
plan to publish a Release Candidate by the end of the year. It will be available as a release on GitHub. We aim for
this to be a usable solution available to anyone without licensing fees or restrictions. The project is also open to
contributors, and we hope others will contribute so that it becomes a more robust solution for everyone, rather than
just a few of us supporting it.

---

## Key Takeaways
1. Proof of Automation
- Define-XML can be generated consistently from structured, standards-based metadata organized in a machine-consumable model.
2. Phased Progress Achieved
- Phase 1 delivered automated SDTM Define-XML generation using new metadata specification models and biomedical concepts.
3. Forward Extension Plans
- Phase 2 will automate ADaM Define-XML incorporating analysis concepts to represent analytical intent as structured metadata.
**Comment (Collaborator):**

Just cross-checking: is "DTA to SDTM transformations" not to be mentioned here? Maybe I'm just confused with the 360i Phase 2 goals :)

**Reply (Author):**

DTA-to-SDTM Transformations are listed as a sub-project at the beginning of the presentation, alongside the other feature team deliverables. We then say the presentation will largely focus on Define-XML generation, since that's primarily what we've worked on so far, and we want to provide specifics on our work in this presentation. Since we don't go into any details on it, it's not a major takeaway, and this holds for most of the other feature teams as well.

4. Process Improvement Benefits
- Moving from spreadsheets to metadata-driven automation enhances reuse, maintainability, quality, and reduces errors.
5. Standards Feedback Loop
- Automation efforts reveal gaps in CDISC standards, guiding standards evolution and better implementation.
6. Open-Source Adoption
- Open-source solutions promote interoperability, accelerate learning, and reduce duplicated automation efforts across organizations.

Speaker Notes:

---

## Questions?
- Thank you!
- Where to Find Our Work: https://github.com/cdisc-org/data-definition-engine
- Join Us: https://www.cdisc.org/volunteer/form
- Contact
Binary file added documents/presentations/dde_define_xml_slide.png
**Comment (Collaborator):**

Just not to forget: suggest removing "JSON" from the "JSON Model" label in the larger orange box in both diagrams.
