Define-JSON: Technical Deep-Dive for Data Engineers

This guide is aimed at data engineers and Python programmers working with clinical trial data pipelines. It assumes familiarity with JSON schemas, data modelling concepts, and ideally some exposure to CDISC standards (though the latter is not required).

Architecture Overview
Schema Foundation: LinkML
The Root Object: MetaDataVersion
ItemGroup and Item: The Data Layer
Slicing: Parameter-Specific Definitions
Conditions, WhereClauses, and RangeChecks
Methods, FormalExpressions, and Analysis
CodeLists and Controlled Vocabulary
Semantic Layer: Coding and ReifiedConcept
Data Products and Dataflows
Versioning and Provenance
XML ↔ JSON Conversion
Reverse Engineering Metadata from Data
Working with the Model in Python
OID Reference Patterns

1. Architecture Overview

Define-JSON is a flat, reference-based JSON structure. Unlike deeply nested formats (e.g. XML hierarchies), most collections live at the top level of MetaDataVersion and reference each other by OID.

MetaDataVersion
├── itemGroups[]         ← datasets, FHIR profiles, concept templates
│   └── items[]          ← inlined within each group
├── items[]              ← top-level template items (no group)
├── codeLists[]          ← permissible value sets
├── conditions[]         ← reusable logical expressions
├── whereClauses[]       ← named data contexts (link conditions to structures)
├── methods[]            ← derivation procedures
├── analyses[]           ← analysis-specific method extensions
├── concepts[]           ← abstract biomedical concepts (ReifiedConcept)
├── codings[]            ← shared semantic tags
├── relationships[]      ← inter-element semantic links
├── standards[]          ← CDISC/external standard references
├── dictionaries[]       ← external code systems (MedDRA, SNOMED, etc.)
├── dataProducts[]       ← governed data product definitions
└── displays[]           ← rendered analysis outputs

Key design decision: Items within an ItemGroup are inlined (embedded objects), but cross-references between groups use OID strings. Items inside a group therefore need no lookup, whereas codeList, method, applicableWhen, and similar cross-cutting references must be resolved by OID against the top-level collections.

2. Schema Foundation: LinkML

The model is defined in LinkML, a YAML-based schema language that generates JSON Schema, OWL, Python dataclasses, and more. The source schema is at https://cdisc.org/define-json.

Key LinkML concepts used throughout:

is_a: single inheritance (e.g. ItemGroup is a GovernedElement)
mixins: multiple inheritance for cross-cutting concerns (e.g. IsProfile, IsODMStandard)
multivalued: true: the slot holds a list
inlined: true: the object is embedded, not referenced by ID
any_of: polymorphic ranges (e.g. owner can be a string, User, or Organization)
identifier: true: marks the primary key slot (OID on all Identifiable classes)

You can generate Python dataclasses from the schema. The gen-python command ships with the linkml package, which installs linkml-runtime as a dependency:

pip install linkml
gen-python https://cdisc.org/define-json > define_json_classes.py

3. The Root Object: MetaDataVersion

MetaDataVersion is the tree root (LinkML tree_root: true) — the top-level object in any serialised Define-JSON document.

Required fields

Field	Type	Description
`OID`	string	Local identifier. Use `MDV.STUDY-001.v1` format for submissions
`fileOID`	string	Identifier for the ODM file
`creationDateTime`	datetime	ISO 8601 timestamp
`odmVersion`	string	e.g. `"2.0"`
`fileType`	string	`"Snapshot"` or `"Transactional"`
`studyOID`	string	Identifier for the parent study

Minimal valid document

{
  "OID": "MDV.LZZT.v1",
  "fileOID": "ODM.LZZT",
  "creationDateTime": "2024-01-15T10:00:00",
  "odmVersion": "2.0",
  "fileType": "Snapshot",
  "studyOID": "STUDY.LZZT",
  "studyName": "LZZT Phase II",
  "itemGroups": [],
  "codeLists": [],
  "methods": [],
  "conditions": [],
  "whereClauses": []
}
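
A quick pre-flight check for the required fields listed above could look like the following sketch. The function name `missing_required_fields` is hypothetical; the field names come from the table above. It only tests for presence and is not a substitute for full schema validation.

```python
# Required MetaDataVersion fields, as listed in the table above.
REQUIRED_MDV_FIELDS = (
    "OID", "fileOID", "creationDateTime",
    "odmVersion", "fileType", "studyOID",
)

def missing_required_fields(mdv: dict) -> list[str]:
    """Return required field names that are absent or empty in `mdv`."""
    return [f for f in REQUIRED_MDV_FIELDS if mdv.get(f) in (None, "")]

# Example document with studyOID deliberately left out.
doc = {
    "OID": "MDV.LZZT.v1",
    "fileOID": "ODM.LZZT",
    "creationDateTime": "2024-01-15T10:00:00",
    "odmVersion": "2.0",
    "fileType": "Snapshot",
}
print(missing_required_fields(doc))  # ['studyOID']
```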

Inheritance chain

MetaDataVersion
  is_a: GovernedElement
    mixins: [Identifiable, Labelled, Governed]
  mixins: [ODMFileMetadata, StudyMetadata]

This means MetaDataVersion picks up slots from all five classes: OID/uuid from Identifiable, name/label/description/coding/aliases from Labelled, mandatory/owner/purpose/lastUpdated/wasDerivedFrom/comments from Governed, ODM file fields from ODMFileMetadata, and study identification from StudyMetadata.

4. ItemGroup and Item: The Data Layer

ItemGroup maps to a dataset, FHIR resource profile, OMOP table, or form section depending on context. Item is a single variable/column within it.

ItemGroup key slots

Slot	Type	Notes
`OID`	string (required)	Primary key, e.g. `"IG.VS"`
`name`	string	Short name, matches dataset name in data files
`domain`	string	CDISC domain abbreviation, e.g. `"VS"`, `"LB"`
`type`	`ItemGroupType` enum	`Dataset`, `DatasetSpecialization`, `FHIR`, `Form`, etc.
`structure`	string	e.g. `"One record per visit per vital sign test per subject"`
`items`	`Item[]`	Inlined — full item objects, not references
`keySequence`	`Item[]`	OID references to items that form the sort/uniqueness key
`slices`	`ItemGroup[]`	Inlined sub-groups (parameter-specific specialisations)
`implementsConcept`	`ReifiedConcept` ref	Links to abstract biomedical concept
`applicableWhen`	`WhereClause[]` refs	When this group is in scope (OR logic across clauses)
`standard`	`Standard` ref	CDISC IG being implemented
`wasDerivedFrom`	ref	Template this group was derived from

Item key slots

Slot	Type	Notes
`OID`	string (required)	e.g. `"IT.VS.VSTESTCD"`
`name`	string	Variable name, e.g. `"VSTESTCD"`
`dataType`	`DataType` enum	Common values: `text`, `integer`, `float`, `date`, `datetime`, `time`, `boolean` (full list under "DataType enum" in section 8)
`length`	integer	Max character length
`codeList`	`CodeList` ref	Permissible value constraint
`method`	`Method` ref	Derivation procedure
`origin`	`Origin`	Source type and provenance
`rangeChecks`	`RangeCheck[]`	Edit checks / CORE rules
`conceptProperty`	`ConceptProperty` ref	Abstract property this item specialises
`applicableWhen`	`WhereClause[]` refs	Conditional applicability
`wasDerivedFrom`	ref	Template item this was derived from

Example: VS domain

{
  "OID": "IG.VS",
  "name": "VS",
  "label": "Vital Signs",
  "domain": "VS",
  "type": "Dataset",
  "structure": "One record per vital sign per visit per subject",
  "standard": "SDTMIG.v3.4",
  "keySequence": ["IT.VS.STUDYID", "IT.VS.USUBJID", "IT.VS.VSTESTCD", "IT.VS.VISITNUM"],
  "items": [
    {
      "OID": "IT.VS.STUDYID",
      "name": "STUDYID",
      "label": "Study Identifier",
      "dataType": "text",
      "length": 12,
      "origin": { "type": "Assigned" }
    },
    {
      "OID": "IT.VS.VSTESTCD",
      "name": "VSTESTCD",
      "label": "Vital Signs Test Short Name",
      "dataType": "text",
      "length": 8,
      "codeList": "CL.VSTESTCD",
      "origin": { "type": "Assigned" }
    },
    {
      "OID": "IT.VS.VSORRES",
      "name": "VSORRES",
      "label": "Result or Finding in Original Units",
      "dataType": "text",
      "length": 200,
      "origin": { "type": "Collected" },
      "method": null
    },
    {
      "OID": "IT.VS.VSORRESU",
      "name": "VSORRESU",
      "label": "Original Units",
      "dataType": "text",
      "length": 40,
      "codeList": "CL.UNIT",
      "origin": { "type": "Collected" }
    }
  ],
  "slices": [
    { "OID": "VL.VS.DIABP", "name": "VL.VS.DIABP", "type": "DatasetSpecialization" },
    { "OID": "VL.VS.SYSBP", "name": "VL.VS.SYSBP", "type": "DatasetSpecialization" }
  ]
}
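
Because keySequence holds OID strings while items are inlined, a simple consistency check is to confirm that every key OID matches an item in the same group. The sketch below assumes key items must live in the group's own items list, which the source does not state explicitly; `unresolved_key_items` is a hypothetical helper. Run against the abridged example above, it reports the two key variables whose item definitions were omitted for brevity.

```python
def unresolved_key_items(item_group: dict) -> list[str]:
    """Return keySequence OIDs with no matching inlined item in the group."""
    item_oids = {item["OID"] for item in item_group.get("items", [])}
    return [oid for oid in item_group.get("keySequence", []) if oid not in item_oids]

# Trimmed copy of the VS example: only OID and name are kept per item.
vs = {
    "OID": "IG.VS",
    "keySequence": ["IT.VS.STUDYID", "IT.VS.USUBJID", "IT.VS.VSTESTCD", "IT.VS.VISITNUM"],
    "items": [
        {"OID": "IT.VS.STUDYID", "name": "STUDYID"},
        {"OID": "IT.VS.VSTESTCD", "name": "VSTESTCD"},
        {"OID": "IT.VS.VSORRES", "name": "VSORRES"},
        {"OID": "IT.VS.VSORRESU", "name": "VSORRESU"},
    ],
}
print(unresolved_key_items(vs))  # ['IT.VS.USUBJID', 'IT.VS.VISITNUM']
```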

5. Slicing: Parameter-Specific Definitions

Slices let you attach parameter-specific metadata without duplicating the parent group definition. A slice is itself an ItemGroup with type: "DatasetSpecialization" and an applicableWhen that scopes it.

This is the key structural improvement over Define-XML v2.1's ValueList approach: rather than grouping by variable (VSORRES, VSORRESU), slices group by clinical parameter (DIABP, SYSBP), so each slice carries both the result and the unit for that parameter.

{
  "OID": "VL.VS.DIABP",
  "name": "VL.VS.DIABP",
  "label": "Diastolic Blood Pressure",
  "type": "DatasetSpecialization",
  "domain": "VS",
  "applicableWhen": ["WC.VS.DIABP"],
  "items": [
    {
      "OID": "IT.VS.DIABP.VSORRES",
      "name": "VSORRES",
      "label": "Diastolic BP Result",
      "dataType": "float",
      "rangeChecks": [
        {
          "item": "IT.VS.DIABP.VSORRES",
          "comparator": "GE",
          "checkValues": ["0"],
          "softHard": "Soft"
        },
        {
          "item": "IT.VS.DIABP.VSORRES",
          "comparator": "LE",
          "checkValues": ["300"],
          "softHard": "Hard"
        }
      ]
    },
    {
      "OID": "IT.VS.DIABP.VSORRESU",
      "name": "VSORRESU",
      "label": "Diastolic BP Units",
      "dataType": "text",
      "codeList": "CL.MMHG_ONLY"
    }
  ]
}

The WC.VS.DIABP where-clause restricts this slice to rows where VSTESTCD = "DIABP":

{
  "OID": "WC.VS.DIABP",
  "name": "WC.VS.DIABP",
  "conditions": ["COND.VS.DIABP"]
}
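
To make the by-variable versus by-parameter contrast concrete, the sketch below regroups a flat list of value-level entries into one bundle per parameter. The input shape (dicts with `variable`, `parameter`, and `dataType` keys) is a simplified, hypothetical stand-in and not the actual Define-XML ValueList structure.

```python
from collections import defaultdict

def group_by_parameter(entries: list[dict]) -> dict[str, list[dict]]:
    """Group value-level entries so each parameter carries all of its variables."""
    slices = defaultdict(list)
    for entry in entries:
        slices[entry["parameter"]].append(
            {"name": entry["variable"], "dataType": entry["dataType"]}
        )
    return dict(slices)

# By-variable layout: one entry per (variable, parameter) pair.
value_level = [
    {"variable": "VSORRES", "parameter": "DIABP", "dataType": "float"},
    {"variable": "VSORRES", "parameter": "SYSBP", "dataType": "float"},
    {"variable": "VSORRESU", "parameter": "DIABP", "dataType": "text"},
    {"variable": "VSORRESU", "parameter": "SYSBP", "dataType": "text"},
]
by_param = group_by_parameter(value_level)
print(sorted(by_param))                        # ['DIABP', 'SYSBP']
print([i["name"] for i in by_param["DIABP"]])  # ['VSORRES', 'VSORRESU']
```

Each resulting bundle corresponds to what a slice carries: the result and the unit definitions for a single parameter.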

6. Conditions, WhereClauses, and RangeChecks

The three-level logic hierarchy

WhereClause          ← named context; referenced by items/groups via applicableWhen
  └── Condition[]    ← combined with AND within the clause
        ├── RangeCheck[]    ← simple value comparisons (EQ, NE, IN, GE, LE, etc.)
        ├── Condition[]     ← nested sub-conditions (recursive, for complex logic)
        └── FormalExpression[]  ← executable code for complex cases

Multiple WhereClause references on the same element are combined with OR logic: "applies when ANY of these clauses matches". Within a clause, the referenced Condition objects combine with AND. Inside each Condition, the operator slot (AND or OR) controls how its rangeChecks and nested conditions combine.

Condition example

{
  "OID": "COND.VS.DIABP",
  "name": "COND.VS.DIABP",
  "operator": "AND",
  "rangeChecks": [
    {
      "item": "IT.VS.VSTESTCD",
      "comparator": "EQ",
      "checkValues": ["DIABP"],
      "softHard": "Hard"
    }
  ]
}

Comparator enum values

EQ, NE, LT, LE, GT, GE, IN, NOTIN

SoftHard semantics

Value	Meaning
`Hard`	Error — data is invalid if check fails
`Soft`	Warning — data is unusual but not necessarily wrong

Nested condition example (compound logic)

{
  "OID": "COND.ADVERSE_SERIOUS",
  "name": "COND.ADVERSE_SERIOUS",
  "operator": "AND",
  "conditions": [
    {
      "OID": "COND.AESER_YES",
      "operator": "AND",
      "rangeChecks": [
        { "item": "IT.AE.AESER", "comparator": "EQ", "checkValues": ["Y"] }
      ]
    },
    {
      "OID": "COND.AESEV_OR",
      "operator": "OR",
      "rangeChecks": [
        { "item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["SEVERE"] },
        { "item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["LIFE THREATENING"] }
      ]
    }
  ]
}
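
When reviewing compound logic it can help to print a condition as a single boolean expression. The sketch below is one way to do that; `render_condition` is a hypothetical helper and assumes nested conditions are inlined as objects, as in the example above.

```python
def render_condition(condition: dict) -> str:
    """Render a Condition as a parenthesised boolean expression string."""
    operator = condition.get("operator", "AND")
    parts = []
    # Each range check becomes "<item OID> <comparator> <values>".
    for rc in condition.get("rangeChecks", []):
        values = ", ".join(repr(v) for v in rc["checkValues"])
        parts.append(f"{rc['item']} {rc['comparator']} {values}")
    # Nested conditions are rendered recursively.
    for sub in condition.get("conditions", []):
        parts.append(render_condition(sub))
    return "(" + f" {operator} ".join(parts) + ")"

serious = {
    "OID": "COND.ADVERSE_SERIOUS",
    "operator": "AND",
    "conditions": [
        {"OID": "COND.AESER_YES", "operator": "AND", "rangeChecks": [
            {"item": "IT.AE.AESER", "comparator": "EQ", "checkValues": ["Y"]}]},
        {"OID": "COND.AESEV_OR", "operator": "OR", "rangeChecks": [
            {"item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["SEVERE"]},
            {"item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["LIFE THREATENING"]}]},
    ],
}
print(render_condition(serious))
# ((IT.AE.AESER EQ 'Y') AND (IT.AE.AESEV EQ 'SEVERE' OR IT.AE.AESEV EQ 'LIFE THREATENING'))
```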

7. Methods, FormalExpressions, and Analysis

Method

A Method is a reusable derivation procedure. Items reference methods via method: "MT.CALC_BMI".

{
  "OID": "MT.CALC_BMI",
  "name": "MT.CALC_BMI",
  "label": "Calculate BMI",
  "type": "Computation",
  "expressions": [
    {
      "OID": "FE.CALC_BMI.SAS",
      "context": "SAS",
      "expression": "VSSTRESN = (WEIGHT_KG / (HEIGHT_M ** 2))",
      "returnType": "float",
      "parameters": [
        {
          "OID": "PARAM.WEIGHT_KG",
          "name": "WEIGHT_KG",
          "dataType": "float",
          "required": true
        },
        {
          "OID": "PARAM.HEIGHT_M",
          "name": "HEIGHT_M",
          "dataType": "float",
          "required": true
        }
      ]
    },
    {
      "OID": "FE.CALC_BMI.PYTHON",
      "context": "Python",
      "expression": "bmi = weight_kg / (height_m ** 2)",
      "returnType": "float"
    }
  ]
}
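
Since a Method can carry one expression per execution context, consumers usually need to pick the one for their own language. A minimal sketch follows; `select_expression` is a hypothetical helper, and the case-insensitive match on `context` is an assumption made for convenience.

```python
def select_expression(method: dict, context: str):
    """Return the first FormalExpression whose context matches, else None."""
    for expr in method.get("expressions", []):
        if expr.get("context", "").lower() == context.lower():
            return expr
    return None

# Trimmed copy of the MT.CALC_BMI method shown above.
bmi_method = {
    "OID": "MT.CALC_BMI",
    "expressions": [
        {"OID": "FE.CALC_BMI.SAS", "context": "SAS",
         "expression": "VSSTRESN = (WEIGHT_KG / (HEIGHT_M ** 2))"},
        {"OID": "FE.CALC_BMI.PYTHON", "context": "Python",
         "expression": "bmi = weight_kg / (height_m ** 2)"},
    ],
}
print(select_expression(bmi_method, "python")["OID"])  # FE.CALC_BMI.PYTHON
print(select_expression(bmi_method, "R"))              # None
```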

Analysis

Analysis extends Method with study-specific traceability fields. Use it when you need to document why and from what an analysis was run, not just how.

{
  "OID": "AN.SUMMARY_VS",
  "name": "AN.SUMMARY_VS",
  "label": "Vital Signs Summary Statistics",
  "type": "Computation",
  "analysisReason": "Primary Efficacy",
  "analysisPurpose": "Exploratory",
  "inputData": ["IG.VS", "VL.VS.DIABP"],
  "expressions": [
    {
      "OID": "FE.SUMMARY_VS.R",
      "context": "R",
      "expression": "vs_summary <- vs_data %>% group_by(VSTESTCD, VISIT) %>% summarise(n=n(), mean=mean(VSSTRESN, na.rm=TRUE), sd=sd(VSSTRESN, na.rm=TRUE))"
    }
  ]
}

inputData accepts OIDs of ItemGroup or slice objects — make sure every referenced Item (e.g. analysis variables passed as Parameter) has its parent ItemGroup listed here.

8. CodeLists and Controlled Vocabulary

Inline CodeList

{
  "OID": "CL.VSTESTCD",
  "name": "VSTESTCD",
  "label": "Vital Signs Test Code",
  "dataType": "text",
  "standard": "CDISC/NCI",
  "codeListItems": [
    {
      "codedValue": "DIABP",
      "decode": "Diastolic Blood Pressure",
      "coding": [{ "code": "C25299", "codeSystem": "NCI", "codeSystemVersion": "2023-09-25" }]
    },
    {
      "codedValue": "SYSBP",
      "decode": "Systolic Blood Pressure",
      "coding": [{ "code": "C25298", "codeSystem": "NCI", "codeSystemVersion": "2023-09-25" }]
    },
    {
      "codedValue": "TEMP",
      "decode": "Temperature",
      "weight": 3.0,
      "coding": [{ "code": "C25206", "codeSystem": "NCI", "codeSystemVersion": "2023-09-25" }]
    }
  ]
}

External CodeList reference

When the full enumeration lives in an external system (MedDRA, SNOMED, LOINC), use externalCodeList instead of codeListItems:

{
  "OID": "CL.MEDDRA_PT",
  "name": "MEDDRA_PT",
  "label": "MedDRA Preferred Terms",
  "dataType": "text",
  "externalCodeList": {
    "OID": "RES.MEDDRA",
    "name": "MedDRA",
    "href": "https://www.meddra.org",
    "version": "26.1"
  }
}
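
A validator has to treat the two forms differently: an inline list can be checked locally, while an external one cannot. The sketch below shows one way to express that; `allowed_values` is a hypothetical helper that returns None to signal "enumeration not available locally".

```python
def allowed_values(code_list: dict):
    """Return the set of coded values, or None if the list is external."""
    if "externalCodeList" in code_list:
        return None  # the enumeration lives in the external dictionary
    return {item["codedValue"] for item in code_list.get("codeListItems", [])}

inline_cl = {"OID": "CL.VSTESTCD", "codeListItems": [
    {"codedValue": "DIABP"}, {"codedValue": "SYSBP"}, {"codedValue": "TEMP"}]}
external_cl = {"OID": "CL.MEDDRA_PT",
               "externalCodeList": {"OID": "RES.MEDDRA", "version": "26.1"}}

print(sorted(allowed_values(inline_cl)))  # ['DIABP', 'SYSBP', 'TEMP']
print(allowed_values(external_cl))        # None
```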

DataType enum

text, integer, float, double, decimal, date, time, datetime, dateTime, boolean, base64Binary, hexBinary, anyURI

9. Semantic Layer: Coding and ReifiedConcept

This is what separates Define-JSON from a pure structural schema. Every element can be anchored to ontologies; datasets can declare which abstract biomedical concept they implement.

Coding

Attach standardised semantic tags to any element using the coding slot:

{
  "OID": "IT.VS.VSORRES",
  "name": "VSORRES",
  "coding": [
    {
      "code": "C25712",
      "codeSystem": "NCI",
      "codeSystemVersion": "2023-09-25",
      "decode": "Result",
      "aliasType": "SameAs"
    },
    {
      "code": "8480-6",
      "codeSystem": "LOINC",
      "codeSystemVersion": "2.76",
      "decode": "Systolic blood pressure",
      "aliasType": "NarrowMatch"
    }
  ]
}

aliasType (the AliasPredicate enum) controls the relationship semantics: SameAs, BroadMatch, NarrowMatch, RelatedMatch, Implements, IsA.
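
Downstream tools often need just the codes from one code system, optionally restricted to a particular relationship. The following sketch filters the coding slot accordingly; `codes_for` is a hypothetical helper, not part of any Define-JSON library.

```python
def codes_for(element: dict, code_system: str, alias_type=None) -> list[str]:
    """Return codes from element['coding'] for one code system.

    If alias_type is given, only codings with that aliasType are kept.
    """
    return [
        c["code"]
        for c in element.get("coding", [])
        if c.get("codeSystem") == code_system
        and (alias_type is None or c.get("aliasType") == alias_type)
    ]

vsorres = {
    "OID": "IT.VS.VSORRES",
    "coding": [
        {"code": "C25712", "codeSystem": "NCI", "aliasType": "SameAs"},
        {"code": "8480-6", "codeSystem": "LOINC", "aliasType": "NarrowMatch"},
    ],
}
print(codes_for(vsorres, "LOINC"))               # ['8480-6']
print(codes_for(vsorres, "NCI", "SameAs"))       # ['C25712']
print(codes_for(vsorres, "NCI", "NarrowMatch"))  # []
```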

ReifiedConcept

ReifiedConcept makes an abstract concept — e.g. "Diastolic Blood Pressure" as defined in the CDISC Biomedical Concept model — explicit and referenceable. ItemGroups and Methods then declare that they implement it.

{
  "OID": "BC.DIABP",
  "name": "DiastolicBloodPressure",
  "label": "Diastolic Blood Pressure",
  "href": "https://library.cdisc.org/api/cosmos/v2/bc/C25299",
  "coding": [
    { "code": "C25299", "codeSystem": "NCI", "decode": "Diastolic Blood Pressure" }
  ],
  "properties": [
    {
      "OID": "BCP.DIABP.RESULT",
      "name": "result",
      "label": "Result Value",
      "minOccurs": 1,
      "maxOccurs": 1,
      "codeList": null
    },
    {
      "OID": "BCP.DIABP.UNIT",
      "name": "unit",
      "label": "Unit of Measure",
      "minOccurs": 1,
      "maxOccurs": 1,
      "codeList": "CL.MMHG_ONLY"
    }
  ]
}

An ItemGroup then declares "implementsConcept": "BC.DIABP" and each Item declares "conceptProperty": "BCP.DIABP.RESULT" — forming a typed, verifiable link from concrete implementation to abstract definition.
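
The link is only verifiable if something checks it. As a sketch, the function below reports items whose conceptProperty is not one of the properties of the concept their group implements. `check_concept_links` is a hypothetical helper, and the mistyped `BCP.SYSBP.UNIT` reference in the sample data is deliberate.

```python
def check_concept_links(item_group: dict, concepts: list[dict]) -> list[str]:
    """Return OIDs of items whose conceptProperty is not defined on the
    concept that the group declares via implementsConcept."""
    by_oid = {c["OID"]: c for c in concepts}
    concept = by_oid.get(item_group.get("implementsConcept"))
    if concept is None:
        return []  # no resolvable concept, so nothing to check against
    valid = {p["OID"] for p in concept.get("properties", [])}
    return [
        item["OID"]
        for item in item_group.get("items", [])
        if "conceptProperty" in item and item["conceptProperty"] not in valid
    ]

concepts = [{"OID": "BC.DIABP", "properties": [
    {"OID": "BCP.DIABP.RESULT"}, {"OID": "BCP.DIABP.UNIT"}]}]
diabp_slice = {
    "OID": "VL.VS.DIABP",
    "implementsConcept": "BC.DIABP",
    "items": [
        {"OID": "IT.VS.DIABP.VSORRES", "conceptProperty": "BCP.DIABP.RESULT"},
        {"OID": "IT.VS.DIABP.VSORRESU", "conceptProperty": "BCP.SYSBP.UNIT"},
    ],
}
print(check_concept_links(diabp_slice, concepts))  # ['IT.VS.DIABP.VSORRESU']
```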

10. Data Products and Dataflows

For pipeline and data contract use cases, DataProduct and Dataflow express the supply/demand boundary.

Dataflow: the abstract contract

A Dataflow declares what structure is expected — before any concrete data exists. Think of it as an interface definition.

{
  "OID": "DF.VS_TRANSFER",
  "name": "DF.VS_TRANSFER",
  "label": "Vital Signs Transfer Agreement",
  "structure": "IG.VS",
  "dimensionConstraint": ["IT.VS.USUBJID", "IT.VS.VSTESTCD", "IT.VS.VISITNUM"],
  "version": "1.0"
}

DataProduct: the governed package

{
  "OID": "DP.CLINICAL_DATA_V1",
  "name": "ClinicalDataPackage",
  "label": "Clinical Data Package v1",
  "dataProductOwner": "Data Management Team",
  "lifecycleStatus": "Active",
  "domain": "SDTM",
  "inputDataflow": ["DF.VS_TRANSFER", "DF.LB_TRANSFER"],
  "outputDataset": ["DS.VS_FINAL", "DS.LB_FINAL"],
  "outputPort": [
    {
      "OID": "SVC.FHIR_API",
      "name": "FHIR R4 API",
      "protocol": "HTTPS",
      "resourceType": "HL7-FHIR",
      "href": "https://api.example.com/fhir/r4",
      "securitySchemaType": "OAuth2"
    }
  ]
}

Dataset: a concrete delivery

{
  "OID": "DS.VS_FINAL",
  "name": "vs_final.xpt",
  "structuredBy": "IG.VS",
  "describedBy": "DF.VS_TRANSFER",
  "conformsTo": "SDTM v1.8",
  "dataExtractionDate": "2024-01-10",
  "validFrom": "2023-01-01",
  "validTo": "2023-12-31",
  "distribution": [
    {
      "format": "application/x-xpt",
      "accessService": {
        "OID": "SVC.SFTP",
        "protocol": "SFTP",
        "href": "sftp://transfers.example.com/sdtm/"
      }
    }
  ]
}
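
In the examples above, the Dataset's structuredBy and the structure of the Dataflow it is describedBy both point at IG.VS. Whether the schema requires them to agree is not stated here, so treat the following as an optional lint under that assumption. `structure_mismatches` and the `DS.VS_DRAFT` dataset are hypothetical.

```python
def structure_mismatches(datasets: list[dict], dataflows: list[dict]) -> list[str]:
    """Return OIDs of datasets whose structuredBy differs from the
    structure declared by the dataflow named in describedBy."""
    flows = {df["OID"]: df for df in dataflows}
    bad = []
    for ds in datasets:
        flow = flows.get(ds.get("describedBy"))
        # Datasets with no resolvable dataflow are skipped, not flagged.
        if flow is not None and flow.get("structure") != ds.get("structuredBy"):
            bad.append(ds["OID"])
    return bad

dataflows = [{"OID": "DF.VS_TRANSFER", "structure": "IG.VS"}]
datasets = [
    {"OID": "DS.VS_FINAL", "structuredBy": "IG.VS", "describedBy": "DF.VS_TRANSFER"},
    {"OID": "DS.VS_DRAFT", "structuredBy": "IG.LB", "describedBy": "DF.VS_TRANSFER"},
]
print(structure_mismatches(datasets, dataflows))  # ['DS.VS_DRAFT']
```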

11. Versioning and Provenance

The copy-and-link model

Each MetaDataVersion is a complete, immutable snapshot. Derivation is tracked via wasDerivedFrom, which accepts OID strings or typed object references.

{
  "OID": "MDV.STUDY-001.v2",
  "wasDerivedFrom": "MDV.STUDY-001.v1",
  "creationDateTime": "2024-06-01T09:00:00",
  "studyOID": "STUDY.001",
  "itemGroups": [
    {
      "OID": "IG.VS.v2",
      "wasDerivedFrom": "IG.VS.v1",
      "items": [
        {
          "OID": "IT.VS.VSORRES.v2",
          "wasDerivedFrom": "IT.VS.VSORRES.v1",
          "name": "VSORRES",
          "dataType": "float"
        }
      ]
    }
  ]
}

This pattern supports:

Template reuse: a study-specific ItemGroup derives from a CDISC standard template
Study amendment tracking: each protocol amendment creates a new MetaDataVersion linked to the prior one
Cross-study comparison: shared wasDerivedFrom ancestry identifies equivalent variables across studies
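
Tracing ancestry amounts to following wasDerivedFrom links until none remain. The sketch below assumes the links have already been collected into a flat mapping of OID to parent OID (plain strings, not typed object references); `ancestry` and the `.v3` OID are hypothetical.

```python
def ancestry(oid: str, derived_from: dict[str, str]) -> list[str]:
    """Follow wasDerivedFrom links from `oid` back to its oldest ancestor."""
    chain = [oid]
    seen = {oid}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:  # guard against accidental cycles
            break
        chain.append(parent)
        seen.add(parent)
    return chain

links = {
    "IT.VS.VSORRES.v3": "IT.VS.VSORRES.v2",
    "IT.VS.VSORRES.v2": "IT.VS.VSORRES.v1",
}
print(ancestry("IT.VS.VSORRES.v3", links))
# ['IT.VS.VSORRES.v3', 'IT.VS.VSORRES.v2', 'IT.VS.VSORRES.v1']
```

Two variables from different studies can then be compared by checking whether their chains share an ancestor.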

Item-level origin tracking

{
  "OID": "IT.VS.VSDY",
  "name": "VSDY",
  "label": "Study Day of Vital Signs",
  "dataType": "integer",
  "origin": {
    "type": "Derived",
    "source": "Sponsor",
    "sourceItems": [
      {
        "item": "IT.VS.VSDTC",
        "resource": ["IG.VS"],
        "document": null
      },
      {
        "item": "IT.DM.RFSTDTC",
        "resource": ["IG.DM"]
      }
    ]
  },
  "method": "MT.CALC_STUDY_DAY"
}

OriginType values: Collected, Derived, Assigned, Protocol, eDT
OriginSource values: Investigator, Sponsor, Subject, Vendor
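
The sourceItems entries are enough to build a variable-level lineage graph. A minimal sketch, using a hypothetical `lineage_edges` helper, extracts (source, target) pairs from a single item:

```python
def lineage_edges(item: dict) -> list[tuple[str, str]]:
    """Return (source_item_oid, derived_item_oid) pairs from an item's origin."""
    origin = item.get("origin") or {}
    return [
        (src["item"], item["OID"])
        for src in origin.get("sourceItems", [])
        if src.get("item")
    ]

vsdy = {
    "OID": "IT.VS.VSDY",
    "origin": {
        "type": "Derived",
        "sourceItems": [
            {"item": "IT.VS.VSDTC", "resource": ["IG.VS"]},
            {"item": "IT.DM.RFSTDTC", "resource": ["IG.DM"]},
        ],
    },
}
print(lineage_edges(vsdy))
# [('IT.VS.VSDTC', 'IT.VS.VSDY'), ('IT.DM.RFSTDTC', 'IT.VS.VSDY')]
```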

12. XML ↔ JSON Conversion

Installation

git clone https://github.com/TeMeta/define-json.git
cd define-json
pip install poetry
poetry install

Python API

from src.define_json.converters.xml_to_json import PortableDefineXMLToJSONConverter
from src.define_json.converters.json_to_xml import DefineJSONToXMLConverter
from pathlib import Path

# Define-XML → Define-JSON
xml_converter = PortableDefineXMLToJSONConverter()
json_data = xml_converter.convert_file(
    Path('data/define-360i.xml'),
    Path('data/define-360i.json')
)

# Define-JSON → Define-XML
# Use a separate name so the XML → JSON converter above is not shadowed
json_converter = DefineJSONToXMLConverter()
xml_root = json_converter.convert_file(
    Path('data/define-360i.json'),
    Path('data/define-360i-recreated.xml')
)

CLI

# XML → JSON
poetry run python -m define_json xml2json data/define.xml data/output.json

# JSON → XML
poetry run python -m define_json json2xml data/input.json data/output.xml

# HTML rendering (no CORS issues for browser viewing)
poetry run python -m define_json json2html input.json output.html

# Roundtrip validation
poetry run python -m define_json roundtrip data/original.xml

# Schema validation
poetry run python -m define_json validate data/input.json

What the converter improves over raw Define-XML

Aspect	Define-XML v2.1	Define-JSON
ValueList grouping	By variable (VSORRES, VSORRESU)	By parameter (DIABP, SYSBP, TEMP) — clinically meaningful
WhereClause deduplication	Separate WC per variable per parameter	Shared WC per parameter — 27 → 14 in the 360i sample
JSON size	N/A	~33% smaller than source XML (98KB → 66KB)
Reference model	Nested XML with repeated attribute markup	Flat JSON with OID references

13. Reverse Engineering Metadata from Data

Generate a Define-JSON skeleton from existing Dataset-JSON files:

python scripts/reverse_engineer_define.py examples/sample_dataset_lb.json

This produces four output files:

File	Contents
`define_metadata.json`	Inferred Define-JSON structure (ItemGroups, Items, CodeLists)
`sdmx_policy_suggestion.yaml`	Suggested SDMX dimension/measure assignments
`analysis_summary.json`	Per-variable statistics and confidence scores for data type inference
`reverse_engineering_report.md`	Human-readable audit of the inference process

Python API for reverse engineering

import json
from pathlib import Path

# Load Dataset-JSON
with open('examples/sample_dataset_lb.json') as f:
    dataset = json.load(f)

# The reverse engineering script outputs structured metadata
# that can be post-processed:
with open('define_metadata.json') as f:
    metadata = json.load(f)

# Iterate inferred items
for item_group in metadata.get('itemGroups', []):
    print(f"Dataset: {item_group['name']}")
    for item in item_group.get('items', []):
        print(f"  {item['name']}: {item['dataType']} (confidence: {item.get('confidence', 'n/a')})")

14. Working with the Model in Python

Loading and traversing a Define-JSON file

import json

with open('data/define-360i.json') as f:
    mdv = json.load(f)

# Build an OID lookup for O(1) cross-reference resolution
oid_index = {}

def index_group(ig):
    """Index an ItemGroup, its inlined items, and (recursively) its slices."""
    oid_index[ig['OID']] = ig
    for item in ig.get('items', []):
        oid_index[item['OID']] = item
    for slice_group in ig.get('slices', []):
        index_group(slice_group)

for ig in mdv.get('itemGroups', []):
    index_group(ig)

# Top-level collections that other objects reference by OID
for key in ('codeLists', 'methods', 'conditions', 'whereClauses'):
    for obj in mdv.get(key, []):
        oid_index[obj['OID']] = obj

# Resolve an item's code list
def get_codelist(item):
    cl_oid = item.get('codeList')
    if not cl_oid:
        return None
    return oid_index.get(cl_oid)

# Example: find all items with a specific data type
float_items = [
    (ig['name'], item['name'])
    for ig in mdv.get('itemGroups', [])
    for item in ig.get('items', [])
    if item.get('dataType') == 'float'
]

Resolving applicableWhen logic

def resolve_condition(ref):
    """Return a Condition object from an OID string or an already-inlined dict."""
    if isinstance(ref, dict):
        return ref
    if ref in oid_index:
        return oid_index[ref]
    # Fall back to the top-level list in case conditions were not indexed
    for cond in mdv.get('conditions', []):
        if cond['OID'] == ref:
            return cond
    raise KeyError(f"Unknown condition OID: {ref}")

def evaluate_where_clause(wc_oid, row: dict) -> bool:
    """Evaluate a WhereClause against a data row. Returns True if all conditions pass."""
    wc = oid_index[wc_oid]
    # WhereClause.conditions holds OID strings, so resolve each before evaluating
    conditions = [resolve_condition(ref) for ref in wc.get('conditions', [])]
    # Within a WhereClause, all conditions must be true (AND logic)
    return all(evaluate_condition(cond, row) for cond in conditions)

def evaluate_condition(condition: dict, row: dict) -> bool:
    operator = condition.get('operator', 'AND')
    range_checks = condition.get('rangeChecks', [])
    sub_conditions = condition.get('conditions', [])

    results = []
    for rc in range_checks:
        item_oid = rc['item']
        item = oid_index.get(item_oid, {})
        value = row.get(item.get('name', ''))
        results.append(evaluate_range_check(rc, value))
    for sub in sub_conditions:
        # Nested conditions may be inlined objects or OID strings
        results.append(evaluate_condition(resolve_condition(sub), row))

    if operator == 'AND':
        return all(results)
    elif operator == 'OR':
        return any(results)
    return False

def evaluate_range_check(rc: dict, value) -> bool:
    comparator = rc['comparator']
    check_values = rc['checkValues']
    if comparator in ('EQ', 'IN'):
        return str(value) in check_values
    if comparator in ('NE', 'NOTIN'):
        return str(value) not in check_values
    # Ordered comparators need numeric operands; missing or
    # non-numeric data fails the check instead of raising
    try:
        actual, bound = float(value), float(check_values[0])
    except (TypeError, ValueError):
        return False
    if comparator == 'GE':
        return actual >= bound
    elif comparator == 'LE':
        return actual <= bound
    elif comparator == 'GT':
        return actual > bound
    elif comparator == 'LT':
        return actual < bound
    return False

# Determine which slice applies to a given row
def get_applicable_slice(item_group: dict, row: dict):
    for slice_group in item_group.get('slices', []):
        applicable_when = slice_group.get('applicableWhen', [])
        # OR logic: row matches if ANY where-clause matches
        if any(evaluate_where_clause(wc_oid, row) for wc_oid in applicable_when):
            return slice_group
    return None

Validating data against a Define-JSON definition

def validate_row(row: dict, item_group: dict) -> list[dict]:
    """Returns a list of validation failures for a data row."""
    failures = []

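    # Determine the applicable slice (if any)
    slice_group = get_applicable_slice(item_group, row)
    # Start from the domain-level items, keyed by variable name
    items_by_name = {item['name']: item for item in item_group.get('items', [])}
    if slice_group:
        # Slice items override same-named domain items and add any new ones
        items_by_name.update({item['name']: item for item in slice_group.get('items', [])})
    items_to_check = list(items_by_name.values())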

    for item in items_to_check:
        value = row.get(item['name'])
        item_name = item['name']

        # Code list check
        cl = get_codelist(item)
        if cl and value is not None:
            allowed = {i['codedValue'] for i in cl.get('codeListItems', [])}
            if allowed and str(value) not in allowed:
                failures.append({
                    'item': item_name,
                    'severity': 'Hard',
                    'message': f"Value '{value}' not in code list {cl['OID']}"
                })

        # Range checks
        for rc in item.get('rangeChecks', []):
            if value is not None and not evaluate_range_check(rc, value):
                failures.append({
                    'item': item_name,
                    'severity': rc.get('softHard', 'Hard'),
                    'message': f"Range check failed: {rc['comparator']} {rc['checkValues']}"
                })

    return failures

15. OID Reference Patterns

OIDs are the primary key mechanism. The schema uses CDISC conventions for regulatory submissions but allows any string for internal use. Recommended patterns:

Object type	Pattern	Example
MetaDataVersion	`MDV.<study>.<version>`	`MDV.LZZT.v1`
ItemGroup	`IG.<domain>`	`IG.VS`
ItemGroup slice	`VL.<domain>.<param>`	`VL.VS.DIABP`
Item	`IT.<domain>.<varname>`	`IT.VS.VSTESTCD`
CodeList	`CL.<name>`	`CL.VSTESTCD`
Method	`MT.<name>`	`MT.CALC_BMI`
Analysis	`AN.<name>`	`AN.SUMMARY_VS`
WhereClause	`WC.<domain>.<param>`	`WC.VS.DIABP`
Condition	`COND.<name>`	`COND.DIABP_FILTER`
FormalExpression	`FE.<method>.<context>`	`FE.CALC_BMI.SAS`
ReifiedConcept	`BC.<name>`	`BC.DIABP`
Dataflow	`DF.<name>`	`DF.VS_TRANSFER`
Dataset	`DS.<name>`	`DS.VS_FINAL`
DataProduct	`DP.<name>`	`DP.CLINICAL_DATA_V1`
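
Because the schema accepts any string as an OID, these patterns are conventions and not constraints. If a team wants to enforce them, a lint along the following lines is one option. It is a sketch: `nonconforming_oids` is a hypothetical helper, it checks only the prefix of objects in a few top-level collections, and the `VitalSigns` OID is a deliberately bad example.

```python
import re

# Recommended prefixes from the table above, keyed by top-level collection name.
OID_PREFIXES = {
    "itemGroups": "IG",
    "codeLists": "CL",
    "methods": "MT",
    "analyses": "AN",
    "whereClauses": "WC",
    "conditions": "COND",
    "concepts": "BC",
}

def nonconforming_oids(mdv: dict) -> list[str]:
    """Return OIDs of top-level objects that do not follow `<PREFIX>.<rest>`."""
    bad = []
    for collection, prefix in OID_PREFIXES.items():
        pattern = re.compile(rf"^{re.escape(prefix)}\.\S+$")
        for obj in mdv.get(collection, []):
            if not pattern.match(obj.get("OID", "")):
                bad.append(obj.get("OID", ""))
    return bad

sample = {
    "itemGroups": [{"OID": "IG.VS"}, {"OID": "VitalSigns"}],
    "codeLists": [{"OID": "CL.UNIT"}],
    "methods": [{"OID": "MT.CALC_BMI"}],
}
print(nonconforming_oids(sample))  # ['VitalSigns']
```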

Cross-reference inlining rules

Understanding when objects are inlined vs. referenced by OID string is essential for correct parsing:

Slot	Inlined?	Notes
`MetaDataVersion.itemGroups`	✅ Yes	Full objects embedded
`ItemGroup.items`	✅ Yes	Full objects embedded
`ItemGroup.slices`	✅ Yes	Full objects embedded
`MetaDataVersion.codeLists`	✅ Yes	Full objects embedded
`MetaDataVersion.conditions`	✅ Yes	Full objects embedded
`MetaDataVersion.whereClauses`	✅ Yes	Full objects embedded
`Item.codeList`	❌ OID ref	String pointing to `MetaDataVersion.codeLists[].OID`
`Item.method`	❌ OID ref	String pointing to `MetaDataVersion.methods[].OID`
`Item.applicableWhen`	❌ OID ref list	Strings pointing to `whereClauses[].OID`
`ItemGroup.applicableWhen`	❌ OID ref list	Strings pointing to `whereClauses[].OID`
`ItemGroup.implementsConcept`	❌ OID ref	String pointing to `concepts[].OID`
`MetaDataVersion.concepts`	❌ OID ref list	Not inlined — referenced by OID from `itemGroups`

This means when building an index, you must traverse both inlined and top-level lists. The concepts array is notably not inlined despite being a top-level collection — it is always referenced by OID.
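
One way to avoid hard-coding which lists are inlined is to walk the whole document and index every object that carries an OID, then check that each OID-string slot resolves. The sketch below does this for the reference slots named in the table; `index_by_oid` and `dangling_refs` are hypothetical helpers, and if two objects share an OID the later one silently wins.

```python
def index_by_oid(node, index=None) -> dict:
    """Recursively index every dict that carries an 'OID' key."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        if "OID" in node:
            index[node["OID"]] = node
        for value in node.values():
            index_by_oid(value, index)
    elif isinstance(node, list):
        for value in node:
            index_by_oid(value, index)
    return index

# Slots that hold OID strings (single value or list), per the table above.
REF_SLOTS = ("codeList", "method", "applicableWhen", "implementsConcept")

def dangling_refs(mdv: dict) -> list[tuple[str, str, str]]:
    """Return (owner_oid, slot, target_oid) for references that do not resolve."""
    index = index_by_oid(mdv)
    missing = []
    for oid, obj in index.items():
        for slot in REF_SLOTS:
            value = obj.get(slot)
            targets = value if isinstance(value, list) else [value]
            for target in targets:
                # None and inlined objects are skipped; only strings are references.
                if isinstance(target, str) and target not in index:
                    missing.append((oid, slot, target))
    return missing

doc = {
    "OID": "MDV.LZZT.v1",
    "itemGroups": [{
        "OID": "IG.VS",
        "items": [{"OID": "IT.VS.VSTESTCD", "codeList": "CL.VSTESTCD"}],
        "slices": [{"OID": "VL.VS.DIABP", "applicableWhen": ["WC.VS.DIABP"]}],
    }],
    "codeLists": [{"OID": "CL.VSTESTCD"}],
    "whereClauses": [],
}
print(dangling_refs(doc))  # [('VL.VS.DIABP', 'applicableWhen', 'WC.VS.DIABP')]
```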
