pendingintent edited this page May 28, 2026 · 1 revision

Define-JSON: Technical Deep-Dive for Data Engineers

This guide is aimed at data engineers and Python programmers working with clinical trial data pipelines. It assumes familiarity with JSON schemas, data modelling concepts, and ideally some exposure to CDISC standards (though the latter is not required).


Table of Contents

  1. Architecture Overview
  2. Schema Foundation: LinkML
  3. The Root Object: MetaDataVersion
  4. ItemGroup and Item: The Data Layer
  5. Slicing: Parameter-Specific Definitions
  6. Conditions, WhereClauses, and RangeChecks
  7. Methods, FormalExpressions, and Analysis
  8. CodeLists and Controlled Vocabulary
  9. Semantic Layer: Coding and ReifiedConcept
  10. Data Products and Dataflows
  11. Versioning and Provenance
  12. XML ↔ JSON Conversion
  13. Reverse Engineering Metadata from Data
  14. Working with the Model in Python
  15. OID Reference Patterns

1. Architecture Overview

Define-JSON is a flat, reference-based JSON structure. Unlike deeply nested formats (e.g. XML hierarchies), most collections live at the top level of MetaDataVersion and reference each other by OID.

MetaDataVersion
├── itemGroups[]         ← datasets, FHIR profiles, concept templates
│   └── items[]          ← inlined within each group
├── items[]              ← top-level template items (no group)
├── codeLists[]          ← permissible value sets
├── conditions[]         ← reusable logical expressions
├── whereClauses[]       ← named data contexts (link conditions to structures)
├── methods[]            ← derivation procedures
├── analyses[]           ← analysis-specific method extensions
├── concepts[]           ← abstract biomedical concepts (ReifiedConcept)
├── codings[]            ← shared semantic tags
├── relationships[]      ← inter-element semantic links
├── standards[]          ← CDISC/external standard references
├── dictionaries[]       ← external code systems (MedDRA, SNOMED, etc.)
├── dataProducts[]       ← governed data product definitions
└── displays[]           ← rendered analysis outputs

Key design decision: Items within an ItemGroup are inlined (embedded objects), but cross-references between groups use OID strings. You can therefore read a group's items directly from the group object, whereas codeList, method, applicableWhen, and similar cross-cutting references must be resolved by OID against the top-level collections.


2. Schema Foundation: LinkML

The model is defined in LinkML, a YAML-based schema language that generates JSON Schema, OWL, Python dataclasses, and more. The source schema is at https://cdisc.org/define-json.

Key LinkML concepts used throughout:

  • is_a: single inheritance (e.g. ItemGroup is a GovernedElement)
  • mixins: multiple inheritance for cross-cutting concerns (e.g. IsProfile, IsODMStandard)
  • multivalued: true: the slot holds a list
  • inlined: true: the object is embedded, not referenced by ID
  • any_of: polymorphic ranges (e.g. owner can be a string, User, or Organization)
  • identifier: true: marks the primary key slot (OID on all Identifiable classes)

You can generate Python dataclasses from the schema:

pip install linkml
gen-python https://cdisc.org/define-json > define_json_classes.py

3. The Root Object: MetaDataVersion

MetaDataVersion is the tree root (LinkML tree_root: true) — the top-level object in any serialised Define-JSON document.

Required fields

| Field | Type | Description |
|---|---|---|
| OID | string | Local identifier. Use MDV.STUDY-001.v1 format for submissions |
| fileOID | string | Identifier for the ODM file |
| creationDateTime | datetime | ISO 8601 timestamp |
| odmVersion | string | e.g. "2.0" |
| fileType | string | "Snapshot" or "Transactional" |
| studyOID | string | Identifier for the parent study |

Minimal valid document

{
  "OID": "MDV.LZZT.v1",
  "fileOID": "ODM.LZZT",
  "creationDateTime": "2024-01-15T10:00:00",
  "odmVersion": "2.0",
  "fileType": "Snapshot",
  "studyOID": "STUDY.LZZT",
  "studyName": "LZZT Phase II",
  "itemGroups": [],
  "codeLists": [],
  "methods": [],
  "conditions": [],
  "whereClauses": []
}
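
A quick pre-flight check for the required fields in the table above might look like the following. This is a minimal sketch rather than a replacement for schema validation; the helper name `missing_required_fields` is hypothetical, and the field list is simply copied from the table.

```python
# Required MetaDataVersion fields, as listed in the table above
REQUIRED_FIELDS = ("OID", "fileOID", "creationDateTime", "odmVersion", "fileType", "studyOID")

def missing_required_fields(mdv: dict) -> list:
    """Return the required fields that are absent or empty (hypothetical helper)."""
    return [field for field in REQUIRED_FIELDS if not mdv.get(field)]

doc = {
    "OID": "MDV.LZZT.v1",
    "fileOID": "ODM.LZZT",
    "creationDateTime": "2024-01-15T10:00:00",
    "odmVersion": "2.0",
    "fileType": "Snapshot",
    "studyOID": "STUDY.LZZT",
}
print(missing_required_fields(doc))                     # nothing missing
print(missing_required_fields({"OID": "MDV.LZZT.v1"}))  # lists the other five fields
```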

Inheritance chain

MetaDataVersion
  is_a: GovernedElement
    mixins: [Identifiable, Labelled, Governed]
  mixins: [ODMFileMetadata, StudyMetadata]

This means MetaDataVersion picks up slots from all five classes: OID/uuid from Identifiable, name/label/description/coding/aliases from Labelled, mandatory/owner/purpose/lastUpdated/wasDerivedFrom/comments from Governed, ODM file fields from ODMFileMetadata, and study identification from StudyMetadata.
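
To see how the mixin chain accumulates slots, the hierarchy can be mimicked with plain Python dataclasses. This is an illustrative sketch only: the slot lists are abridged, the assignment of individual fields to ODMFileMetadata and StudyMetadata is an assumption based on the description above, and the classes generated from the real schema will differ.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Identifiable:
    OID: str = ""
    uuid: Optional[str] = None

@dataclass
class Labelled:
    name: Optional[str] = None
    label: Optional[str] = None
    description: Optional[str] = None

@dataclass
class Governed:
    mandatory: Optional[bool] = None
    owner: Optional[str] = None
    purpose: Optional[str] = None

@dataclass
class GovernedElement(Identifiable, Labelled, Governed):
    pass

@dataclass
class ODMFileMetadata:  # field placement assumed for illustration
    fileOID: str = ""
    creationDateTime: str = ""
    odmVersion: str = ""
    fileType: str = ""

@dataclass
class StudyMetadata:  # field placement assumed for illustration
    studyOID: str = ""
    studyName: Optional[str] = None

@dataclass
class MetaDataVersion(GovernedElement, ODMFileMetadata, StudyMetadata):
    pass

# Every slot declared on any ancestor is available on MetaDataVersion
slot_names = [f.name for f in fields(MetaDataVersion)]
print(sorted(slot_names))
```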


4. ItemGroup and Item: The Data Layer

ItemGroup maps to a dataset, FHIR resource profile, OMOP table, or form section depending on context. Item is a single variable/column within it.

ItemGroup key slots

| Slot | Type | Notes |
|---|---|---|
| OID | string (required) | Primary key, e.g. "IG.VS" |
| name | string | Short name, matches dataset name in data files |
| domain | string | CDISC domain abbreviation, e.g. "VS", "LB" |
| type | ItemGroupType enum | Dataset, DatasetSpecialization, FHIR, Form, etc. |
| structure | string | e.g. "One record per visit per vital sign test per subject" |
| items | Item[] | Inlined: full item objects, not references |
| keySequence | Item[] | OID references to items that form the sort/uniqueness key |
| slices | ItemGroup[] | Inlined sub-groups (parameter-specific specialisations) |
| implementsConcept | ReifiedConcept ref | Links to abstract biomedical concept |
| applicableWhen | WhereClause[] refs | When this group is in scope (OR logic across clauses) |
| standard | Standard ref | CDISC IG being implemented |
| wasDerivedFrom | ref | Template this group was derived from |

Item key slots

| Slot | Type | Notes |
|---|---|---|
| OID | string (required) | e.g. "IT.VS.VSTESTCD" |
| name | string | Variable name, e.g. "VSTESTCD" |
| dataType | DataType enum | text, integer, float, date, datetime, time, boolean |
| length | integer | Max character length |
| codeList | CodeList ref | Permissible value constraint |
| method | Method ref | Derivation procedure |
| origin | Origin | Source type and provenance |
| rangeChecks | RangeCheck[] | Edit checks / CORE rules |
| conceptProperty | ConceptProperty ref | Abstract property this item specialises |
| applicableWhen | WhereClause[] refs | Conditional applicability |
| wasDerivedFrom | ref | Template item this was derived from |

Example: VS domain

{
  "OID": "IG.VS",
  "name": "VS",
  "label": "Vital Signs",
  "domain": "VS",
  "type": "Dataset",
  "structure": "One record per vital sign per visit per subject",
  "standard": "SDTMIG.v3.4",
  "keySequence": ["IT.VS.STUDYID", "IT.VS.USUBJID", "IT.VS.VSTESTCD", "IT.VS.VISITNUM"],
  "items": [
    {
      "OID": "IT.VS.STUDYID",
      "name": "STUDYID",
      "label": "Study Identifier",
      "dataType": "text",
      "length": 12,
      "origin": { "type": "Assigned" }
    },
    {
      "OID": "IT.VS.VSTESTCD",
      "name": "VSTESTCD",
      "label": "Vital Signs Test Short Name",
      "dataType": "text",
      "length": 8,
      "codeList": "CL.VSTESTCD",
      "origin": { "type": "Assigned" }
    },
    {
      "OID": "IT.VS.VSORRES",
      "name": "VSORRES",
      "label": "Result or Finding in Original Units",
      "dataType": "text",
      "length": 200,
      "origin": { "type": "Collected" },
      "method": null
    },
    {
      "OID": "IT.VS.VSORRESU",
      "name": "VSORRESU",
      "label": "Original Units",
      "dataType": "text",
      "length": 40,
      "codeList": "CL.UNIT",
      "origin": { "type": "Collected" }
    }
  ],
  "slices": [
    { "OID": "VL.VS.DIABP", "name": "VL.VS.DIABP", "type": "DatasetSpecialization" },
    { "OID": "VL.VS.SYSBP", "name": "VL.VS.SYSBP", "type": "DatasetSpecialization" }
  ]
}
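
Because keySequence holds item OIDs while data rows are keyed by variable name, using it as a uniqueness key takes one translation step. The sketch below assumes every key item is inlined in the group's items list (the abridged example above omits USUBJID and VISITNUM, so it would raise a KeyError there); `duplicate_keys` is a hypothetical helper.

```python
def duplicate_keys(item_group: dict, rows: list) -> list:
    """Return the key tuples that occur more than once in rows."""
    # Translate key item OIDs into the variable names used in the data
    name_by_oid = {item["OID"]: item["name"] for item in item_group.get("items", [])}
    key_names = [name_by_oid[oid] for oid in item_group.get("keySequence", [])]
    seen, duplicates = set(), []
    for row in rows:
        key = tuple(row.get(name) for name in key_names)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates

group = {
    "OID": "IG.VS",
    "keySequence": ["IT.VS.USUBJID", "IT.VS.VSTESTCD", "IT.VS.VISITNUM"],
    "items": [
        {"OID": "IT.VS.USUBJID", "name": "USUBJID"},
        {"OID": "IT.VS.VSTESTCD", "name": "VSTESTCD"},
        {"OID": "IT.VS.VISITNUM", "name": "VISITNUM"},
    ],
}
rows = [
    {"USUBJID": "01-001", "VSTESTCD": "DIABP", "VISITNUM": 1},
    {"USUBJID": "01-001", "VSTESTCD": "SYSBP", "VISITNUM": 1},
    {"USUBJID": "01-001", "VSTESTCD": "DIABP", "VISITNUM": 1},  # repeats the first key
]
print(duplicate_keys(group, rows))
```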

5. Slicing: Parameter-Specific Definitions

Slices let you attach parameter-specific metadata without duplicating the parent group definition. A slice is itself an ItemGroup with type: "DatasetSpecialization" and an applicableWhen that scopes it.

This is the key structural improvement over Define-XML v2.1's ValueList approach: rather than grouping by variable (VSORRES, VSORRESU), slices group by clinical parameter (DIABP, SYSBP), so each slice carries both the result and the unit for that parameter.

{
  "OID": "VL.VS.DIABP",
  "name": "VL.VS.DIABP",
  "label": "Diastolic Blood Pressure",
  "type": "DatasetSpecialization",
  "domain": "VS",
  "applicableWhen": ["WC.VS.DIABP"],
  "items": [
    {
      "OID": "IT.VS.DIABP.VSORRES",
      "name": "VSORRES",
      "label": "Diastolic BP Result",
      "dataType": "float",
      "rangeChecks": [
        {
          "item": "IT.VS.DIABP.VSORRES",
          "comparator": "GE",
          "checkValues": ["0"],
          "softHard": "Soft"
        },
        {
          "item": "IT.VS.DIABP.VSORRES",
          "comparator": "LE",
          "checkValues": ["300"],
          "softHard": "Hard"
        }
      ]
    },
    {
      "OID": "IT.VS.DIABP.VSORRESU",
      "name": "VSORRESU",
      "label": "Diastolic BP Units",
      "dataType": "text",
      "codeList": "CL.MMHG_ONLY"
    }
  ]
}

The WC.VS.DIABP where-clause restricts this slice to rows where VSTESTCD = "DIABP":

{
  "OID": "WC.VS.DIABP",
  "name": "WC.VS.DIABP",
  "conditions": ["COND.VS.DIABP"]
}
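
The regrouping from "by variable" to "by parameter" can be illustrated with a small transformation. The input shape below (`value_lists`, with a `param` key per entry) is a made-up stand-in for value-level metadata and is not the converter's actual intermediate format; the OID patterns follow the examples in this section.

```python
from collections import defaultdict

# Hypothetical value-level metadata grouped by variable
value_lists = {
    "VSORRES": [
        {"param": "DIABP", "dataType": "float"},
        {"param": "SYSBP", "dataType": "float"},
    ],
    "VSORRESU": [
        {"param": "DIABP", "dataType": "text"},
        {"param": "SYSBP", "dataType": "text"},
    ],
}

def to_slices(domain: str, value_lists: dict) -> list:
    """Regroup per-variable entries into one slice per clinical parameter."""
    by_param = defaultdict(list)
    for variable, entries in value_lists.items():
        for entry in entries:
            item = {k: v for k, v in entry.items() if k != "param"}
            item["OID"] = f"IT.{domain}.{entry['param']}.{variable}"
            item["name"] = variable
            by_param[entry["param"]].append(item)
    return [
        {
            "OID": f"VL.{domain}.{param}",
            "name": f"VL.{domain}.{param}",
            "type": "DatasetSpecialization",
            "domain": domain,
            "applicableWhen": [f"WC.{domain}.{param}"],  # one shared where-clause per parameter
            "items": items,
        }
        for param, items in by_param.items()
    ]

slices = to_slices("VS", value_lists)
for s in slices:
    print(s["OID"], [i["name"] for i in s["items"]])
```

Each resulting slice carries both the result and the unit item for its parameter and points at a single where-clause.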

6. Conditions, WhereClauses, and RangeChecks

The three-level logic hierarchy

WhereClause          ← named context; referenced by items/groups via applicableWhen
  └── Condition[]    ← combined with AND within the clause
        ├── RangeCheck[]    ← simple value comparisons (EQ, NE, IN, GE, LE, etc.)
        ├── Condition[]     ← nested sub-conditions (recursive, for complex logic)
        └── FormalExpression[]  ← executable code for complex cases

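Multiple WhereClause references on the same element are combined with OR logic: "applies when ANY of these clauses matches". Within a clause, the listed Condition objects combine with AND. Inside each Condition, the operator slot (AND or OR) governs how its own range checks and nested sub-conditions combine, as the examples below show.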

Condition example

{
  "OID": "COND.VS.DIABP",
  "name": "COND.VS.DIABP",
  "operator": "AND",
  "rangeChecks": [
    {
      "item": "IT.VS.VSTESTCD",
      "comparator": "EQ",
      "checkValues": ["DIABP"],
      "softHard": "Hard"
    }
  ]
}

Comparator enum values

EQ, NE, LT, LE, GT, GE, IN, NOTIN

SoftHard semantics

| Value | Meaning |
|---|---|
| Hard | Error: data is invalid if check fails |
| Soft | Warning: data is unusual but not necessarily wrong |

Nested condition example (compound logic)

{
  "OID": "COND.ADVERSE_SERIOUS",
  "name": "COND.ADVERSE_SERIOUS",
  "operator": "AND",
  "conditions": [
    {
      "OID": "COND.AESER_YES",
      "operator": "AND",
      "rangeChecks": [
        { "item": "IT.AE.AESER", "comparator": "EQ", "checkValues": ["Y"] }
      ]
    },
    {
      "OID": "COND.AESEV_OR",
      "operator": "OR",
      "rangeChecks": [
        { "item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["SEVERE"] },
        { "item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["LIFE THREATENING"] }
      ]
    }
  ]
}
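
When reviewing nested conditions it can help to print them as a single boolean expression. The following is a sketch of one way to do that for inlined Condition objects like the one above; `render_condition` is a hypothetical helper, and rendering an empty condition as TRUE is an assumption.

```python
def render_condition(condition: dict) -> str:
    """Render a Condition (with inlined sub-conditions) as a readable expression."""
    operator = condition.get("operator", "AND")
    parts = []
    for rc in condition.get("rangeChecks", []):
        values = ", ".join(rc["checkValues"])
        parts.append(f"{rc['item']} {rc['comparator']} [{values}]")
    for sub in condition.get("conditions", []):
        parts.append(render_condition(sub))  # recurse into nested conditions
    if not parts:
        return "TRUE"  # assumption: an empty condition always holds
    if len(parts) == 1:
        return parts[0]
    return "(" + f" {operator} ".join(parts) + ")"

serious = {
    "OID": "COND.ADVERSE_SERIOUS",
    "operator": "AND",
    "conditions": [
        {"OID": "COND.AESER_YES", "operator": "AND", "rangeChecks": [
            {"item": "IT.AE.AESER", "comparator": "EQ", "checkValues": ["Y"]},
        ]},
        {"OID": "COND.AESEV_OR", "operator": "OR", "rangeChecks": [
            {"item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["SEVERE"]},
            {"item": "IT.AE.AESEV", "comparator": "EQ", "checkValues": ["LIFE THREATENING"]},
        ]},
    ],
}
print(render_condition(serious))
# (IT.AE.AESER EQ [Y] AND (IT.AE.AESEV EQ [SEVERE] OR IT.AE.AESEV EQ [LIFE THREATENING]))
```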

7. Methods, FormalExpressions, and Analysis

Method

A Method is a reusable derivation procedure. Items reference methods via method: "MT.CALC_BMI".

{
  "OID": "MT.CALC_BMI",
  "name": "MT.CALC_BMI",
  "label": "Calculate BMI",
  "type": "Computation",
  "expressions": [
    {
      "OID": "FE.CALC_BMI.SAS",
      "context": "SAS",
      "expression": "VSSTRESN = (WEIGHT_KG / (HEIGHT_M ** 2))",
      "returnType": "float",
      "parameters": [
        {
          "OID": "PARAM.WEIGHT_KG",
          "name": "WEIGHT_KG",
          "dataType": "float",
          "required": true
        },
        {
          "OID": "PARAM.HEIGHT_M",
          "name": "HEIGHT_M",
          "dataType": "float",
          "required": true
        }
      ]
    },
    {
      "OID": "FE.CALC_BMI.PYTHON",
      "context": "Python",
      "expression": "bmi = weight_kg / (height_m ** 2)",
      "returnType": "float"
    }
  ]
}
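
Since a Method can carry one FormalExpression per execution context, a consumer typically picks the one matching its runtime. A minimal sketch, assuming case-insensitive matching on the context slot (that matching rule is an assumption, and `expression_for` is a hypothetical helper):

```python
def expression_for(method: dict, context: str):
    """Return the first FormalExpression whose context matches, else None."""
    wanted = context.lower()
    for expr in method.get("expressions", []):
        if expr.get("context", "").lower() == wanted:
            return expr
    return None

# Abridged copy of the MT.CALC_BMI method above
method = {
    "OID": "MT.CALC_BMI",
    "expressions": [
        {
            "OID": "FE.CALC_BMI.SAS",
            "context": "SAS",
            "expression": "VSSTRESN = (WEIGHT_KG / (HEIGHT_M ** 2))",
            "parameters": [
                {"OID": "PARAM.WEIGHT_KG", "name": "WEIGHT_KG", "required": True},
                {"OID": "PARAM.HEIGHT_M", "name": "HEIGHT_M", "required": True},
            ],
        },
        {
            "OID": "FE.CALC_BMI.PYTHON",
            "context": "Python",
            "expression": "bmi = weight_kg / (height_m ** 2)",
        },
    ],
}

sas = expression_for(method, "sas")
required = [p["name"] for p in sas.get("parameters", []) if p.get("required")]
print(sas["OID"], required)
print(expression_for(method, "R"))  # no R expression is defined for this method
```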

Analysis

Analysis extends Method with study-specific traceability fields. Use it when you need to document why and from what an analysis was run, not just how.

{
  "OID": "AN.SUMMARY_VS",
  "name": "AN.SUMMARY_VS",
  "label": "Vital Signs Summary Statistics",
  "type": "Computation",
  "analysisReason": "Primary Efficacy",
  "analysisPurpose": "Exploratory",
  "inputData": ["IG.VS", "VL.VS.DIABP"],
  "expressions": [
    {
      "OID": "FE.SUMMARY_VS.R",
      "context": "R",
      "expression": "vs_summary <- vs_data %>% group_by(VSTESTCD, VISIT) %>% summarise(n=n(), mean=mean(VSSTRESN, na.rm=TRUE), sd=sd(VSSTRESN, na.rm=TRUE))"
    }
  ]
}

inputData accepts OIDs of ItemGroup or slice objects — make sure every referenced Item (e.g. analysis variables passed as Parameter) has its parent ItemGroup listed here.


8. CodeLists and Controlled Vocabulary

Inline CodeList

{
  "OID": "CL.VSTESTCD",
  "name": "VSTESTCD",
  "label": "Vital Signs Test Code",
  "dataType": "text",
  "standard": "CDISC/NCI",
  "codeListItems": [
    {
      "codedValue": "DIABP",
      "decode": "Diastolic Blood Pressure",
      "coding": [{ "code": "C25299", "codeSystem": "NCI", "codeSystemVersion": "2023-09-25" }]
    },
    {
      "codedValue": "SYSBP",
      "decode": "Systolic Blood Pressure",
      "coding": [{ "code": "C25298", "codeSystem": "NCI", "codeSystemVersion": "2023-09-25" }]
    },
    {
      "codedValue": "TEMP",
      "decode": "Temperature",
      "weight": 3.0,
      "coding": [{ "code": "C25206", "codeSystem": "NCI", "codeSystemVersion": "2023-09-25" }]
    }
  ]
}

External CodeList reference

When the full enumeration lives in an external system (MedDRA, SNOMED, LOINC), use externalCodeList instead of codeListItems:

{
  "OID": "CL.MEDDRA_PT",
  "name": "MEDDRA_PT",
  "label": "MedDRA Preferred Terms",
  "dataType": "text",
  "externalCodeList": {
    "OID": "RES.MEDDRA",
    "name": "MedDRA",
    "href": "https://www.meddra.org",
    "version": "26.1"
  }
}
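
A consumer has to treat the two CodeList forms differently: an inline list can be enumerated locally, while an external one cannot. The sketch below returns None for external lists so callers can skip or defer the membership check; `allowed_values` and `decode_value` are hypothetical helpers.

```python
def allowed_values(code_list: dict):
    """Return the set of permitted coded values, or None if the enumeration is external."""
    if code_list.get("externalCodeList"):
        return None
    return {entry["codedValue"] for entry in code_list.get("codeListItems", [])}

def decode_value(code_list: dict, coded_value: str):
    """Look up the decode for a coded value in an inline code list."""
    for entry in code_list.get("codeListItems", []):
        if entry["codedValue"] == coded_value:
            return entry.get("decode")
    return None

vstestcd = {
    "OID": "CL.VSTESTCD",
    "codeListItems": [
        {"codedValue": "DIABP", "decode": "Diastolic Blood Pressure"},
        {"codedValue": "SYSBP", "decode": "Systolic Blood Pressure"},
        {"codedValue": "TEMP", "decode": "Temperature"},
    ],
}
meddra = {
    "OID": "CL.MEDDRA_PT",
    "externalCodeList": {"OID": "RES.MEDDRA", "name": "MedDRA", "version": "26.1"},
}

print(sorted(allowed_values(vstestcd)))
print(decode_value(vstestcd, "SYSBP"))
print(allowed_values(meddra))  # None: values must be checked against the external dictionary
```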

DataType enum

text, integer, float, double, decimal, date, time, datetime, dateTime, boolean, base64Binary, hexBinary, anyURI


9. Semantic Layer: Coding and ReifiedConcept

This is what separates Define-JSON from a pure structural schema. Every element can be anchored to ontologies; datasets can declare which abstract biomedical concept they implement.

Coding

Attach standardised semantic tags to any element using the coding slot:

{
  "OID": "IT.VS.VSORRES",
  "name": "VSORRES",
  "coding": [
    {
      "code": "C25712",
      "codeSystem": "NCI",
      "codeSystemVersion": "2023-09-25",
      "decode": "Result",
      "aliasType": "SameAs"
    },
    {
      "code": "8480-6",
      "codeSystem": "LOINC",
      "codeSystemVersion": "2.76",
      "decode": "Systolic blood pressure",
      "aliasType": "NarrowMatch"
    }
  ]
}

aliasType (the AliasPredicate enum) controls the relationship semantics: SameAs, BroadMatch, NarrowMatch, RelatedMatch, Implements, IsA.
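
In practice the coding list is usually filtered by code system and, where match strength matters, by aliasType. A minimal sketch of such a filter (`codes_for` is a hypothetical helper):

```python
def codes_for(element: dict, code_system: str, alias_type=None) -> list:
    """Return the codes attached to an element for one code system, optionally by aliasType."""
    return [
        coding["code"]
        for coding in element.get("coding", [])
        if coding.get("codeSystem") == code_system
        and (alias_type is None or coding.get("aliasType") == alias_type)
    ]

# Abridged copy of the IT.VS.VSORRES item above
item = {
    "OID": "IT.VS.VSORRES",
    "coding": [
        {"code": "C25712", "codeSystem": "NCI", "aliasType": "SameAs"},
        {"code": "8480-6", "codeSystem": "LOINC", "aliasType": "NarrowMatch"},
    ],
}

print(codes_for(item, "LOINC"))
print(codes_for(item, "NCI", alias_type="SameAs"))
print(codes_for(item, "NCI", alias_type="NarrowMatch"))  # empty: the NCI coding is SameAs
```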

ReifiedConcept

ReifiedConcept makes an abstract concept — e.g. "Diastolic Blood Pressure" as defined in the CDISC Biomedical Concept model — explicit and referenceable. ItemGroups and Methods then declare that they implement it.

{
  "OID": "BC.DIABP",
  "name": "DiastolicBloodPressure",
  "label": "Diastolic Blood Pressure",
  "href": "https://library.cdisc.org/api/cosmos/v2/bc/C25299",
  "coding": [
    { "code": "C25299", "codeSystem": "NCI", "decode": "Diastolic Blood Pressure" }
  ],
  "properties": [
    {
      "OID": "BCP.DIABP.RESULT",
      "name": "result",
      "label": "Result Value",
      "minOccurs": 1,
      "maxOccurs": 1,
      "codeList": null
    },
    {
      "OID": "BCP.DIABP.UNIT",
      "name": "unit",
      "label": "Unit of Measure",
      "minOccurs": 1,
      "maxOccurs": 1,
      "codeList": "CL.MMHG_ONLY"
    }
  ]
}

An ItemGroup then declares "implementsConcept": "BC.DIABP" and each Item declares "conceptProperty": "BCP.DIABP.RESULT" — forming a typed, verifiable link from concrete implementation to abstract definition.
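
The link between implementation and concept can be checked mechanically. The sketch below flags items that point at a property the concept does not define, and concept properties with minOccurs of at least 1 that no item implements. Treating minOccurs as "must be implemented by some item" is an assumption made for illustration, and `check_concept_links` is a hypothetical helper.

```python
def check_concept_links(item_group: dict, concepts_by_oid: dict) -> list:
    """Return a list of problems found in the group's concept links."""
    concept_oid = item_group.get("implementsConcept")
    concept = concepts_by_oid.get(concept_oid)
    if concept is None:
        return [f"{item_group['OID']}: unknown concept {concept_oid!r}"]
    property_oids = {p["OID"] for p in concept.get("properties", [])}
    problems, implemented = [], set()
    for item in item_group.get("items", []):
        prop = item.get("conceptProperty")
        if prop is None:
            continue
        if prop not in property_oids:
            problems.append(f"{item['OID']}: {prop} is not a property of {concept_oid}")
        implemented.add(prop)
    for p in concept.get("properties", []):
        # Assumption: minOccurs >= 1 means at least one item must implement the property
        if p.get("minOccurs", 0) >= 1 and p["OID"] not in implemented:
            problems.append(f"{item_group['OID']}: required property {p['OID']} has no implementing item")
    return problems

concepts = {
    "BC.DIABP": {
        "OID": "BC.DIABP",
        "properties": [
            {"OID": "BCP.DIABP.RESULT", "minOccurs": 1},
            {"OID": "BCP.DIABP.UNIT", "minOccurs": 1},
        ],
    }
}
complete = {
    "OID": "VL.VS.DIABP",
    "implementsConcept": "BC.DIABP",
    "items": [
        {"OID": "IT.VS.DIABP.VSORRES", "conceptProperty": "BCP.DIABP.RESULT"},
        {"OID": "IT.VS.DIABP.VSORRESU", "conceptProperty": "BCP.DIABP.UNIT"},
    ],
}
incomplete = {**complete, "items": complete["items"][:1]}

print(check_concept_links(complete, concepts))
print(check_concept_links(incomplete, concepts))
```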


10. Data Products and Dataflows

For pipeline and data contract use cases, DataProduct and Dataflow express the supply/demand boundary.

Dataflow: the abstract contract

A Dataflow declares what structure is expected — before any concrete data exists. Think of it as an interface definition.

{
  "OID": "DF.VS_TRANSFER",
  "name": "DF.VS_TRANSFER",
  "label": "Vital Signs Transfer Agreement",
  "structure": "IG.VS",
  "dimensionConstraint": ["IT.VS.USUBJID", "IT.VS.VSTESTCD", "IT.VS.VISITNUM"],
  "version": "1.0"
}

DataProduct: the governed package

{
  "OID": "DP.CLINICAL_DATA_V1",
  "name": "ClinicalDataPackage",
  "label": "Clinical Data Package v1",
  "dataProductOwner": "Data Management Team",
  "lifecycleStatus": "Active",
  "domain": "SDTM",
  "inputDataflow": ["DF.VS_TRANSFER", "DF.LB_TRANSFER"],
  "outputDataset": ["DS.VS_FINAL", "DS.LB_FINAL"],
  "outputPort": [
    {
      "OID": "SVC.FHIR_API",
      "name": "FHIR R4 API",
      "protocol": "HTTPS",
      "resourceType": "HL7-FHIR",
      "href": "https://api.example.com/fhir/r4",
      "securitySchemaType": "OAuth2"
    }
  ]
}

Dataset: a concrete delivery

{
  "OID": "DS.VS_FINAL",
  "name": "vs_final.xpt",
  "structuredBy": "IG.VS",
  "describedBy": "DF.VS_TRANSFER",
  "conformsTo": "SDTM v1.8",
  "dataExtractionDate": "2024-01-10",
  "validFrom": "2023-01-01",
  "validTo": "2023-12-31",
  "distribution": [
    {
      "format": "application/x-xpt",
      "accessService": {
        "OID": "SVC.SFTP",
        "protocol": "SFTP",
        "href": "sftp://transfers.example.com/sdtm/"
      }
    }
  ]
}
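
One simple consistency check across these three objects is that a Dataset is structured by the same ItemGroup that its Dataflow contract names. This is a sketch under the assumption that structuredBy and the Dataflow's structure are both plain OID strings, as in the examples above; `check_dataset_against_dataflow` is a hypothetical helper.

```python
def check_dataset_against_dataflow(dataset: dict, dataflows_by_oid: dict) -> list:
    """Return problems where a concrete Dataset disagrees with its Dataflow contract."""
    flow_oid = dataset.get("describedBy")
    flow = dataflows_by_oid.get(flow_oid)
    if flow is None:
        return [f"{dataset['OID']}: unknown dataflow {flow_oid!r}"]
    problems = []
    if dataset.get("structuredBy") != flow.get("structure"):
        problems.append(
            f"{dataset['OID']}: structuredBy {dataset.get('structuredBy')!r} "
            f"does not match {flow_oid} structure {flow.get('structure')!r}"
        )
    return problems

dataflows = {
    "DF.VS_TRANSFER": {"OID": "DF.VS_TRANSFER", "structure": "IG.VS"},
}
good = {"OID": "DS.VS_FINAL", "structuredBy": "IG.VS", "describedBy": "DF.VS_TRANSFER"}
bad = {"OID": "DS.VS_DRAFT", "structuredBy": "IG.LB", "describedBy": "DF.VS_TRANSFER"}

print(check_dataset_against_dataflow(good, dataflows))
print(check_dataset_against_dataflow(bad, dataflows))
```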

11. Versioning and Provenance

The copy-and-link model

Each MetaDataVersion is a complete, immutable snapshot. Derivation is tracked via wasDerivedFrom, which accepts OID strings or typed object references.

{
  "OID": "MDV.STUDY-001.v2",
  "wasDerivedFrom": "MDV.STUDY-001.v1",
  "creationDateTime": "2024-06-01T09:00:00",
  "studyOID": "STUDY.001",
  "itemGroups": [
    {
      "OID": "IG.VS.v2",
      "wasDerivedFrom": "IG.VS.v1",
      "items": [
        {
          "OID": "IT.VS.VSORRES.v2",
          "wasDerivedFrom": "IT.VS.VSORRES.v1",
          "name": "VSORRES",
          "dataType": "float"
        }
      ]
    }
  ]
}

This pattern supports:

  • Template reuse: a study-specific ItemGroup derives from a CDISC standard template
  • Study amendment tracking: each protocol amendment creates a new MetaDataVersion linked to the prior one
  • Cross-study comparison: shared wasDerivedFrom ancestry identifies equivalent variables across studies
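
Following wasDerivedFrom links back to a common ancestor is a short graph walk. The sketch below handles only the OID-string form of wasDerivedFrom (typed object references are ignored, which is a simplification), and both helper names are hypothetical.

```python
def collect_derivations(mdv: dict, derived_from=None) -> dict:
    """Map each element OID to the OID it was derived from (string references only)."""
    derived_from = {} if derived_from is None else derived_from
    elements = [mdv]
    for ig in mdv.get("itemGroups", []):
        elements.append(ig)
        elements.extend(ig.get("items", []))
    for element in elements:
        parent = element.get("wasDerivedFrom")
        if isinstance(parent, str):
            derived_from[element["OID"]] = parent
    return derived_from

def lineage(oid: str, derived_from: dict) -> list:
    """Return the chain [oid, parent, grandparent, ...], guarding against cycles."""
    chain, seen = [oid], {oid}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            raise ValueError(f"wasDerivedFrom cycle at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain

v2 = {
    "OID": "MDV.STUDY-001.v2",
    "wasDerivedFrom": "MDV.STUDY-001.v1",
    "itemGroups": [{
        "OID": "IG.VS.v2",
        "wasDerivedFrom": "IG.VS.v1",
        "items": [{"OID": "IT.VS.VSORRES.v2", "wasDerivedFrom": "IT.VS.VSORRES.v1"}],
    }],
}
derived = collect_derivations(v2)
print(lineage("IT.VS.VSORRES.v2", derived))
```

Passing the same `derived_from` dictionary through several MetaDataVersion documents accumulates one lineage map, so two items from different studies can be compared by checking whether their chains share an ancestor.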

Item-level origin tracking

{
  "OID": "IT.VS.VSDY",
  "name": "VSDY",
  "label": "Study Day of Vital Signs",
  "dataType": "integer",
  "origin": {
    "type": "Derived",
    "source": "Sponsor",
    "sourceItems": [
      {
        "item": "IT.VS.VSDTC",
        "resource": ["IG.VS"],
        "document": null
      },
      {
        "item": "IT.DM.RFSTDTC",
        "resource": ["IG.DM"]
      }
    ]
  },
  "method": "MT.CALC_STUDY_DAY"
}

OriginType values: Collected, Derived, Assigned, Protocol, eDT
OriginSource values: Investigator, Sponsor, Subject, Vendor
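
The sourceItems list makes it possible to build a dependency map for derived variables, for example to decide the order in which derivations must run. A minimal sketch (`derivation_sources` is a hypothetical helper):

```python
def derivation_sources(items: list) -> dict:
    """Map each Derived item OID to the OIDs of the items it is derived from."""
    sources = {}
    for item in items:
        origin = item.get("origin") or {}
        if origin.get("type") == "Derived":
            sources[item["OID"]] = [s["item"] for s in origin.get("sourceItems", [])]
    return sources

items = [
    {
        "OID": "IT.VS.VSDY",
        "origin": {
            "type": "Derived",
            "source": "Sponsor",
            "sourceItems": [
                {"item": "IT.VS.VSDTC", "resource": ["IG.VS"]},
                {"item": "IT.DM.RFSTDTC", "resource": ["IG.DM"]},
            ],
        },
        "method": "MT.CALC_STUDY_DAY",
    },
    {"OID": "IT.VS.VSORRES", "origin": {"type": "Collected"}},
]
print(derivation_sources(items))
```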


12. XML ↔ JSON Conversion

Installation

git clone https://github.com/TeMeta/define-json.git
cd define-json
pip install poetry
poetry install

Python API

from src.define_json.converters.xml_to_json import PortableDefineXMLToJSONConverter
from src.define_json.converters.json_to_xml import DefineJSONToXMLConverter
from pathlib import Path

# Define-XML → Define-JSON
xml_converter = PortableDefineXMLToJSONConverter()
json_data = xml_converter.convert_file(
    Path('data/define-360i.xml'),
    Path('data/define-360i.json')
)

# Define-JSON → Define-XML (separate variable name, so the XML → JSON converter above is not shadowed)
json_converter = DefineJSONToXMLConverter()
xml_root = json_converter.convert_file(
    Path('data/define-360i.json'),
    Path('data/define-360i-recreated.xml')
)

CLI

# XML → JSON
poetry run python -m define_json xml2json data/define.xml data/output.json

# JSON → XML
poetry run python -m define_json json2xml data/input.json data/output.xml

# HTML rendering (no CORS issues for browser viewing)
poetry run python -m define_json json2html input.json output.html

# Roundtrip validation
poetry run python -m define_json roundtrip data/original.xml

# Schema validation
poetry run python -m define_json validate data/input.json

What the converter improves over raw Define-XML

| Aspect | Define-XML v2.1 | Define-JSON |
|---|---|---|
| ValueList grouping | By variable (VSORRES, VSORRESU) | By parameter (DIABP, SYSBP, TEMP), which is clinically meaningful |
| WhereClause deduplication | Separate WC per variable per parameter | Shared WC per parameter (27 → 14 in the 360i sample) |
| JSON size | N/A | ~33% smaller than source XML (98KB → 66KB) |
| Reference model | Nested XML with repeated attributes | Flat JSON with OID references |

13. Reverse Engineering Metadata from Data

Generate a Define-JSON skeleton from existing Dataset-JSON files:

python scripts/reverse_engineer_define.py examples/sample_dataset_lb.json

This produces four output files:

| File | Contents |
|---|---|
| define_metadata.json | Inferred Define-JSON structure (ItemGroups, Items, CodeLists) |
| sdmx_policy_suggestion.yaml | Suggested SDMX dimension/measure assignments |
| analysis_summary.json | Per-variable statistics and confidence scores for data type inference |
| reverse_engineering_report.md | Human-readable audit of the inference process |

Python API for reverse engineering

import json
from pathlib import Path

# Load Dataset-JSON
with open('examples/sample_dataset_lb.json') as f:
    dataset = json.load(f)

# The reverse engineering script outputs structured metadata
# that can be post-processed:
with open('define_metadata.json') as f:
    metadata = json.load(f)

# Iterate inferred items
for item_group in metadata.get('itemGroups', []):
    print(f"Dataset: {item_group['name']}")
    for item in item_group.get('items', []):
        print(f"  {item['name']}: {item['dataType']} (confidence: {item.get('confidence', 'n/a')})")

14. Working with the Model in Python

Loading and traversing a Define-JSON file

import json
from pathlib import Path

with open('data/define-360i.json') as f:
    mdv = json.load(f)

# Build an OID lookup for O(1) cross-reference resolution
oid_index = {}
for ig in mdv.get('itemGroups', []):
    oid_index[ig['OID']] = ig
    for item in ig.get('items', []):
        oid_index[item['OID']] = item
    # Slices are inlined ItemGroups; index them and their items too
    for sl in ig.get('slices', []):
        oid_index[sl['OID']] = sl
        for item in sl.get('items', []):
            oid_index[item['OID']] = item
# Top-level collections that other elements reference by OID string
for key in ('codeLists', 'methods', 'conditions', 'whereClauses'):
    for obj in mdv.get(key, []):
        oid_index[obj['OID']] = obj

# Resolve an item's code list
def get_codelist(item):
    cl_oid = item.get('codeList')
    if not cl_oid:
        return None
    return oid_index.get(cl_oid)

# Example: find all items with a specific data type
float_items = [
    (ig['name'], item['name'])
    for ig in mdv.get('itemGroups', [])
    for item in ig.get('items', [])
    if item.get('dataType') == 'float'
]

Resolving applicableWhen logic

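def resolve_condition(cond):
    """WhereClause.conditions holds OID strings; inlined Condition objects are accepted too."""
    if isinstance(cond, dict):
        return cond
    if cond not in oid_index:
        # Conditions live in the top-level conditions[] list; index them on first use
        oid_index.update({c['OID']: c for c in mdv.get('conditions', [])})
    return oid_index[cond]

def evaluate_where_clause(wc_oid, row: dict) -> bool:
    """Evaluate a WhereClause against a data row. Returns True if all conditions pass."""
    wc = oid_index[wc_oid]
    # Within a WhereClause, all conditions must be true (AND logic)
    return all(evaluate_condition(resolve_condition(cond), row)
               for cond in wc.get('conditions', []))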

def evaluate_condition(condition: dict, row: dict) -> bool:
    operator = condition.get('operator', 'AND')
    range_checks = condition.get('rangeChecks', [])
    sub_conditions = condition.get('conditions', [])

    results = []
    for rc in range_checks:
        item_oid = rc['item']
        item = oid_index.get(item_oid, {})
        value = row.get(item.get('name', ''))
        results.append(evaluate_range_check(rc, value))
    for sub in sub_conditions:
        results.append(evaluate_condition(sub, row))

    if operator == 'AND':
        return all(results)
    elif operator == 'OR':
        return any(results)
    return False

def evaluate_range_check(rc: dict, value) -> bool:
    comparator = rc['comparator']
    check_values = rc['checkValues']
    # Membership comparators: checkValues are serialised as strings
    if comparator in ('EQ', 'IN'):
        return str(value) in check_values
    if comparator in ('NE', 'NOTIN'):
        return str(value) not in check_values
    # Ordering comparators: compare numerically against the first check value
    try:
        actual, bound = float(value), float(check_values[0])
    except (TypeError, ValueError, IndexError):
        return False  # a missing or non-numeric value cannot satisfy an ordering check
    if comparator == 'GE':
        return actual >= bound
    if comparator == 'LE':
        return actual <= bound
    if comparator == 'GT':
        return actual > bound
    if comparator == 'LT':
        return actual < bound
    return False

# Determine which slice applies to a given row
def get_applicable_slice(item_group: dict, row: dict):
    for slice_group in item_group.get('slices', []):
        applicable_when = slice_group.get('applicableWhen', [])
        # OR logic: row matches if ANY where-clause matches
        if any(evaluate_where_clause(wc_oid, row) for wc_oid in applicable_when):
            return slice_group
    return None

Validating data against a Define-JSON definition

def validate_row(row: dict, item_group: dict) -> list[dict]:
    """Returns a list of validation failures for a data row."""
    failures = []

    # Start from the domain-level items, keyed by variable name
    items_by_name = {item['name']: item for item in item_group.get('items', [])}
    slice_group = get_applicable_slice(item_group, row)
    if slice_group:
        # Slice items override domain items of the same name; the rest still apply
        items_by_name.update({item['name']: item for item in slice_group.get('items', [])})

    for item_name, item in items_by_name.items():
        value = row.get(item_name)

        # Code list check
        cl = get_codelist(item)
        if cl and value is not None:
            allowed = {i['codedValue'] for i in cl.get('codeListItems', [])}
            if allowed and str(value) not in allowed:
                failures.append({
                    'item': item_name,
                    'severity': 'Hard',
                    'message': f"Value '{value}' not in code list {cl['OID']}"
                })

        # Range checks
        for rc in item.get('rangeChecks', []):
            if value is not None and not evaluate_range_check(rc, value):
                failures.append({
                    'item': item_name,
                    'severity': rc.get('softHard', 'Hard'),
                    'message': f"Range check failed: {rc['comparator']} {rc['checkValues']}"
                })

    return failures

15. OID Reference Patterns

OIDs are the primary key mechanism. The schema uses CDISC conventions for regulatory submissions but allows any string for internal use. Recommended patterns:

| Object type | Pattern | Example |
|---|---|---|
| MetaDataVersion | MDV.<study>.<version> | MDV.LZZT.v1 |
| ItemGroup | IG.<domain> | IG.VS |
| ItemGroup slice | VL.<domain>.<param> | VL.VS.DIABP |
| Item | IT.<domain>.<varname> | IT.VS.VSTESTCD |
| CodeList | CL.<name> | CL.VSTESTCD |
| Method | MT.<name> | MT.CALC_BMI |
| Analysis | AN.<name> | AN.SUMMARY_VS |
| WhereClause | WC.<domain>.<param> | WC.VS.DIABP |
| Condition | COND.<name> | COND.DIABP_FILTER |
| FormalExpression | FE.<method>.<context> | FE.CALC_BMI.SAS |
| ReifiedConcept | BC.<name> | BC.DIABP |
| Dataflow | DF.<name> | DF.VS_TRANSFER |
| Dataset | DS.<name> | DS.VS_FINAL |
| DataProduct | DP.<name> | DP.CLINICAL_DATA_V1 |
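
Teams that adopt these conventions may want a lint step that flags OIDs straying from the recommended prefixes. The sketch below only checks the prefix and that at least one non-empty dot-separated segment follows. Since the schema allows any string, this is a style check and not a validity check; the regular expression and helper name are assumptions.

```python
import re

# Recommended prefixes, taken from the table above
OID_PREFIXES = {
    "MetaDataVersion": "MDV", "ItemGroup": "IG", "ItemGroup slice": "VL",
    "Item": "IT", "CodeList": "CL", "Method": "MT", "Analysis": "AN",
    "WhereClause": "WC", "Condition": "COND", "FormalExpression": "FE",
    "ReifiedConcept": "BC", "Dataflow": "DF", "Dataset": "DS", "DataProduct": "DP",
}

def follows_pattern(object_type: str, oid: str) -> bool:
    """True if the OID starts with the recommended prefix followed by dot-separated segments."""
    prefix = re.escape(OID_PREFIXES[object_type])
    return re.fullmatch(rf"{prefix}(\.[^.\s]+)+", oid) is not None

print(follows_pattern("Item", "IT.VS.VSTESTCD"))
print(follows_pattern("MetaDataVersion", "MDV.STUDY-001.v1"))
print(follows_pattern("Item", "VS.VSTESTCD"))  # missing the IT prefix
```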

Cross-reference inlining rules

Understanding when objects are inlined vs. referenced by OID string is essential for correct parsing:

| Slot | Inlined? | Notes |
|---|---|---|
| MetaDataVersion.itemGroups | ✅ Yes | Full objects embedded |
| ItemGroup.items | ✅ Yes | Full objects embedded |
| ItemGroup.slices | ✅ Yes | Full objects embedded |
| MetaDataVersion.codeLists | ✅ Yes | Full objects embedded |
| MetaDataVersion.conditions | ✅ Yes | Full objects embedded |
| MetaDataVersion.whereClauses | ✅ Yes | Full objects embedded |
| Item.codeList | ❌ OID ref | String pointing to MetaDataVersion.codeLists[].OID |
| Item.method | ❌ OID ref | String pointing to MetaDataVersion.methods[].OID |
| Item.applicableWhen | ❌ OID ref list | Strings pointing to whereClauses[].OID |
| ItemGroup.applicableWhen | ❌ OID ref list | Strings pointing to whereClauses[].OID |
| ItemGroup.implementsConcept | ❌ OID ref | String pointing to concepts[].OID |
| MetaDataVersion.concepts | ❌ OID ref list | Not inlined; referenced by OID from itemGroups |

This means when building an index, you must traverse both inlined and top-level lists. The concepts array is notably not inlined despite being a top-level collection — it is always referenced by OID.
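
One way to avoid missing a collection when indexing is to walk the whole document recursively and register every object that carries an OID, then check reference slots against the result. This is a sketch and not part of the define-json package; it keeps the first definition when an OID appears more than once, and the list of reference slots checked is an assumption you would extend.

```python
def build_oid_index(node, index=None) -> dict:
    """Recursively index every dict that has a string OID, inlined or top-level."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        oid = node.get("OID")
        if isinstance(oid, str):
            index.setdefault(oid, node)  # first definition wins
        for value in node.values():
            build_oid_index(value, index)
    elif isinstance(node, list):
        for element in node:
            build_oid_index(element, index)
    return index

def dangling_refs(index: dict, slots=("codeList", "method")) -> list:
    """Return (owner OID, slot, target OID) for string references that resolve to nothing."""
    missing = []
    for oid, obj in index.items():
        for slot in slots:
            target = obj.get(slot)
            if isinstance(target, str) and target not in index:
                missing.append((oid, slot, target))
    return missing

mdv = {
    "OID": "MDV.LZZT.v1",
    "itemGroups": [{
        "OID": "IG.VS",
        "items": [
            {"OID": "IT.VS.VSTESTCD", "name": "VSTESTCD", "codeList": "CL.VSTESTCD"},
            {"OID": "IT.VS.VSORRESU", "name": "VSORRESU", "codeList": "CL.UNIT"},
        ],
        "slices": [{"OID": "VL.VS.DIABP", "items": [{"OID": "IT.VS.DIABP.VSORRES", "name": "VSORRES"}]}],
    }],
    "codeLists": [{"OID": "CL.VSTESTCD", "codeListItems": [{"codedValue": "DIABP"}]}],
}
index = build_oid_index(mdv)
print(sorted(index))
print(dangling_refs(index))  # CL.UNIT is referenced but never defined
```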

