feat: Add MIT Learn integration dbt layer (Cohort 1 catalog sources)#2262

Open
blarghmatey wants to merge 5 commits into main from feat/learn-integration-layer-cohort1-dbt-models
@blarghmatey (Member) commented May 28, 2026

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/11510

Description (What does it do?)

Introduces the integrations/learn dbt layer — stable application-level contract models that MIT Learn's Trino-pull ETL tasks consume directly via the BaseTrinoETLTask pattern. These models sit above the marts layer and expose a documented schema contract agreed on by both platform and application teams.

New dbt models (src/ol_dbt/models/integrations/learn/):

| Model | Source platforms |
| --- | --- |
| integrations__learn__ocw_courses | OCW Studio (Wagtail) |
| integrations__learn__mitxonline_courses | MITx Online app DB + CMS |
| integrations__learn__mitxonline_programs | MITx Online app DB + CMS |
| integrations__learn__xpro_courses | xPRO app DB + CMS |
| integrations__learn__xpro_programs | xPRO app DB + CMS |
| integrations__learn__micromasters_programs | MicroMasters app DB |

All models satisfy the required column contract (readable_id, title, last_modified, etl_source NOT NULL) with not_null and unique dbt tests in the schema YAML.
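
A consumer-side sketch, in Python, of what the required-column contract above implies for a batch of rows. The four column names come from this description; the `check_contract` helper, the dict-per-row shape, and the choice to check uniqueness on `readable_id` only are assumptions made for illustration and are not part of this PR. The authoritative checks remain the dbt tests in the schema YAML.

```python
from collections import Counter

# Required columns named in the PR description; all must be non-null.
REQUIRED_COLUMNS = ("readable_id", "title", "last_modified", "etl_source")


def check_contract(rows):
    """Return a list of human-readable contract violations for `rows`.

    `rows` is an iterable of dicts, one per model row (hypothetical shape).
    Uniqueness is checked on readable_id only, which is an assumption.
    """
    rows = list(rows)
    problems = []
    # not_null: every required column must be present and non-null.
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            if row.get(column) is None:
                problems.append(f"row {index}: {column} is null")
    # unique: no readable_id may appear more than once.
    counts = Counter(row.get("readable_id") for row in rows)
    for readable_id, count in counts.items():
        if readable_id is not None and count > 1:
            problems.append(f"readable_id {readable_id!r} appears {count} times")
    return problems
```

An empty list means the batch passes these two checks; anything else could be logged by the ETL task before it writes to the application database.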

Supporting changes:

  • dbt_project.yml: registers the integrations layer with table materialization, integrations schema, and a mit_learn_etl Trino role grant
  • docs/learn_marts_contract.md: schema contract spec — required columns, grain, nullability rules, valid etl_source enum values
  • dg_projects/learning_resources/CONTRIBUTING.md: contributor guide — local dev, Vault mock, three asset patterns (REST webhook, sensor-partitioned, Trino-pull), naming conventions

How can this be tested?

cd src/ol_dbt
dbt parse        # no new errors beyond pre-existing project-wide deprecation warnings
dbt ls --select 'integrations__learn*'   # expect 6 models listed

Once the mit_learn_etl Trino role is provisioned:

dbt run --select 'integrations__learn*'
dbt test --select 'integrations__learn*'

Additional Context

This PR delivers the data platform half of Phase 0 (Foundation) and the Cohort 1 dbt work from implementation_guide_01_db_catalog.md. The MIT Learn application-side Trino-pull Celery tasks (in mitodl/mit-learn) will reference these models, which are materialized as tables, by their fully-qualified Trino names, e.g. ol_warehouse_production.integrations.integrations__learn__ocw_courses.

Note: MicroMasters programs use current_timestamp as last_modified — the MicroMasters source DB has no last-modified column on programs. This is documented in the schema YAML and the contract doc.

Introduces the integrations/learn dbt layer — stable application-level
contract models that MIT Learn's Trino-pull ETL tasks consume directly.
These sit above the marts layer and expose the schema contract defined in
docs/learn_marts_contract.md.

New models (ol_dbt/models/integrations/learn/):
- integrations__learn__ocw_courses
- integrations__learn__mitxonline_courses
- integrations__learn__mitxonline_programs
- integrations__learn__xpro_courses
- integrations__learn__xpro_programs
- integrations__learn__micromasters_programs

All models satisfy the required column contract (readable_id, title,
last_modified, etl_source NOT NULL) and include not_null + unique tests
in the schema YAML.

Also adds:
- dbt_project.yml: integrations layer with table materialization,
  integrations schema, and mit_learn_etl Trino role grant
- docs/learn_marts_contract.md: schema contract spec for the layer
- dg_projects/learning_resources/CONTRIBUTING.md: contributor guide
  covering local dev, vault mock, asset patterns, and dlt guidance

Refs: MIT_LEARN_ETL_MIGRATION.md Cohort 1 — DB-backed catalog sources
Copilot AI review requested due to automatic review settings May 28, 2026 21:16

Copilot AI (Contributor) left a comment

Pull request overview

Introduces a new dbt integration contract layer for MIT Learn catalog ingestion, exposing Learn-facing course/program tables above the warehouse marts and documenting how downstream Trino-pull ETL should consume them.

Changes:

  • Adds six Learn integration dbt models for OCW, MITx Online, xPRO, and MicroMasters catalog entities.
  • Registers the new integrations dbt layer and grants access to the MIT Learn ETL role.
  • Adds contract and contributor documentation for Learn integration workflows.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/ol_dbt/models/integrations/learn/integrations__learn__ocw_courses.sql | Adds OCW course integration model. |
| src/ol_dbt/models/integrations/learn/integrations__learn__mitxonline_courses.sql | Adds MITx Online course integration model. |
| src/ol_dbt/models/integrations/learn/integrations__learn__mitxonline_programs.sql | Adds MITx Online program integration model. |
| src/ol_dbt/models/integrations/learn/integrations__learn__xpro_courses.sql | Adds xPRO course integration model. |
| src/ol_dbt/models/integrations/learn/integrations__learn__xpro_programs.sql | Adds xPRO program integration model. |
| src/ol_dbt/models/integrations/learn/integrations__learn__micromasters_programs.sql | Adds MicroMasters program integration model. |
| src/ol_dbt/models/integrations/learn/_integrations__learn__schema.yml | Documents and tests required columns for the new integration models. |
| src/ol_dbt/dbt_project.yml | Configures the new integrations layer materialization, schema, and grants. |
| docs/learn_marts_contract.md | Adds the MIT Learn integration schema contract. |
| dg_projects/learning_resources/CONTRIBUTING.md | Adds contributor guidance for Learn catalog delivery patterns. |

Comment thread src/ol_dbt/models/integrations/learn/integrations__learn__xpro_programs.sql Outdated
Comment thread src/ol_dbt/models/integrations/learn/integrations__learn__xpro_courses.sql Outdated
Comment thread src/ol_dbt/models/integrations/learn/integrations__learn__ocw_courses.sql Outdated
Comment thread src/ol_dbt/models/integrations/learn/integrations__learn__mitxonline_programs.sql Outdated
Comment thread src/ol_dbt/models/integrations/learn/integrations__learn__mitxonline_courses.sql Outdated
Comment thread docs/learn_marts_contract.md Outdated
Comment thread docs/learn_marts_contract.md Outdated
Comment thread dg_projects/learning_resources/CONTRIBUTING.md Outdated
Comment thread dg_projects/learning_resources/CONTRIBUTING.md
- Replace native array_join() calls with {{ array_join() }} dispatched
  macro throughout all six integration models, ensuring DuckDB
  compatibility for local dev runs (Copilot feedback)

- Refactor complex concat expressions into per-row CTEs before
  aggregation so the macro receives simple column references

- Simplify integrations__learn__xpro_programs: source course membership
  from int__mitxpro__courses (which carries program_id directly) rather
  than joining through Wagtail CMS page tables

- Fix integrations__learn__micromasters_programs: use program_updated_on
  from stg__micromasters__app__postgres__courses_program instead of
  current_timestamp; int__micromasters__programs does not expose this
  column so add a targeted stg_programs CTE

- Update docs/learn_marts_contract.md: document topics, instructors,
  runs, and courses columns as delimited varchar rather than JSON arrays,
  matching the actual model output types (Copilot feedback)

- Fix CONTRIBUTING.md: correct 'From the repo root' cd path (remove
  spurious ol-data-platform/ prefix); add sensor-driven partitioned
  pattern to the delivery patterns table so the count matches (Copilot
  feedback)

- Rewrite _integrations__learn__schema.yml cleanly: fix structural YAML
  issue introduced by partial edit; add column docs for topics,
  instructors, departments; update MicroMasters last_modified description
  to reflect program_updated_on source

@blarghmatey (Member, Author) commented

Copilot review — status update

All 10 comments from the automated Copilot review have been addressed. A summary of the fixes in commit ed810ede:

array_join macro fixes — Every SQL file that called Trino's native array_join() directly has been refactored to use the project's dispatched {{ array_join() }} macro, ensuring DuckDB compatibility for local dev. Complex concat() expressions were moved into per-row CTEs before aggregation so the macro receives plain column references.

integrations__learn__xpro_programs join simplification — Replaced a 4-table Wagtail CMS page join chain with a direct query on int__mitxpro__courses, which carries program_id directly (confirmed via OpenMetadata lineage).

integrations__learn__micromasters_programs timestamp fix — Replaced current_timestamp with program_updated_on from the staging table. int__micromasters__programs doesn't surface this column, so a targeted stg_programs CTE was added.

docs/learn_marts_contract.md type corrections: topics, instructors, runs, and courses are now documented as delimiter-separated varchars, not JSON arrays.

CONTRIBUTING.md copy fixes — The cd path was corrected (removed spurious ol-data-platform/ prefix), and the sensor-driven partitioned pattern was added to the delivery patterns table so the count matches the prose.

The 10 Copilot comments are all anchored to commit aba746ef (the pre-commit auto-fix commit), which predates the fix commit. They are stale and can be dismissed.

Sources from stg__edxorg__api__course and stg__edxorg__api__courserun
(edX.org S3 exports via IRx). Resolves the open source-decision question
from the Cohort 1 implementation guide: edxorg dg_project is the correct
source for the public-facing MITx catalog on edx.org; mitxresidential
covers MIT's internal OpenEdX deployment and is not appropriate here.

Model details:
- Filters to courses with at least one published run
- course_topics: array_join on the array(varchar) from staging
- instructors: cross-join unnest over courserun_instructors JSON,
  deduped and comma-joined
- runs: pipe-delimited run metadata (key|start|end|published) joined
  with semicolons via the dispatched array_join macro

Adds schema YAML entry with not_null/unique tests on readable_id and
etl_source, plus column docs for runs, topics, and instructors.
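
For readers on the consuming side, a minimal Python sketch of unpacking the `runs` column as described in the commit message above: records joined with semicolons, and fields within a record joined with pipes in the order key, start, end, published. The `parse_runs` name, the `Run` tuple, and the string rendering of each field (for example how `published` is spelled) are assumptions; the contract doc is the authority on the exact encoding.

```python
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    """One course run, with every field kept as the raw string from the column."""
    key: str
    start: str
    end: str
    published: str


def parse_runs(value: Optional[str]) -> List[Run]:
    """Split a `runs` varchar into Run tuples.

    Records are separated by ';' and fields by '|'. A null or empty
    column yields an empty list.
    """
    if not value:
        return []
    runs = []
    for record in value.split(";"):
        record = record.strip()
        if not record:
            continue
        fields = record.split("|")
        if len(fields) != 4:
            raise ValueError(
                f"expected 4 pipe-delimited fields, got {len(fields)}: {record!r}"
            )
        runs.append(Run(*fields))
    return runs
```

The sketch deliberately leaves dates and the published flag as strings, since their exact formats are not stated here; a real task would convert them once the contract pins those formats down.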

@rachellougee (Contributor) left a comment

Is this intended to replace Learn's current API ingestion, which pulls from api/v2/courses/ and api/v2/programs/ directly on each source? If so, it would add another layer to their ingestion path, though that might not be a big deal.

Also, none of these models run successfully, though the code looks fine.

19:07:35  Finished running 1 project hook, 7 table models, 35 data tests in 0 hours 0 minutes and 15.04 seconds (15.04s).
19:07:36  
19:07:36  Completed with 7 errors, 0 partial successes, and 0 warnings:
19:07:36  
19:07:36  Failure in model integrations__learn__mitxonline_courses (models/integrations/learn/integrations__learn__mitxonline_courses.sql)

@blarghmatey (Member, Author) commented

The goal is to start pushing the ingestion in Learn further along on the processing chain so that it has less work to do in the application and more of it happens in the data platform. The idea is that Learn will move to the Celery Trino ingest model that you built in MITx Online. I've got a branch locally with the beginnings of that work that I'll be pushing next week.

@rachellougee (Contributor) commented

Sounds good. Once you've resolved the dbt errors in these 7 new models, that should be sufficient.

…n models

- macros/cross_db_functions.sql: add unnest_json_array cross-db macro
  Trino: unnest(try_cast(json_parse(expr) as array(json))) as alias(col)
  DuckDB: unnest(try_cast(expr as json[]))   as alias(col)

- macros/json_extract_scalar.sql: fix DuckDB dispatch
  Replace json_extract (returns quoted JSON) with json_extract_string
  (returns plain VARCHAR), matching Trino json_extract_scalar semantics

- integrations__learn__mit_edx_courses: two fixes
  * Use {{ unnest_json_array() }} macro instead of inline Trino syntax
  * Use {{ json_extract_scalar() }} macro calls so DuckDB dispatch fires

- integrations__learn__micromasters_programs:
  course_readable_id does not exist on stg__micromasters__app__postgres__courses_course;
  correct column is course_edx_key

- mitxonline_courses/programs, xpro_courses/programs: fix last_modified nulls
  The CMS page join returns null for courses/programs without a Wagtail page.
  Add {{ cast_timestamp_to_iso8601('current_timestamp') }} as the final coalesce
  fallback so last_modified is always non-null (varchar-typed on both engines).
  Update schema.yml descriptions accordingly.

@rachellougee (Contributor) left a comment

@blarghmatey There are still errors when running dbt against Trino. They are all related to bare null literals in the final select: Trino can't infer the column type of a null without an explicit cast.

13:28:17  Finished running 1 project hook, 7 table models, 35 data tests in 0 hours 1 minutes and 1.86 seconds (61.86s).
13:28:18  
13:28:18  Completed with 7 errors, 0 partial successes, and 0 warnings:
13:28:18  
13:28:18  Failure in model integrations__learn__mitxonline_courses (models/integrations/learn/integrations__learn__mitxonline_courses.sql)
13:28:18    Database Error in model integrations__learn__mitxonline_courses (models/integrations/learn/integrations__learn__mitxonline_courses.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132813_00237_gps54)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__mitxonline_courses.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__mitxonline_courses.sql
13:28:18  
13:28:18  Failure in model integrations__learn__mitxonline_programs (models/integrations/learn/integrations__learn__mitxonline_programs.sql)
13:28:18    Database Error in model integrations__learn__mitxonline_programs (models/integrations/learn/integrations__learn__mitxonline_programs.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132813_01476_w5ckq)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__mitxonline_programs.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__mitxonline_programs.sql
13:28:18  
13:28:18  Failure in model integrations__learn__mit_edx_courses (models/integrations/learn/integrations__learn__mit_edx_courses.sql)
13:28:18    Database Error in model integrations__learn__mit_edx_courses (models/integrations/learn/integrations__learn__mit_edx_courses.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: page_slug", query_id=20260601_132813_02001_z3nde)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__mit_edx_courses.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__mit_edx_courses.sql
13:28:18  
13:28:18  Failure in model integrations__learn__xpro_programs (models/integrations/learn/integrations__learn__xpro_programs.sql)
13:28:18    Database Error in model integrations__learn__xpro_programs (models/integrations/learn/integrations__learn__xpro_programs.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132815_00820_r6v86)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__xpro_programs.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__xpro_programs.sql
13:28:18  
13:28:18  Failure in model integrations__learn__micromasters_programs (models/integrations/learn/integrations__learn__micromasters_programs.sql)
13:28:18    Database Error in model integrations__learn__micromasters_programs (models/integrations/learn/integrations__learn__micromasters_programs.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: url", query_id=20260601_132813_00198_buyqq)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__micromasters_programs.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__micromasters_programs.sql
13:28:18  
13:28:18  Failure in model integrations__learn__xpro_courses (models/integrations/learn/integrations__learn__xpro_courses.sql)
13:28:18    Database Error in model integrations__learn__xpro_courses (models/integrations/learn/integrations__learn__xpro_courses.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132815_00250_gps54)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__xpro_courses.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__xpro_courses.sql
13:28:18  
13:28:18  Failure in model integrations__learn__ocw_courses (models/integrations/learn/integrations__learn__ocw_courses.sql)
13:28:18    Database Error in model integrations__learn__ocw_courses (models/integrations/learn/integrations__learn__ocw_courses.sql)
  TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132815_00928_hzdx3)
  compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__ocw_courses.sql
13:28:18  
13:28:18    compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__ocw_courses.sql
13:28:18  
13:28:18  Done. PASS=1 WARN=0 ERROR=7 SKIP=35 NO-OP=0 TOTAL=43

Can you take a look?

Comment on lines +37 to +38
, null as url
, null as image_url

Suggested change
- , null as url
- , null as image_url
+ , cast(null as varchar) as url
+ , cast(null as varchar) as image_url

This fixes the runtime error:

 TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132813_00237_gps54)
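
Since every failure in the log comes from a bare `null` projected under an alias, a rough pre-flight check over the compiled SQL could flag these before a Trino run. The Python sketch below is one way to do that; `find_untyped_nulls` is a hypothetical helper, not something in this repository, and the regex is a heuristic that ignores comments, string literals, and expressions such as `x is null as flag`.

```python
import re

# Matches `null as <alias>`, optionally preceded by `cast(`. The optional
# group distinguishes typed nulls (cast(null as varchar)) from bare ones.
_NULL_ALIAS = re.compile(r"(cast\s*\(\s*)?\bnull\s+as\s+(\w+)", re.IGNORECASE)


def find_untyped_nulls(sql: str) -> list:
    """Return the aliases that are projected from a bare `null` literal."""
    return [
        match.group(2)
        for match in _NULL_ALIAS.finditer(sql)
        if match.group(1) is None  # skip nulls already wrapped in a cast
    ]
```

Run against the files under target/compiled/, any non-empty result points at a column that would likely trigger COLUMN_TYPE_UNKNOWN on Trino and needs an explicit cast.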

, {{ array_join('courses.course_topics', ', ') }} as topics
, course_instructors.instructors as instructors
, null as certification_type
, null as price

We can revisit these nulls later, but some of these fields are probably available in the edX course API.

3 participants