feat: Add MIT Learn integration dbt layer (Cohort 1 catalog sources) #2262
blarghmatey wants to merge 5 commits into
Conversation
Introduces the integrations/learn dbt layer: stable application-level contract models that MIT Learn's Trino-pull ETL tasks consume directly. These sit above the marts layer and expose the schema contract defined in docs/learn_marts_contract.md.

New models (ol_dbt/models/integrations/learn/):
- integrations__learn__ocw_courses
- integrations__learn__mitxonline_courses
- integrations__learn__mitxonline_programs
- integrations__learn__xpro_courses
- integrations__learn__xpro_programs
- integrations__learn__micromasters_programs

All models satisfy the required column contract (readable_id, title, last_modified, etl_source NOT NULL) and include not_null + unique tests in the schema YAML.

Also adds:
- dbt_project.yml: integrations layer with table materialization, integrations schema, and mit_learn_etl Trino role grant
- docs/learn_marts_contract.md: schema contract spec for the layer
- dg_projects/learning_resources/CONTRIBUTING.md: contributor guide covering local dev, vault mock, asset patterns, and dlt guidance

Refs: MIT_LEARN_ETL_MIGRATION.md Cohort 1: DB-backed catalog sources
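To make the required-column contract above concrete, here is a minimal consumer-side sketch of the same checks. It is not part of this PR, where the contract is enforced by the dbt `not_null` and `unique` tests. The function name `validate_contract_rows`, the dict-per-row shape, and the assumption that uniqueness applies to `readable_id` are all illustrative.

```python
from collections import Counter

# Columns the contract requires to be non-null (taken from the PR description).
REQUIRED_COLUMNS = ("readable_id", "title", "last_modified", "etl_source")


def validate_contract_rows(rows):
    """Return a list of human-readable contract violations (empty if none).

    `rows` is assumed to be a list of dicts keyed by column name; that shape
    is hypothetical and chosen only for illustration.
    """
    problems = []
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            if row.get(column) is None:
                problems.append(f"row {index}: {column} is null")

    # Assumption: the unique test targets readable_id within a single model.
    counts = Counter(
        row["readable_id"] for row in rows if row.get("readable_id") is not None
    )
    for readable_id, count in counts.items():
        if count > 1:
            problems.append(f"readable_id {readable_id!r} appears {count} times")
    return problems
```

A check like this could run in the application before rows are loaded, as a second line of defence behind the dbt tests.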
for more information, see https://pre-commit.ci
Pull request overview
Introduces a new dbt integration contract layer for MIT Learn catalog ingestion, exposing Learn-facing course/program tables above the warehouse marts and documenting how downstream Trino-pull ETL should consume them.
Changes:
- Adds six Learn integration dbt models for OCW, MITx Online, xPRO, and MicroMasters catalog entities.
- Registers the new integrations dbt layer and grants access to the MIT Learn ETL role.
- Adds contract and contributor documentation for Learn integration workflows.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `src/ol_dbt/models/integrations/learn/integrations__learn__ocw_courses.sql` | Adds OCW course integration model. |
| `src/ol_dbt/models/integrations/learn/integrations__learn__mitxonline_courses.sql` | Adds MITx Online course integration model. |
| `src/ol_dbt/models/integrations/learn/integrations__learn__mitxonline_programs.sql` | Adds MITx Online program integration model. |
| `src/ol_dbt/models/integrations/learn/integrations__learn__xpro_courses.sql` | Adds xPRO course integration model. |
| `src/ol_dbt/models/integrations/learn/integrations__learn__xpro_programs.sql` | Adds xPRO program integration model. |
| `src/ol_dbt/models/integrations/learn/integrations__learn__micromasters_programs.sql` | Adds MicroMasters program integration model. |
| `src/ol_dbt/models/integrations/learn/_integrations__learn__schema.yml` | Documents and tests required columns for the new integration models. |
| `src/ol_dbt/dbt_project.yml` | Configures the new integrations layer materialization, schema, and grants. |
| `docs/learn_marts_contract.md` | Adds the MIT Learn integration schema contract. |
| `dg_projects/learning_resources/CONTRIBUTING.md` | Adds contributor guidance for Learn catalog delivery patterns. |
- Replace native array_join() calls with {{ array_join() }} dispatched
macro throughout all six integration models, ensuring DuckDB
compatibility for local dev runs (Copilot feedback)
- Refactor complex concat expressions into per-row CTEs before
aggregation so the macro receives simple column references
- Simplify integrations__learn__xpro_programs: source course membership
from int__mitxpro__courses (which carries program_id directly) rather
than joining through Wagtail CMS page tables
- Fix integrations__learn__micromasters_programs: use program_updated_on
from stg__micromasters__app__postgres__courses_program instead of
current_timestamp; int__micromasters__programs does not expose this
column so add a targeted stg_programs CTE
- Update docs/learn_marts_contract.md: document topics, instructors,
runs, and courses columns as delimited varchar rather than JSON arrays,
matching the actual model output types (Copilot feedback)
- Fix CONTRIBUTING.md: correct 'From the repo root' cd path (remove
spurious ol-data-platform/ prefix); add sensor-driven partitioned
pattern to the delivery patterns table so the count matches (Copilot
feedback)
- Rewrite _integrations__learn__schema.yml cleanly: fix structural YAML
issue introduced by partial edit; add column docs for topics,
instructors, departments; update MicroMasters last_modified description
to reflect program_updated_on source
Copilot review: status update
All 10 comments from the automated Copilot review have been addressed. A summary of the fixes in commit 5×
The 10 Copilot comments are all anchored to commit
Sources from stg__edxorg__api__course and stg__edxorg__api__courserun (edX.org S3 exports via IRx). Resolves the open source-decision question from the Cohort 1 implementation guide: edxorg dg_project is the correct source for the public-facing MITx catalog on edx.org; mitxresidential covers MIT's internal OpenEdX deployment and is not appropriate here.

Model details:
- Filters to courses with at least one published run
- course_topics: array_join on the array(varchar) from staging
- instructors: cross-join unnest over courserun_instructors JSON, deduped and comma-joined
- runs: pipe-delimited run metadata (key|start|end|published) joined with semicolons via the dispatched array_join macro

Adds schema YAML entry with not_null/unique tests on readable_id and etl_source, plus column docs for runs, topics, and instructors.
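For readers on the consuming side, the `runs` encoding described above (pipe-delimited `key|start|end|published` records joined with semicolons) could be unpacked roughly as follows. This is a sketch only: the `CourseRun` and `parse_runs` names are hypothetical, empty fields are assumed to stand for nulls, and `published` is kept as a raw string because the commit message does not say how the flag is rendered.

```python
from typing import NamedTuple, Optional


class CourseRun(NamedTuple):
    """One run record; field order follows key|start|end|published."""

    key: str
    start: Optional[str]
    end: Optional[str]
    published: Optional[str]  # left as text; the exact encoding is not specified


def parse_runs(runs: Optional[str]) -> list:
    """Split a semicolon-joined, pipe-delimited runs string into CourseRun tuples."""
    if not runs:
        return []
    parsed = []
    for chunk in runs.split(";"):
        if not chunk.strip():
            continue
        fields = chunk.split("|")
        # Pad short records so unpacking never fails on missing trailing fields.
        fields += [""] * (4 - len(fields))
        key, start, end, published = [field.strip() or None for field in fields[:4]]
        if key is None:
            continue  # a record without a run key is not usable
        parsed.append(CourseRun(key, start, end, published))
    return parsed
```

The delimited-varchar format keeps the column type identical across Trino and DuckDB, at the cost of pushing this small parsing step onto the consumer.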
rachellougee left a comment
Is this intended to replace Learn's current API ingestion, which pulls from api/v2/courses/ and api/v2/programs/ directly on each source? If so, it would add another layer to their ingestion path, though that might not be a big deal.
Also, none of these models run successfully, though the code looks fine.
19:07:35 Finished running 1 project hook, 7 table models, 35 data tests in 0 hours 0 minutes and 15.04 seconds (15.04s).
19:07:36
19:07:36 Completed with 7 errors, 0 partial successes, and 0 warnings:
19:07:36
19:07:36 Failure in model integrations__learn__mitxonline_courses (models/integrations/learn/integrations__learn__mitxonline_courses.sql)
The goal is to start pushing the ingestion in Learn further along on the processing chain so that it has less work to do in the application and more of it happens in the data platform. The idea is that Learn will move to the Celery Trino ingest model that you built in MITx Online. I've got a branch locally with the beginnings of that work that I'll be pushing next week.

Sounds good. Once you've resolved the dbt errors in these 7 new models, that should be sufficient.
…n models
- macros/cross_db_functions.sql: add unnest_json_array cross-db macro
Trino: unnest(try_cast(json_parse(expr) as array(json))) as alias(col)
DuckDB: unnest(try_cast(expr as json[])) as alias(col)
- macros/json_extract_scalar.sql: fix DuckDB dispatch
Replace json_extract (returns quoted JSON) with json_extract_string
(returns plain VARCHAR), matching Trino json_extract_scalar semantics
- integrations__learn__mit_edx_courses: two fixes
* Use {{ unnest_json_array() }} macro instead of inline Trino syntax
* Use {{ json_extract_scalar() }} macro calls so DuckDB dispatch fires
- integrations__learn__micromasters_programs:
course_readable_id does not exist on stg__micromasters__app__postgres__courses_course;
correct column is course_edx_key
- mitxonline_courses/programs, xpro_courses/programs: fix last_modified nulls
The CMS page join returns null for courses/programs without a Wagtail page.
Add {{ cast_timestamp_to_iso8601('current_timestamp') }} as the final coalesce
fallback so last_modified is always non-null (varchar-typed on both engines).
Update schema.yml descriptions accordingly.
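The `json_extract_scalar` fix above turns on the difference between a JSON-encoded string and its plain text value. The snippet below is only a Python analogy of that distinction, not the DuckDB or Trino implementation; the helper names `extract_as_json` and `extract_as_text` are made up for illustration.

```python
import json


def extract_as_json(document: str, key: str) -> str:
    """Return the value re-encoded as JSON, so strings keep their quotes."""
    return json.dumps(json.loads(document)[key])


def extract_as_text(document: str, key: str):
    """Return scalars as plain text; non-scalars and JSON null become None."""
    value = json.loads(document)[key]
    if value is None or isinstance(value, (dict, list)):
        return None
    if isinstance(value, str):
        return value  # no surrounding quotes
    return json.dumps(value)  # numbers and booleans rendered as text
```

With a quoted result, downstream comparisons and string joins would carry literal `"` characters, which is the behaviour the DuckDB dispatch change is described as removing.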
rachellougee left a comment
@blarghmatey There are still errors when running dbt against Trino. They are all related to a bare null in the final select: Trino can't infer the type of null without an explicit cast.
13:28:17 Finished running 1 project hook, 7 table models, 35 data tests in 0 hours 1 minutes and 1.86 seconds (61.86s).
13:28:18
13:28:18 Completed with 7 errors, 0 partial successes, and 0 warnings:
13:28:18
13:28:18 Failure in model integrations__learn__mitxonline_courses (models/integrations/learn/integrations__learn__mitxonline_courses.sql)
13:28:18 Database Error in model integrations__learn__mitxonline_courses (models/integrations/learn/integrations__learn__mitxonline_courses.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132813_00237_gps54)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__mitxonline_courses.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__mitxonline_courses.sql
13:28:18
13:28:18 Failure in model integrations__learn__mitxonline_programs (models/integrations/learn/integrations__learn__mitxonline_programs.sql)
13:28:18 Database Error in model integrations__learn__mitxonline_programs (models/integrations/learn/integrations__learn__mitxonline_programs.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132813_01476_w5ckq)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__mitxonline_programs.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__mitxonline_programs.sql
13:28:18
13:28:18 Failure in model integrations__learn__mit_edx_courses (models/integrations/learn/integrations__learn__mit_edx_courses.sql)
13:28:18 Database Error in model integrations__learn__mit_edx_courses (models/integrations/learn/integrations__learn__mit_edx_courses.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: page_slug", query_id=20260601_132813_02001_z3nde)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__mit_edx_courses.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__mit_edx_courses.sql
13:28:18
13:28:18 Failure in model integrations__learn__xpro_programs (models/integrations/learn/integrations__learn__xpro_programs.sql)
13:28:18 Database Error in model integrations__learn__xpro_programs (models/integrations/learn/integrations__learn__xpro_programs.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132815_00820_r6v86)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__xpro_programs.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__xpro_programs.sql
13:28:18
13:28:18 Failure in model integrations__learn__micromasters_programs (models/integrations/learn/integrations__learn__micromasters_programs.sql)
13:28:18 Database Error in model integrations__learn__micromasters_programs (models/integrations/learn/integrations__learn__micromasters_programs.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: url", query_id=20260601_132813_00198_buyqq)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__micromasters_programs.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__micromasters_programs.sql
13:28:18
13:28:18 Failure in model integrations__learn__xpro_courses (models/integrations/learn/integrations__learn__xpro_courses.sql)
13:28:18 Database Error in model integrations__learn__xpro_courses (models/integrations/learn/integrations__learn__xpro_courses.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132815_00250_gps54)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__xpro_courses.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__xpro_courses.sql
13:28:18
13:28:18 Failure in model integrations__learn__ocw_courses (models/integrations/learn/integrations__learn__ocw_courses.sql)
13:28:18 Database Error in model integrations__learn__ocw_courses (models/integrations/learn/integrations__learn__ocw_courses.sql)
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132815_00928_hzdx3)
compiled code at target/run/open_learning/models/integrations/learn/integrations__learn__ocw_courses.sql
13:28:18
13:28:18 compiled code at target/compiled/open_learning/models/integrations/learn/integrations__learn__ocw_courses.sql
13:28:18
13:28:18 Done. PASS=1 WARN=0 ERROR=7 SKIP=35 NO-OP=0 TOTAL=43
Can you take a look?
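With seven near-identical failures, a quick way to see which column each model trips on is to scan the dbt output. The following is a rough sketch that assumes the log layout shown above (a `Failure in model ...` line followed by a `COLUMN_TYPE_UNKNOWN` error); `summarize_unknown_columns` is a hypothetical helper, not a dbt feature.

```python
import re

# Matches the per-model header line in the dbt run summary.
FAILURE_RE = re.compile(r"Failure in model (\w+)")
# Matches the Trino error and captures the column whose type is unknown.
UNKNOWN_COLUMN_RE = re.compile(
    r'name=COLUMN_TYPE_UNKNOWN, message="[^"]*Column type is unknown: (\w+)"'
)


def summarize_unknown_columns(log_text: str) -> dict:
    """Map each failing model name to the column named in its Trino error."""
    summary = {}
    current_model = None
    for line in log_text.splitlines():
        failure = FAILURE_RE.search(line)
        if failure:
            current_model = failure.group(1)
            continue
        unknown = UNKNOWN_COLUMN_RE.search(line)
        if unknown and current_model is not None:
            summary[current_model] = unknown.group(1)
    return summary
```

Note that each error names a single column, so a model may still contain other untyped nulls beyond the one reported.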
, null as url
, null as image_url
Suggested change: replace `, null as url` and `, null as image_url` with `, cast(null as varchar) as url` and `, cast(null as varchar) as image_url` to fix the runtime error:
TrinoUserError(type=USER_ERROR, name=COLUMN_TYPE_UNKNOWN, message="line 6:5: Column type is unknown: image_url", query_id=20260601_132813_00237_gps54)
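To catch the remaining untyped nulls before the next Trino run, a small lint pass over the model SQL could help. This is a minimal sketch that assumes the leading-comma, one-projection-per-line style used in the snippets in this PR; `find_untyped_nulls` is a hypothetical helper and will miss nulls written in other layouts.

```python
import re

# A projection that is a bare `null as <alias>`, optionally preceded by a comma.
# Lines such as `, cast(null as varchar) as url` do not match because the
# expression starts with `cast(` rather than `null`.
BARE_NULL_RE = re.compile(r"^\s*,?\s*null\s+as\s+(\w+)", re.IGNORECASE | re.MULTILINE)


def find_untyped_nulls(sql: str) -> list:
    """Return the aliases of projections written as an uncast null."""
    return BARE_NULL_RE.findall(sql)
```

Running it over each file in the integrations/learn directory would list every alias that still needs an explicit cast.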
, {{ array_join('courses.course_topics', ', ') }} as topics
, course_instructors.instructors as instructors
, null as certification_type
, null as price
We can revisit these nulls later, but some of these fields are probably available in the edX course API.
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/11510
Description (What does it do?)
Introduces the `integrations/learn` dbt layer: stable application-level contract models that MIT Learn's Trino-pull ETL tasks consume directly via the `BaseTrinoETLTask` pattern. These models sit above the marts layer and expose a documented schema contract agreed on by both platform and application teams.

New dbt models (`src/ol_dbt/models/integrations/learn/`):
- `integrations__learn__ocw_courses`
- `integrations__learn__mitxonline_courses`
- `integrations__learn__mitxonline_programs`
- `integrations__learn__xpro_courses`
- `integrations__learn__xpro_programs`
- `integrations__learn__micromasters_programs`

All models satisfy the required column contract (`readable_id`, `title`, `last_modified`, `etl_source` NOT NULL) with `not_null` and `unique` dbt tests in the schema YAML.

Supporting changes:
- `dbt_project.yml`: registers the `integrations` layer with table materialization, `integrations` schema, and a `mit_learn_etl` Trino role grant
- `docs/learn_marts_contract.md`: schema contract spec covering required columns, grain, nullability rules, and valid `etl_source` enum values
- `dg_projects/learning_resources/CONTRIBUTING.md`: contributor guide covering local dev, Vault mock, the three asset patterns (REST webhook, sensor-partitioned, Trino-pull), and naming conventions

How can this be tested?
Once the `mit_learn_etl` Trino role is provisioned:

Additional Context
This PR delivers the data platform half of Phase 0 (Foundation) and Cohort 1 dbt work from `implementation_guide_01_db_catalog.md`. The MIT Learn application-side Trino-pull Celery tasks (in `mitodl/mit-learn`) will reference these views by their fully-qualified Trino names, e.g. `ol_warehouse_production.integrations.integrations__learn__ocw_courses`.

Note: MicroMasters programs use `current_timestamp` as `last_modified`, because the MicroMasters source DB has no last-modified column on programs. This is documented in the schema YAML and the contract doc.