Skip to content

C1: Create tfact_grade — course run grade fact table#2259

Open
quazi-h wants to merge 1 commit into
mainfrom
copilot/2125-tfact-grade
Open

C1: Create tfact_grade — course run grade fact table#2259
quazi-h wants to merge 1 commit into
mainfrom
copilot/2125-tfact-grade

Conversation

@quazi-h
Copy link
Copy Markdown
Contributor

@quazi-h quazi-h commented May 28, 2026

What are the relevant tickets?

Closes #2125
Epic: #2073

Description (What does it do?)

Adds tfact_grade — a new incremental transaction fact table in the dimensional star schema.

Grain: One row per (user × course run) grade record.

Platforms covered: MITx Online, MITx Pro, edX.org

Source models:

  • int__mitxonline__courserun_grades — filtered to courserun_platform = 'MITx Online' to exclude edX.org-hosted MITx Online course grades (which are captured via the edxorg source and would otherwise be double-counted)
  • int__mitxpro__courserun_grades
  • int__edxorg__mitx_courserun_grades — uses courserungrade_user_grade (column name differs from other platforms); timestamps unavailable so grade_date_key and grade_created_on are null for edX.org rows; grade_id is a surrogate from (user_id, courserun_readable_id)

Key implementation details:

  • Incremental strategy: delete+insert on grade_key; watermark computed via a pre-computed CTE per platform (avoids Trino correlated-subquery fan-out anti-pattern, consistent with tfact_certificate and tfact_enrollment)
  • edX.org rows are fully re-ingested on each incremental run (null timestamps; delete+insert is idempotent)
  • Defensive ROW_NUMBER dedup guard (Trino does not support QUALIFY)
  • Foreign keys: dim_user, dim_course_run (is_current = true), dim_platform, dim_date
  • edX.org rows include MicroMasters-linked grades (micromasters_program_id IS NOT NULL); a separate MicroMasters platform source can be added in a future phase

How can this be tested?

  1. Compile: cd src/ol_dbt && ol-dbt run --select tfact_grade (or dbt compile --select tfact_grade)
  2. Run against QA: ol-dbt run --select tfact_grade --target dev_qa
  3. Validate row counts and pass rates:
SELECT
    platform,
    COUNT(*) AS grade_count,
    ROUND(AVG(CAST(is_passing AS double)), 2) AS pass_rate
FROM ol_warehouse_production_dimensional.tfact_grade
GROUP BY platform;
-- Expected: pass_rate ~0.5–0.7 for mitxonline; edxorg rows have null grade_date_key
  1. Verify FK integrity (run dbt tests): dbt test --select tfact_grade

Additional Context

The platform-filtering fix for MITx Online (filtering to courserun_platform = 'MITx Online') was identified during a pre-implementation rubber duck review. int__mitxonline__courserun_grades contains grades for both MITx Online-hosted and edX.org-hosted course runs; without this filter, edX.org course grades would appear twice in the fact table (once from mitxonline, once from edxorg).

This model is a new leaf node — it consumes existing intermediates and has no impact on any existing downstream models (confirmed via OpenMetadata lineage analysis).

Implements C1 from epic #2073. Consolidates course run grade records
from MITx Online, MITx Pro, and edX.org into a new incremental fact
table in the dimensional layer.

- Filters int__mitxonline__courserun_grades to courserun_platform = 'MITx Online'
  to avoid double-counting edX.org-hosted MITx Online course grades
- edX.org uses courserungrade_user_grade (different column name) and a
  surrogate grade_id from (user_id, courserun_readable_id)
- Incremental watermark via pre-computed CTE to avoid Trino fan-out
- Defensive ROW_NUMBER dedup guard (QUALIFY not supported in Trino)
- Full YAML column docs and FK relationship tests in _fact_tables.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 28, 2026 17:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new incremental transaction fact table tfact_grade to the dimensional star schema, consolidating course run grades from MITx Online, MITx Pro, and edX.org. The model follows the patterns established by tfact_certificate and tfact_enrollment: per-platform CTEs are unioned, joined to dim_user/dim_course_run/dim_platform, and incrementally loaded via delete+insert keyed on a surrogate grade_key. A pre-computed per-platform watermark CTE avoids correlated-subquery fan-out, and a ROW_NUMBER dedup guard substitutes for Trino's missing QUALIFY.

Changes:

  • New model tfact_grade.sql with platform-aware FK joins and an incremental watermark per platform.
  • New YAML entry in _fact_tables.yml describing the table, grain, FK relationships, and column-level tests.
  • MITx Online source filtered to courserun_platform = 'MITx Online' to prevent double-counting edX.org-hosted MITx Online grades.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/ol_dbt/models/dimensional/tfact_grade.sql New incremental fact model unioning mitxonline/mitxpro/edxorg grades with FK lookups, watermarked incremental filter, and ROW_NUMBER dedup.
src/ol_dbt/models/dimensional/_fact_tables.yml Documents tfact_grade columns, relationships, and tests, including unique combination of (grade_id, platform).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@rachellougee rachellougee self-assigned this May 29, 2026
Copy link
Copy Markdown
Contributor

@rachellougee rachellougee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbt build fine but a bit slow. I left a comment but not a blocker.

Comment on lines +117 to +126
on (
-- mitxonline/mitxpro: join on integer source_id + platform
(combined_grades.platform in ('mitxonline', 'mitxpro')
and combined_grades.courserun_id = dim_course_run.source_id
and combined_grades.platform = dim_course_run.platform)
-- edxorg: join on readable_id + platform to prevent fan-out across platforms
or (combined_grades.platform = 'edxorg'
and combined_grades.courserun_readable_id = dim_course_run.courserun_readable_id
and dim_course_run.platform = 'edxorg')
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change it now, but the OR clause in the dim_course_run join here has a performance cost . This was likely copied from tfact_enrollment and tfact_certificate which have the same pattern.

All three platforms could join dim_course_run on (courserun_readable_id, platform) if we populate courserun_readable_id in above CTEs, the same natural key used to build courserun_pk, eliminating the OR entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

C1: Create tfact_grade — course run grade fact table

3 participants