Skip to content

Migrate Superset charts from virtual to physical dbt datasets#1887

Closed
blarghmatey wants to merge 4 commits intomainfrom
superset_virtual_to_physical
Closed

Migrate Superset charts from virtual to physical dbt datasets#1887
blarghmatey wants to merge 4 commits intomainfrom
superset_virtual_to_physical

Conversation

@blarghmatey
Copy link
Copy Markdown
Member

@blarghmatey blarghmatey commented Feb 5, 2026

Summary

Migrates 47 Superset charts from virtual datasets (with inline SQL) to physical dbt-managed datasets, addressing https://github.com/mitodl/hq/issues/7644.

Changes

Charts Updated (39 files)

Updated dataset_uuid references in chart YAML files to point to physical dbt datasets instead of virtual datasets:

Data Detail Charts (16 charts)

  • Discussion events: Data_Detail_Discusstfact_discussion_events
  • Navigation events: Data_Detail_Navtfact_course_navigation_events
  • Problem events: Data_Detail_Problemstfact_problem_events
  • Video events: Data_Detail_Videotfact_video_events

Engagement Charts (10 charts)

  • Problem engagement: Test datasets → afact_problem_engagement
  • Page engagement: page_engagement_viewsafact_course_page_engagement

Enrollment Charts (8 charts)

  • Course enrollments: Virtual views → marts__combined_course_enrollment_detail
  • Program enrollments: Program_Enrollment_with_usermarts__combined_program_enrollment_detail

Video Engagement (5 charts)

  • marts__combined_video_engagements_w_video_countsmarts__combined_video_engagements

Demographics (8 charts)

  • Learner_Demographics_And_Cert_Infolearner_demographics_and_cert_info

Dataset Converted (1 file)

  • marts__mitxonline_course_engagements_daily: Converted from virtual to physical by removing SQL query logic, keeping same UUID

Benefits

Performance: Physical datasets are pre-computed by dbt (faster queries, better caching)
Maintainability: Dataset logic now in version-controlled dbt models
Single Source of Truth: All transformations managed in dbt, not scattered across virtual datasets
Flexibility: Charts can leverage Superset's native JOIN/aggregation instead of baked-in logic
Metadata: Physical datasets support lineage tracking and impact analysis

Migration Strategy

Virtual datasets primarily added JOINs or aggregations to base physical tables. Rather than replicate this in new physical datasets, charts now use the normalized dbt base tables directly. Superset can handle JOINs/aggregations at query time with caching and optimization.

Post-Merge Cleanup

After validation, delete these 15 orphaned virtual dataset YAML files (no charts reference them):

cd src/ol_superset/assets/datasets/Trino/

# Data Detail views (4 files)
rm Data_Detail_Discuss_5bab95fd-d465-44f3-bed8-40c559958fc4.yaml
rm Data_Detail_Nav_32e283f5-fc02-403a-9c92-b8286352cdba.yaml
rm Data_Detail_Problems_dc0886e8-1861-4bd9-a694-25a063adcf83.yaml
rm Data_Detail_Video_3967f60c-2d80-41ad-8245-67dbf555bd82.yaml

# Test datasets (3 files)
rm engagement_problem_summary_test1_d763f86e-5c1b-4996-9b35-f8cb13947f11.yaml
rm engagement_problem_summary_test2_a2198d6f-df2d-492c-b30b-9e3279466bdb.yaml
rm engagment_problem_test1_b9e94ba2-f399-4210-b042-b8aaa4039f58.yaml

# Video engagement views (2 files)
rm marts__combined_video_engagements_w_video_counts_42f063d3-3c06-4834-b993-777fdbc6dea7.yaml
rm marts__mitxonline_video_engagements_w_video_counts_bf17ec0d-2a0a-4f97-ab5b-ae2df862d4b5.yaml

# Page engagement view (1 file)
rm page_engagement_views_c3182341-5246-40c2-8f04-885369c38093.yaml

# Enrollment views (2 files)
rm Enrollment_Activity_Counts_Dataset_a9301703-0ac1-4eb7-a409-f65da5d8cba1.yaml
rm combined_enrollments_with_gender_and_date_f3e517dd-4012-441e-8dfe-edaef1318000.yaml

# Complex views (2 files)
rm combined_learners_enrollment_detail_b579034e-2b79-4d3a-ba84-94c9fcfa0cc5.yaml
rm Program_Enrollment_with_user_cc496da8-03f6-43ea-9b6c-900ba695e4b6.yaml

# Demographics (1 file)
rm Learner_Demographics_And_Cert_Info_232090f5-bb02-4327-9067-2ae49b64074b.yaml

Note: These files are safe to delete because:

  1. No charts reference them after this migration
  2. All necessary data exists in physical dbt models
  3. Any specialized logic can be recreated in dbt if needed

Validation

All 15 remaining virtual datasets are orphaned (no charts reference them). After validation:

  1. Test dashboards in QA: https://bi-qa.ol.mit.edu
  2. Run SQL validation queries (see session docs)
  3. Clean up orphaned virtual dataset YAML files (commands above)

Testing Checklist

  • All dashboards load without errors in QA
  • Chart metrics match expected values
  • SQL validation queries pass (physical ≥ virtual rows)
  • No performance regressions
  • 1-2 week validation period completed
  • Orphaned virtual dataset files deleted (15 files listed above)

Related

Copy link
Copy Markdown
Contributor

@quazi-h quazi-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me and we observed the correct migration in QA with virtual datasets getting replaced by their physical counterparts.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates 47 Superset charts from virtual datasets (with inline SQL) to physical dbt-managed datasets to improve performance, maintainability, and metadata tracking. It also converts one virtual dataset (marts__mitxonline_course_engagements_daily) to physical and updates the database UUID across all 66+ dataset files.

Changes:

  • Updated dataset_uuid references in 39 chart YAML files to point to physical dbt datasets
  • Converted 1 virtual dataset to physical by removing SQL query logic
  • Updated database_uuid from 8702691f-d666-4dac-943b-9382c02233e3 to 9a22a54c-8b2f-4c66-a866-3f23812ec929 across 66 dataset files

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/ol_superset/assets/databases/Trino.yaml Updated database UUID to new identifier
src/ol_superset/assets/datasets/Trino/*.yaml (66 files) Updated database_uuid reference in all dataset definitions
src/ol_superset/assets/datasets/Trino/marts__mitxonline_course_engagements_daily_e7330108-bdfa-4ff0-8ca3-084d7e8cbda8.yaml Converted from virtual to physical dataset by removing SQL query
src/ol_superset/assets/charts/Video_Engagement_*.yaml (5 files) Updated dataset references for video engagement visualizations
src/ol_superset/assets/charts/Program_Enrollment_*.yaml (2 files) Updated dataset references for program enrollment charts
src/ol_superset/assets/charts/Problem_Engagement_*.yaml (5 files) Updated dataset references for problem engagement charts
src/ol_superset/assets/charts/Page_Engagement_*.yaml (4 files) Updated dataset references for page engagement charts
src/ol_superset/assets/charts/Enrollment.yaml (8 files) Updated dataset references for course enrollment charts
src/ol_superset/assets/charts/Data_Detail_*.yaml (4 files) Updated dataset references for data detail charts
src/ol_superset/assets/charts/Learner_*.yaml (8 files) Updated dataset references for learner demographics and other learner charts
src/ol_superset/assets/charts/Course_Run_Metadata_*.yaml (3 files) Updated dataset references for course metadata visualizations

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

uuid: cf8305f6-fe5d-4791-8185-3051b76f4738
version: 1.0.0
dataset_uuid: 32e283f5-fc02-403a-9c92-b8286352cdba
dataset_uuid: 2189d254-14fd-4b38-85a8-6af918546dd0
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart references column course_title (line 18) which does not exist in the new physical dataset tfact_course_navigation_events (UUID: 2189d254-14fd-4b38-85a8-6af918546dd0). The old virtual dataset Data_Detail_Nav had this column added via SQL JOIN, but the new base table does not include it. This chart will fail when executed.

Copilot uses AI. Check for mistakes.
uuid: 39c99c2d-1290-4d37-9b0f-eef559172947
version: 1.0.0
dataset_uuid: 5bab95fd-d465-44f3-bed8-40c559958fc4
dataset_uuid: 368125b7-c1c9-4f72-a360-9547536b4945
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart references column course_title (line 17) which does not exist in the new physical dataset tfact_discussion_events (UUID: 368125b7-c1c9-4f72-a360-9547536b4945). The old virtual dataset Data_Detail_Discuss had this column added via SQL JOIN, but the new base table does not include it. This chart will fail when executed.

Copilot uses AI. Check for mistakes.
uuid: ebcf2551-ca8e-4fce-b04b-55e382b1c963
version: 1.0.0
dataset_uuid: 42f063d3-3c06-4834-b993-777fdbc6dea7
dataset_uuid: a7af739c-62dd-4be0-a46f-5771a7e0c466
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart references columns user_watched_video_count, total_video_count, and max_video_index in its SQL metrics (lines 208, 211, 215, 222, 235, 238, 242), but these columns do not exist in the new physical dataset marts__combined_video_engagements (UUID: a7af739c-62dd-4be0-a46f-5771a7e0c466).

The old virtual dataset marts__combined_video_engagements_w_video_counts computed these aggregated columns via SQL, but the new physical dataset is the base table without these pre-aggregated fields. The chart's SQL expressions need to be updated to either:

  1. Calculate these aggregations dynamically from the base table columns, or
  2. The physical dataset needs to include these pre-aggregated columns

Without this fix, the chart will fail when executed.

Copilot uses AI. Check for mistakes.
uuid: 9bdb6168-d5c9-4651-af36-556e29f3b5db
version: 1.0.0
dataset_uuid: 42f063d3-3c06-4834-b993-777fdbc6dea7
dataset_uuid: a7af739c-62dd-4be0-a46f-5771a7e0c466
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart references columns user_watched_video_count and total_video_count in its SQL metrics which do not exist in the new physical dataset marts__combined_video_engagements (UUID: a7af739c-62dd-4be0-a46f-5771a7e0c466). The old virtual dataset had these as pre-aggregated columns, but the new base table does not. This chart will fail when executed.

Copilot uses AI. Check for mistakes.
uuid: 8cd66663-f93f-4751-be11-14c5d4c28ed5
version: 1.0.0
dataset_uuid: 3967f60c-2d80-41ad-8245-67dbf555bd82
dataset_uuid: 2a608e5d-c828-4457-83c2-0912a1534430
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart references columns course_title and video_name (lines 16, 21) which do not exist in the new physical dataset tfact_video_events (UUID: 2a608e5d-c828-4457-83c2-0912a1534430). The old virtual dataset Data_Detail_Video had these columns which were added via SQL JOINs to dim_video and int__combined__course_runs tables, but the new base table does not include them. This chart will fail when executed.

Copilot uses AI. Check for mistakes.
uuid: 5e8a4f03-1504-4f8b-9990-de300d73ff93
version: 1.0.0
dataset_uuid: dc0886e8-1861-4bd9-a694-25a063adcf83
dataset_uuid: 5fc7287d-1bf8-4fc2-b008-c80f84e3a6d9
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chart references columns course_title and problem_name (lines 17, 19) which do not exist in the new physical dataset tfact_problem_events (UUID: 5fc7287d-1bf8-4fc2-b008-c80f84e3a6d9). The old virtual dataset Data_Detail_Problems had these columns added via SQL JOINs, but the new base table does not include them. This chart will fail when executed.

Copilot uses AI. Check for mistakes.
blarghmatey added a commit that referenced this pull request Feb 5, 2026
- Created 5 new dbt reporting models in models/reporting/:
  - discussion_events_detail: tfact_discussion_events with course_title
  - navigation_events_detail: tfact_course_navigation_events with course_title
  - problem_events_detail: tfact_problem_events with course_title, problem_name
  - video_events_detail: tfact_video_events with course_title, video_name
  - video_engagement_by_user_chapter: Aggregated video metrics (user_watched_video_count, total_video_count, max_video_index)

- Converted 5 virtual datasets to physical by removing SQL and pointing to new dbt models:
  - Data_Detail_Discuss -> discussion_events_detail
  - Data_Detail_Nav -> navigation_events_detail
  - Data_Detail_Problems -> problem_events_detail
  - Data_Detail_Video -> video_events_detail
  - marts__combined_video_engagements_w_video_counts -> video_engagement_by_user_chapter

These models provide denormalized views for dashboard consumption while maintaining dimensional modeling principles in the base tfact tables.

Addresses PR feedback on #1887
Comment thread src/ol_dbt/models/reporting/navigation_events_detail.sql Outdated
blarghmatey added a commit that referenced this pull request Feb 5, 2026
The WHERE clause filtering to only 'edxorg' and 'mitxonline' platforms was
excluding valid navigation events from 'mitxpro' and 'residential' platforms.

This filter was an artifact from the original virtual dataset that may have been
created for a specific dashboard need. However, for a base reporting model, it
should include all platforms for completeness and consistency with other detail
models (discussion_events_detail, problem_events_detail, video_events_detail).

If specific dashboards need platform filtering, they should apply it in
Superset filters rather than having it hardcoded in the base dataset.

Addresses PR review feedback on #1887
@KatelynGit KatelynGit self-assigned this Feb 10, 2026
Copy link
Copy Markdown
Contributor

@KatelynGit KatelynGit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the dashboards are broken in QA as a result of the changes. For the Learner Engagement dashboard the following charts aren't working in QA: Problem Engagement by Courserun, Page Engagement By Section, Problem Engagement by Section, Page Engagement By Subsection, Problem Engagement By Subsection, Learner Problem Engagement, Learner Page Engagement. For the Course Enrollment Activity Dashboard the follow charts aren't working in QA now: Enrollment Activity Over Time, Enrollment Geography. The Learner Demographic and Course data dashboard and chart are not working in QA now. We may need to modify the columns to fix this. I'll continue reviewing the other items that need to be tested but wanted to point these out first.

@KatelynGit
Copy link
Copy Markdown
Contributor

KatelynGit commented Feb 10, 2026

In the issue description above the text appears:
"Affected dashboards: Course Run Metadata, Data Detail Discussion, Learner Demographics, Video Engagement, Problem Engagement, Program Enrollment" but none of these are dashboards so that part of the issue can be ignored

@KatelynGit
Copy link
Copy Markdown
Contributor

One of the new files/table created by this PR is suppose to replace marts__combined_video_engagements_w_video_counts however we already have a reporting table in prod to replace this dataset so it would be a duplicate.

- Updated 39 charts to use physical datasets instead of virtual datasets
- Converted marts__mitxonline_course_engagements_daily from virtual to physical
- Total impact: 47 charts now use physical dbt-managed datasets

Mappings:
- Enrollment charts (19) -> marts__combined_course_enrollment_detail
- Data Detail charts (4) -> tfact tables (discussion, navigation, problems, video)
- Page engagement charts (4) -> afact_course_page_engagement
- Problem engagement charts (5) -> afact_problem_engagement
- Video engagement charts (5) -> marts__combined_video_engagements, mitxonline_video_engagements_w_video_counts
- Program enrollment charts (2) -> marts__combined_program_enrollment_detail
- Demographics chart (1) -> learner_demographics_and_cert_info
- MITxOnline engagement charts (8) -> marts__mitxonline_course_engagements_daily (converted to physical)

Benefits:
- Improved performance (pre-computed physical tables)
- Better metadata tracking (last updated timestamps)
- Single source of truth in dbt

Addresses #7644
- Created 5 new dbt reporting models in models/reporting/:
  - discussion_events_detail: tfact_discussion_events with course_title
  - navigation_events_detail: tfact_course_navigation_events with course_title
  - problem_events_detail: tfact_problem_events with course_title, problem_name
  - video_events_detail: tfact_video_events with course_title, video_name
  - video_engagement_by_user_chapter: Aggregated video metrics (user_watched_video_count, total_video_count, max_video_index)

- Converted 5 virtual datasets to physical by removing SQL and pointing to new dbt models:
  - Data_Detail_Discuss -> discussion_events_detail
  - Data_Detail_Nav -> navigation_events_detail
  - Data_Detail_Problems -> problem_events_detail
  - Data_Detail_Video -> video_events_detail
  - marts__combined_video_engagements_w_video_counts -> video_engagement_by_user_chapter

These models provide denormalized views for dashboard consumption while maintaining dimensional modeling principles in the base tfact tables.

Addresses PR feedback on #1887
The WHERE clause filtering to only 'edxorg' and 'mitxonline' platforms was
excluding valid navigation events from 'mitxpro' and 'residential' platforms.

This filter was an artifact from the original virtual dataset that may have been
created for a specific dashboard need. However, for a base reporting model, it
should include all platforms for completeness and consistency with other detail
models (discussion_events_detail, problem_events_detail, video_events_detail).

If specific dashboards need platform filtering, they should apply it in
Superset filters rather than having it hardcoded in the base dataset.

Addresses PR review feedback on #1887
@blarghmatey blarghmatey force-pushed the superset_virtual_to_physical branch from e00c98a to 1db2f7b Compare February 13, 2026 19:10
Comment thread src/ol_dbt/models/reporting/program_enrollment_demographics.sql
@blarghmatey blarghmatey closed this Mar 9, 2026
@blarghmatey blarghmatey deleted the superset_virtual_to_physical branch March 9, 2026 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants