Migrate Superset charts from virtual to physical dbt datasets#1887
Migrate Superset charts from virtual to physical dbt datasets#1887blarghmatey wants to merge 4 commits intomainfrom
Conversation
c66c8f0 to
76068da
Compare
quazi-h
left a comment
There was a problem hiding this comment.
Changes look good to me and we observed the correct migration in QA with virtual datasets getting replaced by their physical counterparts.
There was a problem hiding this comment.
Pull request overview
This PR migrates 47 Superset charts from virtual datasets (with inline SQL) to physical dbt-managed datasets to improve performance, maintainability, and metadata tracking. It also converts one virtual dataset (marts__mitxonline_course_engagements_daily) to physical and updates the database UUID across all 66+ dataset files.
Changes:
- Updated
dataset_uuidreferences in 39 chart YAML files to point to physical dbt datasets - Converted 1 virtual dataset to physical by removing SQL query logic
- Updated
database_uuidfrom8702691f-d666-4dac-943b-9382c02233e3to9a22a54c-8b2f-4c66-a866-3f23812ec929across 66 dataset files
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ol_superset/assets/databases/Trino.yaml | Updated database UUID to new identifier |
| src/ol_superset/assets/datasets/Trino/*.yaml (66 files) | Updated database_uuid reference in all dataset definitions |
| src/ol_superset/assets/datasets/Trino/marts__mitxonline_course_engagements_daily_e7330108-bdfa-4ff0-8ca3-084d7e8cbda8.yaml | Converted from virtual to physical dataset by removing SQL query |
| src/ol_superset/assets/charts/Video_Engagement_*.yaml (5 files) | Updated dataset references for video engagement visualizations |
| src/ol_superset/assets/charts/Program_Enrollment_*.yaml (2 files) | Updated dataset references for program enrollment charts |
| src/ol_superset/assets/charts/Problem_Engagement_*.yaml (5 files) | Updated dataset references for problem engagement charts |
| src/ol_superset/assets/charts/Page_Engagement_*.yaml (4 files) | Updated dataset references for page engagement charts |
| src/ol_superset/assets/charts/Enrollment.yaml (8 files) | Updated dataset references for course enrollment charts |
| src/ol_superset/assets/charts/Data_Detail_*.yaml (4 files) | Updated dataset references for data detail charts |
| src/ol_superset/assets/charts/Learner_*.yaml (8 files) | Updated dataset references for learner demographics and other learner charts |
| src/ol_superset/assets/charts/Course_Run_Metadata_*.yaml (3 files) | Updated dataset references for course metadata visualizations |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| uuid: cf8305f6-fe5d-4791-8185-3051b76f4738 | ||
| version: 1.0.0 | ||
| dataset_uuid: 32e283f5-fc02-403a-9c92-b8286352cdba | ||
| dataset_uuid: 2189d254-14fd-4b38-85a8-6af918546dd0 |
There was a problem hiding this comment.
The chart references column course_title (line 18) which does not exist in the new physical dataset tfact_course_navigation_events (UUID: 2189d254-14fd-4b38-85a8-6af918546dd0). The old virtual dataset Data_Detail_Nav had this column added via SQL JOIN, but the new base table does not include it. This chart will fail when executed.
| uuid: 39c99c2d-1290-4d37-9b0f-eef559172947 | ||
| version: 1.0.0 | ||
| dataset_uuid: 5bab95fd-d465-44f3-bed8-40c559958fc4 | ||
| dataset_uuid: 368125b7-c1c9-4f72-a360-9547536b4945 |
There was a problem hiding this comment.
The chart references column course_title (line 17) which does not exist in the new physical dataset tfact_discussion_events (UUID: 368125b7-c1c9-4f72-a360-9547536b4945). The old virtual dataset Data_Detail_Discuss had this column added via SQL JOIN, but the new base table does not include it. This chart will fail when executed.
| uuid: ebcf2551-ca8e-4fce-b04b-55e382b1c963 | ||
| version: 1.0.0 | ||
| dataset_uuid: 42f063d3-3c06-4834-b993-777fdbc6dea7 | ||
| dataset_uuid: a7af739c-62dd-4be0-a46f-5771a7e0c466 |
There was a problem hiding this comment.
The chart references columns user_watched_video_count, total_video_count, and max_video_index in its SQL metrics (lines 208, 211, 215, 222, 235, 238, 242), but these columns do not exist in the new physical dataset marts__combined_video_engagements (UUID: a7af739c-62dd-4be0-a46f-5771a7e0c466).
The old virtual dataset marts__combined_video_engagements_w_video_counts computed these aggregated columns via SQL, but the new physical dataset is the base table without these pre-aggregated fields. The chart's SQL expressions need to be updated to either:
- Calculate these aggregations dynamically from the base table columns, or
- The physical dataset needs to include these pre-aggregated columns
Without this fix, the chart will fail when executed.
| uuid: 9bdb6168-d5c9-4651-af36-556e29f3b5db | ||
| version: 1.0.0 | ||
| dataset_uuid: 42f063d3-3c06-4834-b993-777fdbc6dea7 | ||
| dataset_uuid: a7af739c-62dd-4be0-a46f-5771a7e0c466 |
There was a problem hiding this comment.
The chart references columns user_watched_video_count and total_video_count in its SQL metrics which do not exist in the new physical dataset marts__combined_video_engagements (UUID: a7af739c-62dd-4be0-a46f-5771a7e0c466). The old virtual dataset had these as pre-aggregated columns, but the new base table does not. This chart will fail when executed.
| uuid: 8cd66663-f93f-4751-be11-14c5d4c28ed5 | ||
| version: 1.0.0 | ||
| dataset_uuid: 3967f60c-2d80-41ad-8245-67dbf555bd82 | ||
| dataset_uuid: 2a608e5d-c828-4457-83c2-0912a1534430 |
There was a problem hiding this comment.
The chart references columns course_title and video_name (lines 16, 21) which do not exist in the new physical dataset tfact_video_events (UUID: 2a608e5d-c828-4457-83c2-0912a1534430). The old virtual dataset Data_Detail_Video had these columns which were added via SQL JOINs to dim_video and int__combined__course_runs tables, but the new base table does not include them. This chart will fail when executed.
| uuid: 5e8a4f03-1504-4f8b-9990-de300d73ff93 | ||
| version: 1.0.0 | ||
| dataset_uuid: dc0886e8-1861-4bd9-a694-25a063adcf83 | ||
| dataset_uuid: 5fc7287d-1bf8-4fc2-b008-c80f84e3a6d9 |
There was a problem hiding this comment.
The chart references columns course_title and problem_name (lines 17, 19) which do not exist in the new physical dataset tfact_problem_events (UUID: 5fc7287d-1bf8-4fc2-b008-c80f84e3a6d9). The old virtual dataset Data_Detail_Problems had these columns added via SQL JOINs, but the new base table does not include them. This chart will fail when executed.
- Created 5 new dbt reporting models in models/reporting/: - discussion_events_detail: tfact_discussion_events with course_title - navigation_events_detail: tfact_course_navigation_events with course_title - problem_events_detail: tfact_problem_events with course_title, problem_name - video_events_detail: tfact_video_events with course_title, video_name - video_engagement_by_user_chapter: Aggregated video metrics (user_watched_video_count, total_video_count, max_video_index) - Converted 5 virtual datasets to physical by removing SQL and pointing to new dbt models: - Data_Detail_Discuss -> discussion_events_detail - Data_Detail_Nav -> navigation_events_detail - Data_Detail_Problems -> problem_events_detail - Data_Detail_Video -> video_events_detail - marts__combined_video_engagements_w_video_counts -> video_engagement_by_user_chapter These models provide denormalized views for dashboard consumption while maintaining dimensional modeling principles in the base tfact tables. Addresses PR feedback on #1887
The WHERE clause filtering to only 'edxorg' and 'mitxonline' platforms was excluding valid navigation events from 'mitxpro' and 'residential' platforms. This filter was an artifact from the original virtual dataset that may have been created for a specific dashboard need. However, for a base reporting model, it should include all platforms for completeness and consistency with other detail models (discussion_events_detail, problem_events_detail, video_events_detail). If specific dashboards need platform filtering, they should apply it in Superset filters rather than having it hardcoded in the base dataset. Addresses PR review feedback on #1887
There was a problem hiding this comment.
Some of the dashboards are broken in QA as a result of the changes. For the Learner Engagement dashboard the following charts aren't working in QA: Problem Engagement by Courserun, Page Engagement By Section, Problem Engagement by Section, Page Engagement By Subsection, Problem Engagement By Subsection, Learner Problem Engagement, Learner Page Engagement. For the Course Enrollment Activity Dashboard the follow charts aren't working in QA now: Enrollment Activity Over Time, Enrollment Geography. The Learner Demographic and Course data dashboard and chart are not working in QA now. We may need to modify the columns to fix this. I'll continue reviewing the other items that need to be tested but wanted to point these out first.
|
In the issue description above the text appears: |
|
One of the new files/table created by this PR is suppose to replace marts__combined_video_engagements_w_video_counts however we already have a reporting table in prod to replace this dataset so it would be a duplicate. |
- Updated 39 charts to use physical datasets instead of virtual datasets - Converted marts__mitxonline_course_engagements_daily from virtual to physical - Total impact: 47 charts now use physical dbt-managed datasets Mappings: - Enrollment charts (19) -> marts__combined_course_enrollment_detail - Data Detail charts (4) -> tfact tables (discussion, navigation, problems, video) - Page engagement charts (4) -> afact_course_page_engagement - Problem engagement charts (5) -> afact_problem_engagement - Video engagement charts (5) -> marts__combined_video_engagements, mitxonline_video_engagements_w_video_counts - Program enrollment charts (2) -> marts__combined_program_enrollment_detail - Demographics chart (1) -> learner_demographics_and_cert_info - MITxOnline engagement charts (8) -> marts__mitxonline_course_engagements_daily (converted to physical) Benefits: - Improved performance (pre-computed physical tables) - Better metadata tracking (last updated timestamps) - Single source of truth in dbt Addresses #7644
- Created 5 new dbt reporting models in models/reporting/: - discussion_events_detail: tfact_discussion_events with course_title - navigation_events_detail: tfact_course_navigation_events with course_title - problem_events_detail: tfact_problem_events with course_title, problem_name - video_events_detail: tfact_video_events with course_title, video_name - video_engagement_by_user_chapter: Aggregated video metrics (user_watched_video_count, total_video_count, max_video_index) - Converted 5 virtual datasets to physical by removing SQL and pointing to new dbt models: - Data_Detail_Discuss -> discussion_events_detail - Data_Detail_Nav -> navigation_events_detail - Data_Detail_Problems -> problem_events_detail - Data_Detail_Video -> video_events_detail - marts__combined_video_engagements_w_video_counts -> video_engagement_by_user_chapter These models provide denormalized views for dashboard consumption while maintaining dimensional modeling principles in the base tfact tables. Addresses PR feedback on #1887
The WHERE clause filtering to only 'edxorg' and 'mitxonline' platforms was excluding valid navigation events from 'mitxpro' and 'residential' platforms. This filter was an artifact from the original virtual dataset that may have been created for a specific dashboard need. However, for a base reporting model, it should include all platforms for completeness and consistency with other detail models (discussion_events_detail, problem_events_detail, video_events_detail). If specific dashboards need platform filtering, they should apply it in Superset filters rather than having it hardcoded in the base dataset. Addresses PR review feedback on #1887
e00c98a to
1db2f7b
Compare
Summary
Migrates 47 Superset charts from virtual datasets (with inline SQL) to physical dbt-managed datasets, addressing https://github.com/mitodl/hq/issues/7644.
Changes
Charts Updated (39 files)
Updated
dataset_uuidreferences in chart YAML files to point to physical dbt datasets instead of virtual datasets:Data Detail Charts (16 charts)
Data_Detail_Discuss→tfact_discussion_eventsData_Detail_Nav→tfact_course_navigation_eventsData_Detail_Problems→tfact_problem_eventsData_Detail_Video→tfact_video_eventsEngagement Charts (10 charts)
afact_problem_engagementpage_engagement_views→afact_course_page_engagementEnrollment Charts (8 charts)
marts__combined_course_enrollment_detailProgram_Enrollment_with_user→marts__combined_program_enrollment_detailVideo Engagement (5 charts)
marts__combined_video_engagements_w_video_counts→marts__combined_video_engagementsDemographics (8 charts)
Learner_Demographics_And_Cert_Info→learner_demographics_and_cert_infoDataset Converted (1 file)
marts__mitxonline_course_engagements_daily: Converted from virtual to physical by removing SQL query logic, keeping same UUIDBenefits
✅ Performance: Physical datasets are pre-computed by dbt (faster queries, better caching)
✅ Maintainability: Dataset logic now in version-controlled dbt models
✅ Single Source of Truth: All transformations managed in dbt, not scattered across virtual datasets
✅ Flexibility: Charts can leverage Superset's native JOIN/aggregation instead of baked-in logic
✅ Metadata: Physical datasets support lineage tracking and impact analysis
Migration Strategy
Virtual datasets primarily added JOINs or aggregations to base physical tables. Rather than replicate this in new physical datasets, charts now use the normalized dbt base tables directly. Superset can handle JOINs/aggregations at query time with caching and optimization.
Post-Merge Cleanup
After validation, delete these 15 orphaned virtual dataset YAML files (no charts reference them):
Note: These files are safe to delete because:
Validation
All 15 remaining virtual datasets are orphaned (no charts reference them). After validation:
Testing Checklist
Related