Conversation
- Create reporting model joining afact_course_page_engagement with course content hierarchy and user data - Join dim_course_content for unit, subsection, and section titles and indexes - Join dim_user for user email and full name - Join int__combined__course_runs for course title - Add full documentation and tests to _reporting__models.yml - Replaces Superset virtual dataset with physical table
There was a problem hiding this comment.
Pull request overview
This PR adds a new physical dbt reporting model page_engagement_views_report to replace a virtual dataset in Superset, improving query performance. The model aggregates page engagement data by joining course page engagement facts with course content hierarchy and user information.
Changes:
- Created
page_engagement_views_report.sqlto materialize page engagement data with user, unit, subsection, section, and course details - Added model documentation and column tests in
_reporting__models.yml
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/ol_dbt/models/reporting/page_engagement_views_report.sql | New SQL model joining afact_course_page_engagement with dim_course_content, dim_user, and int__combined__course_runs to create a denormalized reporting view |
| src/ol_dbt/models/reporting/_reporting__models.yml | Added model and column documentation with basic not_null tests for the new reporting model |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| inner join section_blocks | ||
| on page_engagement.chapter_block_fk = section_blocks.block_id | ||
| inner join dim_user | ||
| on page_engagement.openedx_user_id = dim_user.mitxonline_openedx_user_id |
There was a problem hiding this comment.
The join to dim_user only matches users from the 'mitxonline' platform using mitxonline_openedx_user_id. This will fail to join users from other platforms like 'edxorg', 'mitxpro', or 'residential'.
The existing learner_engagement_report.sql (lines 82-89) shows the correct pattern: use conditional joins based on the platform field to match the appropriate platform-specific user ID column. For example, use mitxonline_openedx_user_id when platform='mitxonline' and edxorg_openedx_user_id when platform='edxorg'.
| on page_engagement.openedx_user_id = dim_user.mitxonline_openedx_user_id | |
| on ( | |
| (page_engagement.platform = 'mitxonline' | |
| and page_engagement.openedx_user_id = dim_user.mitxonline_openedx_user_id) | |
| or (page_engagement.platform = 'edxorg' | |
| and page_engagement.openedx_user_id = dim_user.edxorg_openedx_user_id) | |
| or (page_engagement.platform = 'mitxpro' | |
| and page_engagement.openedx_user_id = dim_user.mitxpro_openedx_user_id) | |
| or (page_engagement.platform = 'residential' | |
| and page_engagement.openedx_user_id = dim_user.residential_openedx_user_id) | |
| ) |
| group by | ||
| course_title | ||
| , courserun_readable_id |
There was a problem hiding this comment.
The GROUP BY in the course_runs CTE is unnecessary since int__combined__course_runs already contains one row per course run with unique courserun_readable_id values. This adds overhead without changing the result.
Consider removing the GROUP BY clause and using a simple SELECT DISTINCT or just selecting the columns directly if the source guarantees uniqueness.
| group by | |
| course_title | |
| , courserun_readable_id |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| , subsection_blocks as ( | ||
| select * from {{ ref('dim_course_content') }} | ||
| where is_latest = true | ||
| ) | ||
|
|
||
| , section_blocks as ( | ||
| select * from {{ ref('dim_course_content') }} | ||
| where is_latest = true | ||
| ) |
There was a problem hiding this comment.
The subsection_blocks and section_blocks CTEs are redundant - they both select all columns from dim_course_content with only the is_latest = true filter. This creates unnecessary code duplication and could lead to maintenance issues if filters need to be updated.
Consider consolidating these into a single CTE (e.g., course_content_blocks) and reusing it for both joins, or selecting only the specific columns needed for each join to make the purpose clearer.
| group by | ||
| course_title | ||
| , courserun_readable_id | ||
| ) | ||
|
|
There was a problem hiding this comment.
The GROUP BY clause is unnecessary here because courserun_readable_id is already unique in int__combined__course_runs (confirmed by the unique test on line 235 of _combined_models.yml). The relationship between courserun_readable_id and course_title is 1:1.
Removing the GROUP BY will improve query performance by avoiding an unnecessary aggregation operation.
| group by | |
| course_title | |
| , courserun_readable_id | |
| ) | |
| ) |
|
@copilot open a new pull request to apply changes based on the comments in this thread |
…ers, and uniqueness test (#1921) * Initial plan * Apply review feedback to page_engagement_views_report - Remove unnecessary GROUP BY from course_runs CTE - Add block_category filters to subsection_blocks (sequential) and section_blocks (chapter) CTEs - Fix dim_user join to use platform-conditional logic for all platforms - Add compound uniqueness test on [user_email, courserun_readable_id, unit_title] Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com>
|
@copilot review this branch with the new changes |
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
|
@copilot please open pull request to apply changes based on the comments in this thread |
rachellougee
left a comment
There was a problem hiding this comment.
I was unable to build the dbt model run due to a performance issue with the join and had to stop the query before it ran too long. Please see the comment below and review the uniqueness test as well
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Rachel Lougee <rlougee@mit.edu>
SQL fixes: - Remove duplicate WHERE clauses in subsection_blocks and section_blocks CTEs - Add block_fk to SELECT for proper unique identification Documentation/Testing fixes: - Add block_fk column documentation to YAML - Change uniqueness test from unit_title to block_fk (prevents false failures) - Remove duplicate test definition Addresses Sentry feedback on uniqueness test bug
Resolved conflict in _reporting__models.yml by keeping all four model definitions: - page_engagement_views_report (from this branch) - problem_engagement_detail_report (from main) - mitxonline_course_engagements_daily_report (from main) - combined_video_engagements_counts_report (from main)
- Change INNER JOIN to LEFT JOIN for dim_user - Prevents silent data loss when user_fk is NULL - Matches pattern used in problem_engagement_detail_report - Addresses Sentry bot feedback on data loss issue
- Remove incompatible not_null test on user_email column - user_email can be NULL when user_fk is NULL (due to LEFT JOIN) - Updated description to clarify nullability - Addresses Sentry feedback on test failure with NULL user_fk
- Change INNER JOIN to LEFT JOIN for unit_blocks, subsection_blocks, section_blocks - Prevents silent data loss when block foreign keys are NULL - afact_course_page_engagement can have NULL values for: - block_fk - sequential_block_fk - chapter_block_fk - Update column descriptions to clarify nullability - Addresses Sentry feedback on data loss from INNER JOINs
- Add GROUP BY course_title, courserun_readable_id - Prevents duplicate rows when int__combined__course_runs has duplicates - Avoids inflated row counts and uniqueness test failures - Addresses Sentry feedback on data fan-out issue
- Add GROUP BY on grain columns: user_email, full_name, platform, courserun_readable_id, block_fk - Wrap all non-grouped columns in MAX() aggregation - Defensive programming: prevents fan-out if dim_course_content has duplicate block_ids - Matches pattern used in problem_engagement_detail_report - Ensures uniqueness test on [user_email, courserun_readable_id, block_fk] passes Addresses Sentry feedback on potential duplicate rows
…users CRITICAL FIX: Data loss for anonymous/unmatched users Issue: - When user_fk is NULL, LEFT JOIN to dim_user produces NULL for email and full_name - GROUP BY on (user_email, full_name, ...) collapses ALL NULL users into ONE row - MAX(num_of_views) only keeps highest view count, losing other users' data - Breaks the grain: multiple distinct users merged into single row Fix: - Add page_engagement.openedx_user_id to SELECT and GROUP BY - Each user (identified by openedx_user_id) now gets their own row - Even when email/full_name are NULL, users remain distinct - Update uniqueness test from [user_email, ...] to [openedx_user_id, ...] Changes: - SQL: Add openedx_user_id to SELECT (line 6) and GROUP BY (line 31) - YAML: Add openedx_user_id column documentation with not_null test - YAML: Update uniqueness test to use openedx_user_id instead of user_email Addresses Sentry feedback on data loss for NULL user scenarios
Issue: - Model groups by: platform, openedx_user_id, courserun_readable_id, block_fk - Uniqueness test only checked: openedx_user_id, courserun_readable_id, block_fk - Missing platform causes test failure when same user+course+block on different platforms Fix: - Add platform to uniqueness test column list - Test now matches actual grain of the model Addresses Sentry feedback on uniqueness test mismatch
Resolved conflict in _reporting__models.yml by keeping both model definitions: - page_engagement_views_report (from this branch) - program_enrollment_with_user_report (from main) - problem_engagement_detail_report (already in both)
What are the relevant tickets?
https://github.com/mitodl/hq/issues/7644
Description (What does it do?)
This PR converts the Superset virtual dataset
page_engagement_viewsinto a physical dbt reporting model to improve query performance in Superset.The new
page_engagement_views_reportmodel joins the following sources to replicate the virtual dataset:afact_course_page_engagement- Page engagement facts (views per user per unit)dim_course_content- Course content hierarchy for unit, subsection, and section titles and indexesdim_user- User email and full nameint__combined__course_runs- Course titleHow can this be tested?
Build and test the model:
dbt build --select page_engagement_views_report --vars 'schema_suffix: <your_username>' --target dev_productionOnce this PR is merged and deployed, the corresponding Superset virtual dataset can be replaced with this physical table.