Skip to content

Add page_engagement_views_report#1915

Merged
quazi-h merged 16 commits intomainfrom
page-engagement-views-report
Mar 6, 2026
Merged

Add page_engagement_views_report#1915
quazi-h merged 16 commits intomainfrom
page-engagement-views-report

Conversation

@quazi-h
Copy link
Copy Markdown
Contributor

@quazi-h quazi-h commented Feb 17, 2026

What are the relevant tickets?

https://github.com/mitodl/hq/issues/7644

Description (What does it do?)

This PR converts the Superset virtual dataset page_engagement_views into a physical dbt reporting model to improve query performance in Superset.

The new page_engagement_views_report model joins the following sources to replicate the virtual dataset:

  • afact_course_page_engagement - Page engagement facts (views per user per unit)
  • dim_course_content - Course content hierarchy for unit, subsection, and section titles and indexes
  • dim_user - User email and full name
  • int__combined__course_runs - Course title

How can this be tested?

Build and test the model:

dbt build --select page_engagement_views_report --vars 'schema_suffix: <your_username>' --target dev_production

Once this PR is merged and deployed, the corresponding Superset virtual dataset can be replaced with this physical table.

- Create reporting model joining afact_course_page_engagement with course content hierarchy and user data
- Join dim_course_content for unit, subsection, and section titles and indexes
- Join dim_user for user email and full name
- Join int__combined__course_runs for course title
- Add full documentation and tests to _reporting__models.yml
- Replaces Superset virtual dataset with physical table
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new physical dbt reporting model page_engagement_views_report to replace a virtual dataset in Superset, improving query performance. The model aggregates page engagement data by joining course page engagement facts with course content hierarchy and user information.

Changes:

  • Created page_engagement_views_report.sql to materialize page engagement data with user, unit, subsection, section, and course details
  • Added model documentation and column tests in _reporting__models.yml

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/ol_dbt/models/reporting/page_engagement_views_report.sql New SQL model joining afact_course_page_engagement with dim_course_content, dim_user, and int__combined__course_runs to create a denormalized reporting view
src/ol_dbt/models/reporting/_reporting__models.yml Added model and column documentation with basic not_null tests for the new reporting model

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

inner join section_blocks
on page_engagement.chapter_block_fk = section_blocks.block_id
inner join dim_user
on page_engagement.openedx_user_id = dim_user.mitxonline_openedx_user_id
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The join to dim_user only matches users from the 'mitxonline' platform using mitxonline_openedx_user_id. This will fail to join users from other platforms like 'edxorg', 'mitxpro', or 'residential'.

The existing learner_engagement_report.sql (lines 82-89) shows the correct pattern: use conditional joins based on the platform field to match the appropriate platform-specific user ID column. For example, use mitxonline_openedx_user_id when platform='mitxonline' and edxorg_openedx_user_id when platform='edxorg'.

Suggested change
on page_engagement.openedx_user_id = dim_user.mitxonline_openedx_user_id
on (
(page_engagement.platform = 'mitxonline'
and page_engagement.openedx_user_id = dim_user.mitxonline_openedx_user_id)
or (page_engagement.platform = 'edxorg'
and page_engagement.openedx_user_id = dim_user.edxorg_openedx_user_id)
or (page_engagement.platform = 'mitxpro'
and page_engagement.openedx_user_id = dim_user.mitxpro_openedx_user_id)
or (page_engagement.platform = 'residential'
and page_engagement.openedx_user_id = dim_user.residential_openedx_user_id)
)

Copilot uses AI. Check for mistakes.
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql
Comment thread src/ol_dbt/models/reporting/_reporting__models.yml
Comment on lines +14 to +16
group by
course_title
, courserun_readable_id
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GROUP BY in the course_runs CTE is unnecessary since int__combined__course_runs already contains one row per course run with unique courserun_readable_id values. This adds overhead without changing the result.

Consider removing the GROUP BY clause and using a simple SELECT DISTINCT or just selecting the columns directly if the source guarantees uniqueness.

Suggested change
group by
course_title
, courserun_readable_id

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +25 to +33
, subsection_blocks as (
select * from {{ ref('dim_course_content') }}
where is_latest = true
)

, section_blocks as (
select * from {{ ref('dim_course_content') }}
where is_latest = true
)
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subsection_blocks and section_blocks CTEs are redundant - they both select all columns from dim_course_content with only the is_latest = true filter. This creates unnecessary code duplication and could lead to maintenance issues if filters need to be updated.

Consider consolidating these into a single CTE (e.g., course_content_blocks) and reusing it for both joins, or selecting only the specific columns needed for each join to make the purpose clearer.

Copilot uses AI. Check for mistakes.
Comment on lines +14 to +18
group by
course_title
, courserun_readable_id
)

Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GROUP BY clause is unnecessary here because courserun_readable_id is already unique in int__combined__course_runs (confirmed by the unique test on line 235 of _combined_models.yml). The relationship between courserun_readable_id and course_title is 1:1.

Removing the GROUP BY will improve query performance by avoiding an unnecessary aggregation operation.

Suggested change
group by
course_title
, courserun_readable_id
)
)

Copilot uses AI. Check for mistakes.
@quazi-h
Copy link
Copy Markdown
Contributor Author

quazi-h commented Feb 20, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 20, 2026

@quazi-h I've opened a new pull request, #1921, to work on those changes. Once the pull request is ready, I'll request review from you.

…ers, and uniqueness test (#1921)

* Initial plan
* Apply review feedback to page_engagement_views_report
- Remove unnecessary GROUP BY from course_runs CTE
- Add block_category filters to subsection_blocks (sequential) and section_blocks (chapter) CTEs
- Fix dim_user join to use platform-conditional logic for all platforms
- Add compound uniqueness test on [user_email, courserun_readable_id, unit_title]
Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com>
@quazi-h
Copy link
Copy Markdown
Contributor Author

quazi-h commented Feb 20, 2026

@copilot review this branch with the new changes

Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 20, 2026

@quazi-h I've opened a new pull request, #1923, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Comment thread src/ol_dbt/models/reporting/_reporting__models.yml Outdated
@quazi-h
Copy link
Copy Markdown
Contributor Author

quazi-h commented Feb 23, 2026

@copilot please open pull request to apply changes based on the comments in this thread

Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 23, 2026

@quazi-h I've opened a new pull request, #1938, to work on those changes. Once the pull request is ready, I'll request review from you.

@rachellougee rachellougee self-assigned this Feb 23, 2026
Copy link
Copy Markdown
Contributor

@rachellougee rachellougee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to build the dbt model run due to a performance issue with the join and had to stop the query before it ran too long. Please see the comment below and review the uniqueness test as well

Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql Outdated
Comment thread src/ol_dbt/models/reporting/_reporting__models.yml Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql Outdated
quazi-h and others added 4 commits February 27, 2026 13:33
Co-authored-by: Rachel Lougee <rlougee@mit.edu>
SQL fixes:
- Remove duplicate WHERE clauses in subsection_blocks and section_blocks CTEs
- Add block_fk to SELECT for proper unique identification

Documentation/Testing fixes:
- Add block_fk column documentation to YAML
- Change uniqueness test from unit_title to block_fk (prevents false failures)
- Remove duplicate test definition

Addresses Sentry feedback on uniqueness test bug
Resolved conflict in _reporting__models.yml by keeping all four model definitions:
- page_engagement_views_report (from this branch)
- problem_engagement_detail_report (from main)
- mitxonline_course_engagements_daily_report (from main)
- combined_video_engagements_counts_report (from main)
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql Outdated
- Change INNER JOIN to LEFT JOIN for dim_user
- Prevents silent data loss when user_fk is NULL
- Matches pattern used in problem_engagement_detail_report
- Addresses Sentry bot feedback on data loss issue
Comment thread src/ol_dbt/models/reporting/_reporting__models.yml Outdated
- Remove incompatible not_null test on user_email column
- user_email can be NULL when user_fk is NULL (due to LEFT JOIN)
- Updated description to clarify nullability
- Addresses Sentry feedback on test failure with NULL user_fk
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql Outdated
- Change INNER JOIN to LEFT JOIN for unit_blocks, subsection_blocks, section_blocks
- Prevents silent data loss when block foreign keys are NULL
- afact_course_page_engagement can have NULL values for:
  - block_fk
  - sequential_block_fk
  - chapter_block_fk
- Update column descriptions to clarify nullability
- Addresses Sentry feedback on data loss from INNER JOINs
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql
- Add GROUP BY course_title, courserun_readable_id
- Prevents duplicate rows when int__combined__course_runs has duplicates
- Avoids inflated row counts and uniqueness test failures
- Addresses Sentry feedback on data fan-out issue
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql
- Add GROUP BY on grain columns: user_email, full_name, platform, courserun_readable_id, block_fk
- Wrap all non-grouped columns in MAX() aggregation
- Defensive programming: prevents fan-out if dim_course_content has duplicate block_ids
- Matches pattern used in problem_engagement_detail_report
- Ensures uniqueness test on [user_email, courserun_readable_id, block_fk] passes

Addresses Sentry feedback on potential duplicate rows
Comment thread src/ol_dbt/models/reporting/page_engagement_views_report.sql
…users

CRITICAL FIX: Data loss for anonymous/unmatched users

Issue:
- When user_fk is NULL, LEFT JOIN to dim_user produces NULL for email and full_name
- GROUP BY on (user_email, full_name, ...) collapses ALL NULL users into ONE row
- MAX(num_of_views) only keeps highest view count, losing other users' data
- Breaks the grain: multiple distinct users merged into single row

Fix:
- Add page_engagement.openedx_user_id to SELECT and GROUP BY
- Each user (identified by openedx_user_id) now gets their own row
- Even when email/full_name are NULL, users remain distinct
- Update uniqueness test from [user_email, ...] to [openedx_user_id, ...]

Changes:
- SQL: Add openedx_user_id to SELECT (line 6) and GROUP BY (line 31)
- YAML: Add openedx_user_id column documentation with not_null test
- YAML: Update uniqueness test to use openedx_user_id instead of user_email

Addresses Sentry feedback on data loss for NULL user scenarios
Comment thread src/ol_dbt/models/reporting/_reporting__models.yml Outdated
Issue:
- Model groups by: platform, openedx_user_id, courserun_readable_id, block_fk
- Uniqueness test only checked: openedx_user_id, courserun_readable_id, block_fk
- Missing platform causes test failure when same user+course+block on different platforms

Fix:
- Add platform to uniqueness test column list
- Test now matches actual grain of the model

Addresses Sentry feedback on uniqueness test mismatch
@quazi-h quazi-h requested a review from rachellougee March 4, 2026 20:00
Resolved conflict in _reporting__models.yml by keeping both model definitions:
- page_engagement_views_report (from this branch)
- program_enrollment_with_user_report (from main)
- problem_engagement_detail_report (already in both)
Copy link
Copy Markdown
Contributor

@rachellougee rachellougee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@quazi-h quazi-h merged commit 43ccd17 into main Mar 6, 2026
6 checks passed
@quazi-h quazi-h deleted the page-engagement-views-report branch March 6, 2026 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants