task(content analytics) #34395: Implement the Engagement SQL structures #34512

Open
jcastro-dotcms wants to merge 1 commit into main from issue-34395-Implement-the-Engagement-SQL-structures

jcastro-dotcms commented Feb 5, 2026

Proposed Changes

Implements the complete engagement analytics SQL structures and data pipeline for dotCMS content analytics, enabling GA4-style engagement metrics tracking at scale.

Changes Overview

This PR introduces a multi-layered analytics pipeline in ClickHouse that computes session engagement metrics efficiently through incremental aggregation and daily rollups.

Database Layer (ClickHouse)

New Pipeline Architecture:

  events (raw)
    ↓ (real-time MV)
  session_states (incremental aggregates)
    ↓ (refreshable MV every 15min)
  session_facts (finalized sessions)
    ↓ (refreshable MVs every 15min)
  engagement_daily + sessions_by_*_daily (dashboard rollups)
    ↓
  CubeJS API

Core Tables & Views Added:

  1. Session Aggregation Layer
    - session_states (AggregatingMergeTree) - Incremental, mergeable session states
    - session_states_mv - Real-time MV aggregating events into session states
    - Supports late-arriving events through merge semantics
  2. Session Facts Layer
    - session_facts (ReplacingMergeTree) - Finalized session snapshots with engagement flags
    - session_facts_rmv - Refreshable MV (every 15min) finalizing sessions from last 72 hours
    - Computes engagement flag based on: duration > 10s OR pageviews >= 2 OR conversions >= 1
  3. Daily Rollup Tables
    - engagement_daily - Daily engagement KPIs (total/engaged sessions, durations, event counts)
    - sessions_by_device_daily - Sessions by device category (Desktop/Mobile/Tablet/Other)
    - sessions_by_browser_daily - Sessions by browser family (Chrome/Safari/Firefox/Edge/Other)
    - sessions_by_language_daily - Sessions by language ID
    - Each rollup table has a corresponding refreshable MV that recomputes the last 90 days
  4. Classification Tables
    - device_category_map - User-agent to device category mapping
    - device_category_fallback_rules - Priority-ordered fallback heuristics for device detection
    - browser_family_map - User-agent to browser family mapping
    - browser_family_fallback_rules - Priority-ordered fallback heuristics for browser detection
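
The map-plus-fallback lookup described for the classification tables can be sketched in plain JavaScript. This is only an illustration: the sample user agent, the rule patterns, the substring matching, and the "lower priority number wins" ordering are all assumptions, since the PR description does not show the table contents or the matching logic used in init.sql.

```javascript
// Hypothetical stand-in for device_category_map (exact user-agent matches).
const deviceCategoryMap = new Map([
  ["ExampleAgent/1.0 (Desktop)", "Desktop"],
]);

// Hypothetical stand-in for device_category_fallback_rules.
// Assumption: a lower priority number is evaluated first.
const fallbackRules = [
  { priority: 20, pattern: "mobile", category: "Mobile" },
  { priority: 10, pattern: "tablet", category: "Tablet" },
  { priority: 30, pattern: "windows", category: "Desktop" },
];

function classifyDevice(userAgent) {
  // 1. Exact match against the mapping table.
  const exact = deviceCategoryMap.get(userAgent);
  if (exact !== undefined) return exact;

  // 2. Otherwise walk the fallback rules in priority order.
  const ua = userAgent.toLowerCase();
  const ordered = [...fallbackRules].sort((a, b) => a.priority - b.priority);
  const hit = ordered.find((rule) => ua.includes(rule.pattern));

  // 3. Nothing matched: use the catch-all bucket.
  return hit ? hit.category : "Other";
}
```

Browser family detection would follow the same shape with browser_family_map and browser_family_fallback_rules.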

CubeJS Schema Layer

New Cubes:

  1. EngagementDaily (docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/EngagementDaily.js)
    - Measures: engagement rate, conversion rate, avg interactions, avg session time
    - Dimensions: customer_id, cluster_id, context_site_id, day
    - Enables KPI cards and trend charts
  2. SessionsByDeviceDaily (docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/SessionsByDeviceDaily.js)
    - Measures: total/engaged sessions, engagement rate within device, avg engaged session time
    - Dimensions: device_category (Desktop/Mobile/Tablet/Other)
  3. SessionsByBrowserDaily (docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/SessionsByBrowserDaily.js)
    - Measures: total/engaged sessions, engagement rate within browser, avg engaged session time
    - Dimensions: browser_family (Chrome/Safari/Firefox/Edge/Other)
  4. SessionsByLanguageDaily (docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/SessionsByLanguageDaily.js)
    - Measures: total/engaged sessions, engagement rate within language, avg engaged session time
    - Dimensions: language_id (dotCMS language ID as String)
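
When a rate measure is read across several days of a rollup table, one common approach is to divide the summed counts rather than average the per-day rates. The sketch below shows that calculation in plain JavaScript; the row field names are hypothetical, and whether the cubes in this PR define engagement rate this way is an assumption, since the schema files are not shown here.

```javascript
// Hypothetical rollup rows, one per day (field names are illustrative).
const rows = [
  { day: "2026-02-01", totalSessions: 100, engagedSessions: 50 },
  { day: "2026-02-02", totalSessions: 300, engagedSessions: 90 },
];

// Ratio of sums: weights each day by its session volume.
function engagementRate(dailyRows) {
  const total = dailyRows.reduce((sum, r) => sum + r.totalSessions, 0);
  const engaged = dailyRows.reduce((sum, r) => sum + r.engagedSessions, 0);
  return total === 0 ? 0 : engaged / total;
}

// Average of daily rates, shown only for contrast.
function averageOfDailyRates(dailyRows) {
  if (dailyRows.length === 0) return 0;
  const rates = dailyRows.map((r) => r.engagedSessions / r.totalSessions);
  return rates.reduce((sum, x) => sum + x, 0) / rates.length;
}

const rate = engagementRate(rows);            // 140 / 400
const naiveRate = averageOfDailyRates(rows);  // (0.5 + 0.3) / 2
```

The two results differ because the second day carries three times the sessions of the first; the ratio of sums reflects that, while the plain average does not.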

Updated:

  • cube.js - Added new cubes to security whitelist
  • EventSummary.js - Fixed filter params to reference the correct cube name (they previously referenced ContentAttribution)

Key Features

Engagement Rules (GA4-aligned)

A session is marked as "engaged" if ANY of:

  • Duration > 10 seconds
  • Pageviews >= 2
  • Conversions >= 1
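
The three rules above reduce to a single boolean expression. A minimal sketch in JavaScript, assuming hypothetical field names (the actual session_facts columns are not shown in this description):

```javascript
// Field names are hypothetical; thresholds come from the rules listed above.
function isEngaged(session) {
  const { durationSeconds, pageviews, conversions } = session;
  return durationSeconds > 10 || pageviews >= 2 || conversions >= 1;
}

const examples = [
  { durationSeconds: 4, pageviews: 1, conversions: 0 },  // no rule matches
  { durationSeconds: 10, pageviews: 1, conversions: 0 }, // exactly 10s is not > 10s
  { durationSeconds: 11, pageviews: 1, conversions: 0 }, // duration rule
  { durationSeconds: 3, pageviews: 2, conversions: 0 },  // pageview rule
  { durationSeconds: 2, pageviews: 1, conversions: 1 },  // conversion rule
];

const flags = examples.map(isEngaged);
```

Note the boundary: the duration rule is strict, so a session lasting exactly 10 seconds with one pageview and no conversions is not engaged.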

Performance Optimizations

  • Incremental aggregation: Events → session states via real-time MV
  • Bounded refresh windows: Only recompute recent data (72h for sessions, 90d for rollups)
  • Daily pre-aggregation: Dashboard queries hit small rollup tables, not raw events
  • Late-event handling: Each refresh recomputes its bounded window, so late-arriving events inside that window are picked up on the next run
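
The reason incremental aggregation tolerates late events is that partial session states can be merged in any order. The sketch below illustrates that idea in plain JavaScript rather than ClickHouse SQL; the state fields and event shape are hypothetical and are not taken from init.sql.

```javascript
// Build a partial state from a single event (timestamps in epoch milliseconds).
function stateFromEvent(event) {
  return {
    minTs: event.ts,
    maxTs: event.ts,
    pageviews: event.type === "pageview" ? 1 : 0,
    conversions: event.type === "conversion" ? 1 : 0,
    events: 1,
  };
}

// Merge two partial states. min, max, and sums are associative and
// commutative, so the result does not depend on arrival order.
function mergeStates(a, b) {
  return {
    minTs: Math.min(a.minTs, b.minTs),
    maxTs: Math.max(a.maxTs, b.maxTs),
    pageviews: a.pageviews + b.pageviews,
    conversions: a.conversions + b.conversions,
    events: a.events + b.events,
  };
}

// Two events arrive on time and are aggregated first.
const onTime = [
  { ts: 1000, type: "pageview" },
  { ts: 9000, type: "pageview" },
].map(stateFromEvent).reduce(mergeStates);

// A conversion from the middle of the session arrives late.
const late = stateFromEvent({ ts: 4000, type: "conversion" });

const merged = mergeStates(onTime, late);
const durationSeconds = (merged.maxTs - merged.minTs) / 1000;
```

Merging the late state into the earlier aggregate gives the same result as if all three events had arrived together, which is the property a later finalization step can rely on.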

Multi-tenant & Multi-cluster Support

All tables scoped by:

  • customer_id (tenant identifier)
  • cluster_id (environment: prod/stage/etc.)

Files Changed

  • docker/docker-compose-examples/analytics/setup/config/dev/cube/cube.js (+10 lines)
  • docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/EngagementDaily.js (+191 lines, new)
  • docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/EventSummary.js (bug fix)
  • docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/SessionsByBrowserDaily.js (+110 lines, new)
  • docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/SessionsByDeviceDaily.js (+114 lines, new)
  • docker/docker-compose-examples/analytics/setup/config/dev/cube/schema/SessionsByLanguageDaily.js (+111 lines, new)
  • docker/docker-compose-examples/analytics/setup/db/clickhouse/init-scripts/init.sql (+1144 lines)

Test Plan

  • Verify ClickHouse schema creation (tables, MVs, initial data)
  • Test event ingestion → session_states aggregation
  • Verify session finalization with engagement flag calculation
  • Test daily rollup generation for all dimension tables
  • Validate CubeJS cube queries return correct metrics
  • Test multi-tenant data isolation
  • Verify late-event handling and MV refresh behavior

This PR fixes: #34395

/* Partitioning note:
We partition by a hash of (customer, cluster) to spread writes and merges.
This avoids a single giant partition for big tenants and keeps merges parallelizable. */
PARTITION BY sipHash64(customer_id, cluster_id) % 64
I think we should also add a date to this partition key, perhaps based on "min_ts_state".

Development

Successfully merging this pull request may close these issues.

[TASK] Implement the Engagement SQL structures
[TASK] Generate CubeJS queries for pulling data into dashboard

2 participants