feat(site_analytics): app for collecting anonymized visit data for community pages#149

Open
bryangingechen wants to merge 17 commits into master from analytics-ingestion

Contributor

@bryangingechen bryangingechen commented Mar 26, 2026

This PR adds a single API endpoint, QUEUEBOARD_SITE/api/v1/analytics/collect, which is called by a small tracking script embedded on each page. No cookies are stored on visitors' devices.

We begin by embedding the tracking script in the queueboard frontend pages.

The data collected / aggregated:

  • AnalyticsPageView — raw event rows; immutable after insert; pruned after SITE_ANALYTICS_RETENTION_DAYS (default 540).
    • Fields: site, path, referrer, user_agent, occurred_at, visitor_month_hash.
  • AnalyticsDailyMetric — daily aggregate per site; unique on (site, date).
    • Fields: site, date (UTC), pageviews, unique_visitors.
  • AnalyticsMonthlyMetric — monthly aggregate per site; unique on (site, month).
    • Fields: site, month (UTC first-of-month DateField, e.g. 2026-03-01), pageviews, unique_visitors.

Visitor IPs are not stored. Only a hash of the IP combined with the current month is kept, so by design a visitor cannot be followed from one month to the next.
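As a rough illustration of the month-scoped hash described above, here is a minimal sketch. The function name `visitor_month_hash`, the use of SHA-256, the salt argument, and the example IP are all assumptions made for illustration; the PR's real implementation lives in services/hashing.py and may differ.

```python
import hashlib
from datetime import datetime, timezone


def visitor_month_hash(ip: str, when: datetime, salt: str) -> str:
    """Hash an IP together with the UTC year-month; the raw IP is never returned."""
    month = when.astimezone(timezone.utc).strftime("%Y-%m")
    # Pipe-separated fields; the salt and SHA-256 are assumptions of this sketch.
    payload = "|".join([salt, ip, month]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


march = datetime(2026, 3, 26, tzinfo=timezone.utc)
april = datetime(2026, 4, 2, tzinfo=timezone.utc)
same_month = visitor_month_hash("203.0.113.7", march, "example-salt")
next_month = visitor_month_hash("203.0.113.7", april, "example-salt")
```

Because the month is part of the hashed payload, the same IP produces unrelated values in March and April, while every visit within March maps to the same value and can be counted once as a unique visitor.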

More analytics built on top of AnalyticsPageView events may be added later.

Prepared with Claude.

bryangingechen and others added 17 commits March 25, 2026 15:19
- Resolve site config open question: SITE_ANALYTICS_ALLOWED_SITES env var
- Resolve auth open question: no per-site tokens in v1
- Document IP extraction strategy (X-Forwarded-For → REMOTE_ADDR fallback)
- Clarify CSRF handling via DRF authentication_classes pattern
- Specify backup_policy.py table placement for all three new tables
- Split chunk plan: A1 includes backup_policy + AGENTS.md, A3/A4 split by daily/monthly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
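The X-Forwarded-For → REMOTE_ADDR fallback mentioned in this commit could look roughly like the sketch below. It takes a plain mapping shaped like Django's `request.META`; choosing the first X-Forwarded-For entry is an assumption of this sketch, not something the commit message states.

```python
from typing import Mapping


def get_client_ip(meta: Mapping[str, str]) -> str:
    """Prefer the first X-Forwarded-For entry, else fall back to REMOTE_ADDR."""
    forwarded = meta.get("HTTP_X_FORWARDED_FOR", "")
    # The header may hold a comma-separated chain of addresses.
    first = forwarded.split(",")[0].strip()
    return first or meta.get("REMOTE_ADDR", "")


proxied = get_client_ip(
    {"HTTP_X_FORWARDED_FOR": "198.51.100.4, 10.0.0.1", "REMOTE_ADDR": "10.0.0.1"}
)
direct = get_client_ip({"REMOTE_ADDR": "192.0.2.9"})
```

Trusting X-Forwarded-For is only reasonable when the app sits behind a proxy that sets the header itself, which is presumably why the strategy was written down explicitly.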
…tings

- New `site_analytics` Django app with models/, services/, tasks/, tests/ layout
- AnalyticsPageView raw event model: site, path, referrer, user_agent,
  occurred_at, visitor_month_hash; indexes on (site, occurred_at) and
  (occurred_at); initial migration generated
- Settings: SITE_ANALYTICS_HASH_SALT, SITE_ANALYTICS_ALLOWED_SITES,
  SITE_ANALYTICS_RETENTION_DAYS, and task period vars in base.py
- backup_policy.py: site_analytics_analyticspageview → TRUNCATE_TABLES
  (raw rows contain visitor hashes; excluded from public backup)
- repo_check_compose.sh: add step 12/12 for site_analytics test suite
- AGENTS.md files created/updated for new app and root/qb_site indexes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ests

- POST /api/v1/analytics/collect: validate site/path, check allowlist,
  drop bots silently (204), insert AnalyticsPageView row, return 204
- services/hashing.py: get_client_ip (XFF → REMOTE_ADDR fallback) and
  compute_visitor_month_hash (pipe-separated fields, UA lowercased)
- services/bot_filter.py: substring denylist for known bots/crawlers
- .env.example: SITE_ANALYTICS_HASH_SALT, ALLOWED_SITES, RETENTION_DAYS
- tests/test_services.py: IP extraction, hash determinism/isolation, bot detection
- tests/test_collect_view.py: 204 success, 400 validation, bot drop,
  no raw IP in row, XFF hash isolation, field truncation, empty allowlist
- design doc: A1+A2 progress notes, three implementation subtleties recorded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
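A substring denylist of the kind this commit describes for services/bot_filter.py can be sketched in a few lines. The tokens in `BOT_SUBSTRINGS` and the function name `is_bot` are hypothetical; the actual list in the PR is not shown on this page.

```python
# Illustrative tokens only; not the denylist used in the PR.
BOT_SUBSTRINGS = ("bot", "crawler", "spider", "headless")


def is_bot(user_agent: str) -> bool:
    """Return True when the lowercased User-Agent contains any denylisted token."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_SUBSTRINGS)


crawler = is_bot("Mozilla/5.0 (compatible; Googlebot/2.1)")
browser = is_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0")
```

Substring matching is cheap and easy to extend, at the cost of occasionally matching a legitimate User-Agent that happens to contain one of the tokens.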
- AnalyticsDailyMetric model: site, date (UTC), pageviews, unique_visitors;
  unique constraint on (site, date); migration 0002
- aggregate_daily_metrics service: idempotent upsert over a rolling days_back
  window; preserves existing aggregates when raw rows have been pruned
- site_analytics.aggregate_daily_metrics Celery task + beat schedule entry
- backup_policy.py: site_analytics_analyticsdailymetric → RETAIN_TABLES
- tests: basic count, idempotency, multi-site, UTC date boundary, prune-safe
- design doc: A3 progress notes and three implementation subtleties recorded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
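The idempotent, prune-safe behaviour described for `aggregate_daily_metrics` can be illustrated with plain dictionaries standing in for the database tables. Everything here (the tuple layout of events, the `store` dict, the site name "queueboard") is an assumption for the sake of a self-contained sketch; the real service works on Django models.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def aggregate_daily(events, store):
    """Recompute (site, UTC date) aggregates from raw events and upsert them into store."""
    views = defaultdict(int)
    visitors = defaultdict(set)
    for site, occurred_at, visitor_hash in events:
        key = (site, occurred_at.astimezone(timezone.utc).date())
        views[key] += 1
        visitors[key].add(visitor_hash)
    # Only keys that still have raw rows are overwritten; older aggregates stay.
    for key, count in views.items():
        store[key] = {"pageviews": count, "unique_visitors": len(visitors[key])}
    return store


events = [
    ("queueboard", datetime(2026, 3, 26, 12, 0, tzinfo=timezone.utc), "a"),
    ("queueboard", datetime(2026, 3, 26, 23, 59, tzinfo=timezone.utc), "a"),
    ("queueboard", datetime(2026, 3, 27, 0, 1, tzinfo=timezone.utc), "b"),
]
# An aggregate whose raw rows were already pruned.
store = {("queueboard", date(2026, 1, 1)): {"pageviews": 9, "unique_visitors": 4}}
aggregate_daily(events, store)
aggregate_daily(events, store)  # a second run leaves the result unchanged
```

Writing the recomputed value rather than incrementing it is what makes repeated runs safe, and skipping keys with no raw rows is what keeps old aggregates intact after pruning.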
…schedules

- AnalyticsMonthlyMetric model: site, month (UTC first-of-month DateField),
  pageviews, unique_visitors; unique constraint on (site, month); migration 0003
- aggregate_monthly_metrics service: idempotent upsert over rolling months_back
  window; preserves existing aggregates when raw rows have been pruned
- prune_old_pageviews service: deletes AnalyticsPageView rows older than
  SITE_ANALYTICS_RETENTION_DAYS; aggregate tables are never pruned
- site_analytics.aggregate_monthly_metrics and site_analytics.prune_old_pageviews
  Celery tasks + beat schedule entries for monthly aggregate and prune
- backup_policy.py: site_analytics_analyticsmonthlymetric → RETAIN_TABLES
- tests: monthly count, idempotency, month boundary, prune boundaries,
  settings default retention, prune-safe aggregate preservation
- design doc: A4 progress notes; index name length limit subtlety noted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
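The retention rule behind `prune_old_pageviews` comes down to a cutoff timestamp. The sketch below uses a list of dicts in place of AnalyticsPageView rows; the helper names and the choice to keep rows that fall exactly on the cutoff are assumptions, while the 540-day default comes from the PR description.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 540  # default stated in the PR description


def prune_cutoff(now: datetime, retention_days: int = DEFAULT_RETENTION_DAYS) -> datetime:
    """Rows with occurred_at strictly before this instant count as expired."""
    return now - timedelta(days=retention_days)


def prune(rows, now, retention_days=DEFAULT_RETENTION_DAYS):
    """Return only the rows that are still inside the retention window."""
    cutoff = prune_cutoff(now, retention_days)
    return [row for row in rows if row["occurred_at"] >= cutoff]


now = datetime(2026, 3, 26, tzinfo=timezone.utc)
cutoff = prune_cutoff(now)
rows = [
    {"id": 1, "occurred_at": cutoff - timedelta(seconds=1)},  # just too old
    {"id": 2, "occurred_at": cutoff},                         # exactly on the boundary
    {"id": 3, "occurred_at": now},
]
kept = prune(rows, now)
```

Passing `now` in as an argument keeps the boundary cases easy to test without depending on the real clock.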
All three admin classes (AnalyticsPageView, AnalyticsDailyMetric,
AnalyticsMonthlyMetric) are read-only: add/change/delete permissions
disabled to enforce immutability of analytics data through the admin UI.
PageView admin shows hash prefix and truncated referrer for readability.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add SITE_ANALYTICS_REJECT_EMPTY_UA setting (default off) to optionally
drop requests with no User-Agent header before the existing bot-filter
check.  Returns 204 (same as bot drop) to avoid leaking detection logic.
Tests cover default-allow, flag-enabled drop, and flag-enabled accept.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nal ADR

CORS:
- AnalyticsCollectView now returns Access-Control-Allow-Origin: * on all
  responses and handles OPTIONS preflight so browsers on third-party static
  sites can call the endpoint directly without a server-side proxy
- Two new tests: POST includes CORS header, OPTIONS preflight returns 204

ADR:
- Convert 031 from living implementation plan to concise final decision record
- Covers architecture, models, privacy invariants, operational notes,
  consequences, and deferred v1.1 items
- Includes static-site tracking snippet (sendBeacon + fetch fallback)
  with onboarding instructions for new sites

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
write_dashboard crashed with KeyError when rendering fixtures that
predate newer Dashboard enum members (e.g. NotFromFork). Use .get()
with an empty-list default so old snapshots produce an empty table
rather than aborting page generation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
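The failure mode and the fix in this commit can be reproduced in miniature. The enum below is a stand-in: `NotFromFork` is named in the commit message, while the `Queue` member, the values, and the snapshot contents are hypothetical.

```python
from enum import Enum


class Dashboard(Enum):
    Queue = "queue"                 # hypothetical older member
    NotFromFork = "not_from_fork"   # newer member named in the commit message


# A fixture written before NotFromFork existed has no entry for it.
old_snapshot = {Dashboard.Queue: ["#1", "#2"]}

# old_snapshot[Dashboard.NotFromFork] would raise KeyError;
# .get() with an empty-list default yields an empty table instead.
rows = {kind: old_snapshot.get(kind, []) for kind in Dashboard}
```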
Adds --analytics-site / QUEUEBOARD_ANALYTICS_SITE support to dashboard.py:
- Widens CSP connect-src when an analytics host is configured.
- Injects a sendBeacon/fetch snippet before </body> on every generated
  page, including the static area_stats.html and dependency_dashboard.html.
- Snippet is omitted entirely when the flag is absent, so existing
  deployments are unaffected.

Updates docs/queueboard_main_workflow.md to pass QUEUEBOARD_ANALYTICS_SITE
(new repo secret) alongside the existing QUEUEBOARD_API_BASE_URL in all
three dashboard-generation steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
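The "inject before </body>, omit when the flag is absent" behaviour can be sketched as a small string helper. The function name and the placeholder snippet are hypothetical, and leaving a page without a closing body tag untouched is an assumption; dashboard.py may handle that case differently.

```python
from typing import Optional


def inject_before_body_close(html: str, snippet: Optional[str]) -> str:
    """Insert snippet just before the last </body>, or return html unchanged."""
    if not snippet:
        # No analytics site configured: leave the page exactly as generated.
        return html
    head, sep, tail = html.rpartition("</body>")
    if not sep:
        # No closing body tag found; this sketch leaves such pages alone.
        return html
    return head + snippet + sep + tail


page = "<html><body><h1>Queue</h1></body></html>"
tracked = inject_before_body_close(page, "<script>/* beacon */</script>")
untracked = inject_before_body_close(page, None)
```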
Adds a prose overview of how the workflow operates, a table of the two
required repo secrets (QUEUEBOARD_API_BASE_URL and the new
QUEUEBOARD_ANALYTICS_SITE), and a note that omitting the analytics
secret silently disables snippet injection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Injects a visible one-line notice ("no cookies, no IP addresses stored")
before </body> on every analytics-enabled page, styled via a new
.analytics-notice CSS rule. The notice is part of the same injection
block as the tracking script, so it appears whenever analytics is
active and is absent otherwise.

Updates the ADR (031) to:
- Add disclosure as an explicit onboarding step.
- Document the required notice wording and note that site adopters are
  responsible for adding equivalent disclosure to their own pages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Injects a visible one-line notice ("no cookies, no IP addresses stored")
before </body> on every analytics-enabled page, styled via a new
.analytics-notice CSS rule. The notice appears whenever analytics is
active and is absent otherwise.

Adds a "Disclosure and privacy regulations" section to the ADR (031)
covering ePrivacy Art. 5(3), GDPR Recital 26 / Art. 13, CJEU Breyer
(C-582/14), EDPB Guidelines 2/2023, and CNIL consent-exempt analytics
guidance — all with verified source URLs. Documents the practical
position: no consent banner required, but a brief privacy notice is
recommended as good practice and to satisfy GDPR Art. 13 under the
cautious reading.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mock django.utils.timezone.now in tests that compare hardcoded fixture
dates against the real clock. Without this, task tests flake near
midnight boundaries and prune service tests rot as the hardcoded dates
age past their retention windows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
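The clock-mocking technique from this commit is shown below without a Django dependency. The `clock` namespace is a stand-in for `django.utils.timezone`, and `is_expired` is a hypothetical helper; in the real tests the patch target would be `django.utils.timezone.now`.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from unittest import mock

# Stand-in for django.utils.timezone, so the sketch runs without Django.
clock = SimpleNamespace(now=lambda: datetime.now(timezone.utc))


def is_expired(occurred_at: datetime, retention_days: int = 540) -> bool:
    """True when the event is more than retention_days old relative to the clock."""
    return (clock.now() - occurred_at).days > retention_days


fixed = datetime(2026, 3, 26, tzinfo=timezone.utc)
with mock.patch.object(clock, "now", return_value=fixed):
    # With the clock pinned, hardcoded fixture dates keep their meaning forever.
    recent = is_expired(datetime(2026, 3, 1, tzinfo=timezone.utc))
    ancient = is_expired(datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Without the patch, the fixture from March 2026 would eventually age past the retention window and the assertion on it would start failing.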
Replaces the fixed SITE_ANALYTICS_HASH_SALT env var with a
randomly-generated per-month salt stored in a new SiteAnalyticsSalt
model. A new Celery beat task (site_analytics.rotate_salt) runs at
midnight UTC on the 1st of each month, creates a fresh salt, and
deletes the previous row atomically, providing forward secrecy: past
hashes cannot be re-derived even if the current salt leaks.

Changes:
- New SiteAnalyticsSalt model + migration 0004.
- compute_visitor_hash() replaces compute_visitor_month_hash(): drops
  the explicit YYYY-MM argument; cross-month isolation is now provided
  by the rotating salt instead.
- 60-second in-process salt cache with _reset_salt_cache() helper for
  test isolation; falls back to SITE_ANALYTICS_HASH_SALT until the
  first rotation task runs.
- rotate_salt_task registered in beat schedule (monthly crontab).
- SiteAnalyticsSalt added to TRUNCATE_TABLES in backup_policy.py
  (contains the live secret; excluded from sanitized backups).
- Updated tests: ComputeVisitorHashTests, cache reset in collect-view
  setUp, new test_salt.py covering rotation task behaviour.
- ADR and AGENTS.md updated to reflect new privacy model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
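The rotating salt with a short in-process cache can be sketched as follows. `SaltStore` is an in-memory stand-in for the single-row SiteAnalyticsSalt table, and `CachedSalt`, the SHA-256 choice, and the exact fields fed into `compute_visitor_hash` are assumptions; only the 60-second cache, the fallback to an env-provided salt, and the lowercased User-Agent come from the commit messages.

```python
import hashlib
import secrets
import time


class SaltStore:
    """In-memory stand-in for the single-row SiteAnalyticsSalt table (hypothetical)."""

    def __init__(self):
        self.current = None

    def rotate(self):
        # Overwriting the only copy discards the previous salt.
        self.current = secrets.token_hex(32)
        return self.current


class CachedSalt:
    """Serve the salt from memory and re-read the store after ttl seconds."""

    def __init__(self, store, fallback, ttl=60.0, clock=time.monotonic):
        self._store, self._fallback = store, fallback
        self._ttl, self._clock = ttl, clock
        self._value, self._loaded_at = None, 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now - self._loaded_at >= self._ttl:
            # Fall back to the env-style salt until a rotation has happened.
            self._value = self._store.current or self._fallback
            self._loaded_at = now
        return self._value

    def reset(self):
        """Drop the cached value, e.g. between tests."""
        self._value = None


def compute_visitor_hash(ip, user_agent, salt):
    """Hash pipe-separated fields with the current salt; the UA is lowercased."""
    payload = "|".join([salt, ip, user_agent.lower()])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ticks = [0.0]  # fake clock so the example does not need to sleep
store = SaltStore()
cache = CachedSalt(store, fallback="env-salt", clock=lambda: ticks[0])
before = compute_visitor_hash("203.0.113.7", "Mozilla/5.0", cache.get())
store.rotate()
ticks[0] = 61.0  # past the 60-second cache window
after = compute_visitor_hash("203.0.113.7", "Mozilla/5.0", cache.get())
```

Once the old salt has been discarded, the `before` hash can no longer be recomputed from the IP and User-Agent, which is the forward-secrecy property the commit describes.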