Epic: BNPL Data Engineering Pipeline - Historical Data Ingestion & ML Preparation #9

Epic: BNPL Data Engineering Pipeline

Business Objective

Build production-grade data infrastructure for BNPL transaction analysis and ML model development. Ingest a comprehensive historical dataset and establish a scalable processing pipeline.

Architecture Overview

  • Data Flow: simtom API → BigQuery (raw) → dbt (transform) → Analytics/ML tables
  • Orchestration: Airflow for daily processing patterns
  • Quality: Great Expectations for data validation
  • Pattern: ELT leveraging BigQuery compute with dbt transformations

Repository Structure

flit-data-platform/
├── scripts/bnpl/              # Data ingestion and utilities
├── airflow/                   # Orchestration infrastructure
├── great_expectations/        # Data quality framework
├── models/staging/bnpl/       # dbt staging models
├── models/intermediate/bnpl/  # dbt intermediate models
└── models/marts/bnpl/         # dbt analytics models

Implementation Roadmap

PR 1: Infrastructure Foundation

Branch: feat/bnpl-infrastructure
Description: Core infrastructure setup and project foundation

Scope:

  • Directory structure establishment
  • BigQuery dataset provisioning (flit_bnpl_raw, flit_bnpl_intermediate, flit_bnpl_marts), as sketched after this list
  • Docker Compose configuration for local Airflow development
  • Base API client framework for simtom integration
  • Project documentation and architectural decisions
  • Dependency management (extend root requirements.txt)
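
A minimal provisioning sketch for the datasets above, assuming the google-cloud-bigquery client and a project id taken from the environment; the location is an illustrative choice, not a decision recorded here:

import os
from google.cloud import bigquery

# Assumes GCP_PROJECT is set; in practice this would come from deployment config.
client = bigquery.Client(project=os.environ["GCP_PROJECT"])

for name in ("flit_bnpl_raw", "flit_bnpl_intermediate", "flit_bnpl_marts"):
    dataset = bigquery.Dataset(f"{client.project}.{name}")
    dataset.location = "US"  # assumed region
    client.create_dataset(dataset, exists_ok=True)  # idempotent across reruns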

Files:

  • scripts/bnpl/__init__.py
  • scripts/bnpl/api_client.py (interface definition)
  • airflow/docker-compose.yml
  • docs/bnpl_pipeline_architecture.md
  • requirements.txt (updated with new dependencies)

PR 2: Historical Data Ingestion Engine

Branch: feat/bnpl-ingestion
Description: Comprehensive historical data acquisition with realistic transaction patterns

Scope:

  • Production-grade API client with exponential backoff and circuit breaker (see the sketch after this list)
  • Historical data ingestion orchestrator
  • BigQuery staging infrastructure with date partitioning
  • JSON schema extraction and normalization (transaction/customer separation)
  • Comprehensive error handling and observability
  • Data volume and pattern validation
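
A hedged sketch of the retry and circuit-breaker behavior, using requests against the endpoint listed under API Integration below; the thresholds, timeout, and method names are assumptions, not the final api_client.py interface:

import time
import requests

SIMTOM_URL = "https://simtom-production.up.railway.app/stream/bnpl"

class CircuitOpenError(RuntimeError):
    """Raised once repeated failures have tripped the breaker."""

class SimtomClient:
    def __init__(self, max_retries=5, base_delay=1.0, failure_threshold=3):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def fetch_batch(self, payload: dict) -> dict:
        if self._consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("circuit open after repeated failures")
        for attempt in range(self.max_retries):
            try:
                resp = requests.post(SIMTOM_URL, json=payload, timeout=30)
                resp.raise_for_status()
                self._consecutive_failures = 0  # success closes the breaker
                return resp.json()
            except requests.RequestException:
                time.sleep(self.base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        self._consecutive_failures += 1
        raise requests.RequestException("retries exhausted for batch request")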

Key Decisions:

  • Dataset Size: 1.8M transactions (realistic volume for ML training)
  • Time Distribution: Leverage simtom's realistic holiday/weekend patterns
  • Payload Strategy: Dynamic total_records based on day type (weekdays: 5-7k, weekends: 3-4k, holidays: 8-10k); see the sketch after this list
  • Scope: Historical ingestion only - production streaming pipeline addressed separately
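
The dynamic total_records decision might reduce to something like this sketch; the ranges come from the decision above, while the holiday set and function name are placeholders for data_patterns.py:

import random
from datetime import date

HOLIDAYS = {date(2024, 11, 29), date(2024, 12, 25)}  # illustrative subset

def records_for_day(day: date) -> int:
    """Pick a total_records value matching the day-type ranges above."""
    if day in HOLIDAYS:
        return random.randint(8_000, 10_000)  # holiday spike
    if day.weekday() >= 5:  # Saturday or Sunday
        return random.randint(3_000, 4_000)   # weekend dip
    return random.randint(5_000, 7_000)       # ordinary weekday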

Files:

  • scripts/bnpl/ingest_historical_bnpl.py
  • scripts/bnpl/api_client.py (complete implementation)
  • scripts/bnpl/data_patterns.py (holiday/weekend logic)
  • scripts/bnpl/schema_utils.py (JSON normalization)

API Integration:

  • URL: https://simtom-production.up.railway.app/stream/bnpl
  • Dynamic payload generation based on date characteristics
  • Automatic schema extraction for transaction/customer entity separation
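
The entity separation might be sketched as below, assuming each record nests its customer profile under a customer key; the field names are hypothetical placeholders for what schema_utils.py would actually discover:

def split_record(record: dict) -> tuple[dict, dict]:
    """Separate one raw JSON record into transaction and customer rows."""
    customer = record.get("customer", {})  # assumed nesting
    transaction = {k: v for k, v in record.items() if k != "customer"}
    # Keep the join key on the transaction side for downstream dbt joins.
    transaction["customer_id"] = customer.get("customer_id")
    return transaction, customer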

PR 3: dbt Transformation Pipeline

Branch: feat/bnpl-dbt-models
Description: Scalable data transformation pipeline with proper entity modeling

Scope:

  • Staging models (JSON → normalized relational structures)
  • Intermediate models (business logic, data enrichment)
  • Mart models (analytics-optimized tables)
  • Comprehensive dbt testing framework
  • Model lineage documentation
  • Performance optimization (clustering, partitioning)

Entity Separation Strategy:

  • Extract customer profiles from transaction JSON in staging layer
  • Deduplicate and enrich customer data in intermediate layer
  • Create separate transaction/customer marts with proper relationships
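
In SQL (the language of the dbt models listed below), the staging-layer extraction might look like this sketch; the source name, JSON column, and extracted fields are assumptions, not the final model:

-- Hypothetical sketch of stg_bnpl_extracted_customers.sql
with raw as (
    select raw_json from {{ source('flit_bnpl_raw', 'bnpl_transactions') }}
)
select distinct
    json_extract_scalar(raw_json, '$.customer.customer_id') as customer_id,
    json_extract_scalar(raw_json, '$.customer.credit_tier') as credit_tier
from raw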

dbt Models:

  • models/staging/bnpl/stg_bnpl_raw_transactions.sql
  • models/staging/bnpl/stg_bnpl_extracted_customers.sql
  • models/intermediate/bnpl/int_bnpl_transactions_enriched.sql
  • models/intermediate/bnpl/int_bnpl_customer_profiles.sql
  • models/marts/bnpl/mart_bnpl_transactions.sql
  • models/marts/bnpl/mart_bnpl_customers.sql
  • models/marts/bnpl/mart_bnpl_ml_features.sql

PR 4: Data Quality Framework

Branch: feat/bnpl-data-quality
Description: Comprehensive data validation and monitoring infrastructure

Scope:

  • Great Expectations suite configuration
  • Business rule validation (transaction amounts, customer behavior), as sketched after this list
  • Statistical pattern validation (holiday spikes, weekend dips)
  • Data freshness and completeness monitoring
  • Quality gate integration with dbt
  • Alerting framework for quality failures
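
A minimal business-rule check using Great Expectations' pandas interface (the API varies across GE versions; the column name and bounds are assumptions, not the contents of bnpl_business_rules.json):

import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({"transaction_amount": [120.0, 89.5, 15.0]}))

# Business rule: BNPL amounts must be positive and under an assumed cap.
result = df.expect_column_values_to_be_between(
    "transaction_amount", min_value=0, max_value=10_000
)
assert result.success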

Files:

  • great_expectations/great_expectations.yml
  • great_expectations/expectations/bnpl_business_rules.json
  • great_expectations/expectations/bnpl_statistical_patterns.json
  • great_expectations/checkpoints/daily_quality_gate.yml

PR 5: Airflow Orchestration Pipeline

Branch: feat/bnpl-airflow-pipeline
Description: Production-grade workflow orchestration

Scope:

  • Daily processing DAG with proper dependency management
  • Task-level error handling and retry strategies
  • dbt integration with proper artifact management
  • Great Expectations checkpoint integration
  • Historical backfill simulation capabilities
  • Monitoring and alerting integration

Files:

  • airflow/dags/daily_bnpl_processing.py
  • airflow/plugins/bnpl_operators.py
  • airflow/utils/bnpl_utils.py

DAG Architecture:

extract_daily_batch >> validate_raw_schema >> 
normalize_entities >> run_dbt_pipeline >> 
validate_business_rules >> publish_quality_metrics
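
The chain above could be wired roughly as follows; the callables are placeholders, and the schedule and retry settings are assumptions (the schedule argument as written applies to Airflow 2.4+):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def _todo(**_):
    raise NotImplementedError  # placeholder task body

with DAG(
    dag_id="daily_bnpl_processing",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # assumed cadence
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=_todo)
        for name in (
            "extract_daily_batch", "validate_raw_schema", "normalize_entities",
            "run_dbt_pipeline", "validate_business_rules", "publish_quality_metrics",
        )
    ]
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream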

PR 6: ML Feature Engineering

Branch: feat/bnpl-ml-features
Description: Advanced analytics and ML preparation infrastructure

Scope:

  • Time-series feature engineering (rolling aggregations, trends), as sketched after this list
  • Customer behavioral scoring (payment patterns, risk indicators)
  • Transaction categorization and enrichment
  • Feature store preparation
  • Model training dataset generation
  • Analytics dashboard preparation
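
One shape the rolling aggregations could take in pandas; the window size and column names are illustrative, not the final feature_pipeline.py:

import pandas as pd

def add_rolling_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Add an assumed per-customer 7-day rolling spend feature."""
    # transaction_date must be a datetime column for the "7D" window.
    tx = tx.sort_values(["customer_id", "transaction_date"])
    rolled = (
        tx.groupby("customer_id")
          .rolling("7D", on="transaction_date")["amount"]
          .sum()
    )
    # Positional assignment is safe because tx was sorted by the group key.
    tx["rolling_7d_spend"] = rolled.to_numpy()
    return tx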

Files:

  • models/marts/bnpl/mart_bnpl_ml_training_set.sql
  • models/marts/bnpl/mart_bnpl_risk_features.sql
  • models/marts/bnpl/mart_bnpl_customer_analytics.sql
  • scripts/bnpl/feature_pipeline.py

Technical Architecture

Infrastructure Stack

  • Orchestration: Apache Airflow (containerized)
  • Data Warehouse: Google BigQuery (partitioned, clustered)
  • Transformations: dbt (with proper testing and documentation)
  • Data Quality: Great Expectations (integrated with pipeline)
  • API Integration: Python with production-grade error handling

Data Volumes and Performance

  • Historical Dataset: 1.8M transactions across 365 days
  • Daily Processing: 3k-10k transactions (realistic business patterns)
  • Storage Strategy: Date-partitioned tables with customer_id clustering (see the sketch after this list)
  • Processing Pattern: ELT with BigQuery compute optimization
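
The storage strategy above maps to table-level settings like these (a google-cloud-bigquery sketch; the table and column names are assumed):

from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    f"{client.project}.flit_bnpl_raw.bnpl_transactions",  # assumed table name
    schema=[
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
        bigquery.SchemaField("transaction_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="transaction_date")
table.clustering_fields = ["customer_id"]
client.create_table(table, exists_ok=True)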

Success Metrics

  • Complete historical dataset ingested with proper entity separation
  • Airflow pipeline processing daily patterns with <5% failure rate
  • Data quality gates preventing bad data propagation
  • dbt pipeline generating clean, tested analytics tables
  • ML-ready feature sets available for model development
  • End-to-end pipeline documentation and runbooks

Dependencies and Prerequisites

  • BigQuery project permissions and dataset creation rights
  • Docker environment for local Airflow development
  • simtom API access and rate limit understanding
  • Python environment with data engineering dependencies

Risk Mitigation

  • API Rate Limits: Implement exponential backoff and respectful request patterns
  • Data Volume: Monitor BigQuery costs and implement proper partitioning
  • Schema Evolution: Design flexible JSON parsing with schema validation
  • Pipeline Failures: Comprehensive error handling and recovery mechanisms

Future Considerations

  • Real-time streaming pipeline (separate from this historical ingestion)
  • Production ML model serving infrastructure
  • Advanced analytics and reporting layer
  • Data lineage and governance framework
