Epic: BNPL Data Engineering Pipeline
Business Objective
Build production-grade data infrastructure for BNPL transaction analysis and ML model development. Ingest a comprehensive historical dataset and establish a scalable processing pipeline.
Architecture Overview
- Data Flow: simtom API → BigQuery (raw) → dbt (transform) → Analytics/ML tables
- Orchestration: Airflow for daily processing patterns
- Quality: Great Expectations for data validation
- Pattern: ELT leveraging BigQuery compute with dbt transformations
Repository Structure
flit-data-platform/
├── scripts/bnpl/ # Data ingestion and utilities
├── airflow/ # Orchestration infrastructure
├── great_expectations/ # Data quality framework
├── models/staging/bnpl/ # dbt staging models
├── models/intermediate/bnpl/ # dbt intermediate models
└── models/marts/bnpl/ # dbt analytics models
Implementation Roadmap
PR 1: Infrastructure Foundation
Branch: feat/bnpl-infrastructure
Description: Core infrastructure setup and project foundation
Scope:
- Directory structure establishment
- BigQuery dataset provisioning (flit_bnpl_raw, flit_bnpl_intermediate, flit_bnpl_marts)
- Docker Compose configuration for local Airflow development
- Base API client framework for simtom integration
- Project documentation and architectural decisions
- Dependency management (extend root requirements.txt)
Files:
- scripts/bnpl/__init__.py
- scripts/bnpl/api_client.py (interface definition; see sketch below)
- airflow/docker-compose.yml
- docs/bnpl_pipeline_architecture.md
- requirements.txt (updated with new dependencies)
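As a rough sketch of what the api_client.py interface could look like at this stage, something like the following would define the contract that PR 2 later implements. The class and method names here are illustrative assumptions, not the final API:

```python
# scripts/bnpl/api_client.py -- interface only; the full implementation lands in PR 2.
# Class and method names below are illustrative assumptions.
from abc import ABC, abstractmethod
from datetime import date
from typing import Any


class BNPLAPIClient(ABC):
    """Contract for pulling simulated BNPL batches from the simtom API."""

    @abstractmethod
    def fetch_batch(self, batch_date: date, total_records: int) -> list[dict[str, Any]]:
        """Return raw transaction records for a single calendar day."""
        ...

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the simtom endpoint is reachable."""
        ...
```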
PR 2: Historical Data Ingestion Engine
Branch: feat/bnpl-ingestion
Description: Comprehensive historical data acquisition with realistic transaction patterns
Scope:
- Production-grade API client with exponential backoff and circuit breaker
- Historical data ingestion orchestrator
- BigQuery staging infrastructure with date partitioning
- JSON schema extraction and normalization (transaction/customer separation)
- Comprehensive error handling and observability
- Data volume and pattern validation
Key Decisions:
- Dataset Size: 1.8M transactions (realistic volume for ML training)
- Time Distribution: Leverage simtom's realistic holiday/weekend patterns
- Payload Strategy: Dynamic total_records based on day type (weekdays: 5-7k, weekends: 3-4k, holidays: 8-10k)
- Scope: Historical ingestion only; the production streaming pipeline is addressed separately
Files:
- scripts/bnpl/ingest_historical_bnpl.py
- scripts/bnpl/api_client.py (complete implementation; retry sketch below)
- scripts/bnpl/data_patterns.py (holiday/weekend logic)
- scripts/bnpl/schema_utils.py (JSON normalization)
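A minimal sketch of the retry behaviour the complete client is expected to have. The retry counts, failure threshold, and payload/response handling are assumptions for illustration only; only the /stream/bnpl path comes from this epic:

```python
# Sketch of exponential backoff with a simple circuit breaker.
# Thresholds and payload shape are illustrative assumptions.
import time

import requests


class CircuitOpenError(RuntimeError):
    """Raised when repeated failures have tripped the circuit breaker."""


class SimtomClient:
    def __init__(self, base_url: str, max_retries: int = 5, failure_threshold: int = 3):
        self.base_url = base_url
        self.max_retries = max_retries
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def fetch_batch(self, payload: dict) -> dict:
        if self._consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("Too many consecutive failures; pausing ingestion.")

        for attempt in range(self.max_retries):
            try:
                response = requests.post(
                    f"{self.base_url}/stream/bnpl", json=payload, timeout=30
                )
                response.raise_for_status()
                self._consecutive_failures = 0
                return response.json()
            except requests.RequestException:
                # Exponential backoff: 1s, 2s, 4s, 8s, ...
                time.sleep(2 ** attempt)

        self._consecutive_failures += 1
        raise RuntimeError(f"Batch failed after {self.max_retries} attempts.")
```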
API Integration:
- URL: https://simtom-production.up.railway.app/stream/bnpl
- Dynamic payload generation based on date characteristics (see sketch below)
- Automatic schema extraction for transaction/customer entity separation
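The day-type sizing described above could be implemented roughly as follows. The ranges simply mirror the targets listed under Key Decisions; the holiday calendar and the payload keys are placeholders:

```python
# Sketch of data_patterns.py-style payload sizing; holiday list and payload keys are placeholders.
import random
from datetime import date

HOLIDAYS = {date(2024, 11, 29), date(2024, 12, 24), date(2024, 12, 25)}  # placeholder calendar


def total_records_for(day: date) -> int:
    """Pick a record count matching the volume targets in this epic."""
    if day in HOLIDAYS:
        return random.randint(8_000, 10_000)   # holiday spike
    if day.weekday() >= 5:                      # Saturday/Sunday
        return random.randint(3_000, 4_000)     # weekend dip
    return random.randint(5_000, 7_000)         # normal weekday


def build_payload(day: date) -> dict:
    return {"date": day.isoformat(), "total_records": total_records_for(day)}
```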
PR 3: dbt Transformation Pipeline
Branch: feat/bnpl-dbt-models
Description: Scalable data transformation pipeline with proper entity modeling
Scope:
- Staging models (JSON → normalized relational structures)
- Intermediate models (business logic, data enrichment)
- Mart models (analytics-optimized tables)
- Comprehensive dbt testing framework
- Model lineage documentation
- Performance optimization (clustering, partitioning)
Entity Separation Strategy:
- Extract customer profiles from transaction JSON in staging layer
- Deduplicate and enrich customer data in intermediate layer
- Create separate transaction/customer marts with proper relationships
dbt Models:
- models/staging/bnpl/stg_bnpl_raw_transactions.sql
- models/staging/bnpl/stg_bnpl_extracted_customers.sql
- models/intermediate/bnpl/int_bnpl_transactions_enriched.sql
- models/intermediate/bnpl/int_bnpl_customer_profiles.sql
- models/marts/bnpl/mart_bnpl_transactions.sql
- models/marts/bnpl/mart_bnpl_customers.sql
- models/marts/bnpl/mart_bnpl_ml_features.sql
PR 4: Data Quality Framework
Branch: feat/bnpl-data-quality
Description: Comprehensive data validation and monitoring infrastructure
Scope:
- Great Expectations suite configuration
- Business rule validation (transaction amounts, customer behavior)
- Statistical pattern validation (holiday spikes, weekend dips)
- Data freshness and completeness monitoring
- Quality gate integration with dbt
- Alerting framework for quality failures
Files:
- great_expectations/great_expectations.yml
- great_expectations/expectations/bnpl_business_rules.json
- great_expectations/expectations/bnpl_statistical_patterns.json
- great_expectations/checkpoints/daily_quality_gate.yml
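The real suite lives in the JSON/YAML configuration above. Purely as an illustration of the kinds of business-rule and statistical-pattern checks it should encode, here is the logic sketched in plain pandas rather than Great Expectations syntax; column names and thresholds are assumptions:

```python
# Plain-pandas illustration of rules the Great Expectations suite should encode.
# Column names and thresholds are assumptions for this sketch.
import pandas as pd


def check_business_rules(txns: pd.DataFrame) -> list[str]:
    failures = []
    if txns["transaction_id"].isna().any():
        failures.append("transaction_id contains nulls")
    if (txns["amount"] <= 0).any():
        failures.append("non-positive transaction amounts found")
    return failures


def check_weekend_dip(daily_counts: pd.Series) -> list[str]:
    """daily_counts: transaction counts indexed by a DatetimeIndex."""
    weekday_avg = daily_counts[daily_counts.index.dayofweek < 5].mean()
    weekend_avg = daily_counts[daily_counts.index.dayofweek >= 5].mean()
    # Expect weekends to run noticeably below weekdays (3-4k vs 5-7k).
    if weekend_avg >= weekday_avg:
        return ["weekend volumes do not dip below weekday volumes"]
    return []
```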
PR 5: Airflow Orchestration Pipeline
Branch: feat/bnpl-airflow-pipeline
Description: Production-grade workflow orchestration
Scope:
- Daily processing DAG with proper dependency management
- Task-level error handling and retry strategies
- dbt integration with proper artifact management
- Great Expectations checkpoint integration
- Historical backfill simulation capabilities
- Monitoring and alerting integration
Files:
- airflow/dags/daily_bnpl_processing.py
- airflow/plugins/bnpl_operators.py
- airflow/utils/bnpl_utils.py
DAG Architecture:
extract_daily_batch >> validate_raw_schema >>
normalize_entities >> run_dbt_pipeline >>
validate_business_rules >> publish_quality_metrics
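A trimmed sketch of how daily_bnpl_processing.py could wire up part of this chain in Airflow 2.x. Operator choices, schedule, commands, and the assumed ingest_for_date entry point are illustrative; the real DAG covers all six tasks plus alerting:

```python
# Sketch of airflow/dags/daily_bnpl_processing.py (Airflow 2.x style).
# Callables, schedule, commands, and retry settings are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from scripts.bnpl.ingest_historical_bnpl import ingest_for_date  # assumed entry point

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_bnpl_processing",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_daily_batch = PythonOperator(
        task_id="extract_daily_batch",
        python_callable=ingest_for_date,
        op_kwargs={"batch_date": "{{ ds }}"},  # pass the logical date to the ingester
    )
    run_dbt_pipeline = BashOperator(
        task_id="run_dbt_pipeline",
        bash_command="dbt build --select tag:bnpl",  # assumes bnpl models are tagged
    )
    validate_business_rules = BashOperator(
        task_id="validate_business_rules",
        bash_command="great_expectations checkpoint run daily_quality_gate",
    )

    extract_daily_batch >> run_dbt_pipeline >> validate_business_rules
```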
PR 6: ML Feature Engineering
Branch: feat/bnpl-ml-features
Description: Advanced analytics and ML preparation infrastructure
Scope:
- Time-series feature engineering (rolling aggregations, trends)
- Customer behavioral scoring (payment patterns, risk indicators)
- Transaction categorization and enrichment
- Feature store preparation
- Model training dataset generation
- Analytics dashboard preparation
Files:
- models/marts/bnpl/mart_bnpl_ml_training_set.sql
- models/marts/bnpl/mart_bnpl_risk_features.sql
- models/marts/bnpl/mart_bnpl_customer_analytics.sql
- scripts/bnpl/feature_pipeline.py
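For the time-series features, a hedged sketch of the rolling-aggregation approach feature_pipeline.py could take; column names are assumptions, and equivalent logic can live in the mart SQL instead:

```python
# Sketch of 30-day rolling customer features; column names are assumptions.
import pandas as pd


def rolling_customer_features(txns: pd.DataFrame) -> pd.DataFrame:
    """txns needs customer_id, transaction_date (datetime64), and amount columns."""
    txns = txns.sort_values(["customer_id", "transaction_date"])
    rolled = (
        txns.set_index("transaction_date")
        .groupby("customer_id")["amount"]
        .rolling("30D")
        .agg(["count", "sum", "mean"])
        .rename(columns={"count": "txn_count_30d", "sum": "spend_30d", "mean": "avg_ticket_30d"})
        .reset_index()
    )
    return rolled
```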
Technical Architecture
Infrastructure Stack
- Orchestration: Apache Airflow (containerized)
- Data Warehouse: Google BigQuery (partitioned, clustered)
- Transformations: dbt (with proper testing and documentation)
- Data Quality: Great Expectations (integrated with pipeline)
- API Integration: Python with production-grade error handling
Data Volumes and Performance
- Historical Dataset: 1.8M transactions across 365 days
- Daily Processing: 3k-10k transactions (realistic business patterns)
- Storage Strategy: Date-partitioned tables with customer_id clustering
- Processing Pattern: ELT with BigQuery compute optimization
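As a sketch of how the storage strategy could be provisioned with the google-cloud-bigquery client: the project ID, table name, and schema below are placeholders, while the date partitioning and customer_id clustering match the strategy above.

```python
# Sketch of provisioning a date-partitioned, customer_id-clustered raw table.
# Project ID, table name, and schema are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="flit-data-platform")  # placeholder project ID

schema = [
    bigquery.SchemaField("transaction_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("transaction_date", "DATE"),
    bigquery.SchemaField("raw_payload", "STRING"),  # raw JSON kept as a string in this sketch
]

table = bigquery.Table("flit-data-platform.flit_bnpl_raw.raw_transactions", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="transaction_date",
)
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```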
Success Metrics
- Complete historical dataset ingested with proper entity separation
- Airflow pipeline processing daily patterns with <5% failure rate
- Data quality gates preventing bad data propagation
- dbt pipeline generating clean, tested analytics tables
- ML-ready feature sets available for model development
- End-to-end pipeline documentation and runbooks
Dependencies and Prerequisites
- BigQuery project permissions and dataset creation rights
- Docker environment for local Airflow development
- simtom API access and rate limit understanding
- Python environment with data engineering dependencies
Risk Mitigation
- API Rate Limits: Implement exponential backoff and respectful request patterns
- Data Volume: Monitor BigQuery costs and implement proper partitioning
- Schema Evolution: Design flexible JSON parsing with schema validation
- Pipeline Failures: Comprehensive error handling and recovery mechanisms
Future Considerations
- Real-time streaming pipeline (separate from this historical ingestion)
- Production ML model serving infrastructure
- Advanced analytics and reporting layer
- Data lineage and governance framework