YsK-dev/datahon25

🛒 E-commerce Session Value Prediction - Datathon 2025

📋 Project Overview

This project focuses on predicting the monetary value of user sessions in an e-commerce platform using machine learning techniques. The goal is to analyze user behavior patterns and build a robust predictive model that can estimate session values based on user interactions.

🎯 Objective

Predict the session_value for e-commerce user sessions using behavioral event data including views, cart additions, and purchases.

πŸ† Competition Context

  • Event: Datathon 2025
  • Task Type: Regression (Session Value Prediction)
  • Evaluation Metric: MSE (Mean Squared Error) and RMSE
  • Data Format: Time-series user event data

📊 Dataset Description

Training Data (train (1).csv)

  • User Events: Individual user interactions (views, cart additions, purchases)
  • Session Information: User session identifiers and timestamps
  • Product Details: Product and category identifiers
  • Target Variable: session_value - monetary value of each session

Test Data (test (1).csv)

  • Same structure as training data
  • Missing session_value (target for prediction)
  • Used for final model evaluation and submission

Key Features

  • user_id: Unique user identifier
  • user_session: Unique session identifier
  • event_type: Type of user action (VIEW, ADD_CART, BUY)
  • product_id: Unique product identifier
  • category_id: Product category identifier
  • event_time: Timestamp of the event
  • session_value: Target variable (monetary value)
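
The columns listed above could be loaded along these lines. This is a minimal sketch, not the notebook's actual loading code: the inline sample rows, the identifier formats, the `load_events` helper, and the assumption that `session_value` is repeated on every event row of a session are all hypothetical.

```python
import io

import pandas as pd

# Tiny inline sample standing in for "train (1).csv"; all values are made up.
SAMPLE_CSV = """event_time,event_type,product_id,category_id,user_id,user_session,session_value
2025-06-01 10:00:00,VIEW,PROD_1,CAT_1,USER_1,SESSION_1,35.5
2025-06-01 10:02:00,ADD_CART,PROD_1,CAT_1,USER_1,SESSION_1,35.5
2025-06-01 11:15:00,VIEW,PROD_2,CAT_2,USER_2,SESSION_2,12.0
"""

ID_COLUMNS = ["event_type", "product_id", "category_id", "user_id", "user_session"]


def load_events(source):
    """Read event rows, parse timestamps, and keep identifiers as strings."""
    df = pd.read_csv(
        source,
        dtype={col: "string" for col in ID_COLUMNS},
        parse_dates=["event_time"],
    )
    # Chronological order makes later time-based steps easier to reason about.
    return df.sort_values("event_time").reset_index(drop=True)


events = load_events(io.StringIO(SAMPLE_CSV))
print(events.dtypes)
```

Passing a file path such as `"train (1).csv"` instead of the `StringIO` object would read the real data, provided its columns match the names used here.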

🔧 Technical Approach

1. Exploratory Data Analysis (EDA)

  • Data Quality Assessment: Missing value analysis and data integrity checks
  • Target Distribution Analysis: Understanding session value patterns and skewness
  • Feature Distribution: Analyzing event types, product categories, and user behaviors
  • Temporal Patterns: Event sequence analysis and time-based insights
  • Train/Test Comparison: Comparing train and test distributions to check that the model can generalize
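
As an illustration of the train/test distribution comparison, the sketch below puts the relative frequency of each value of a column side by side. The `compare_shares` helper and the toy frames are hypothetical and only show one way such a check could look.

```python
import pandas as pd


def compare_shares(train, test, column):
    """Return the share of each value of `column` in train and test."""
    shares = pd.concat(
        {
            "train": train[column].value_counts(normalize=True),
            "test": test[column].value_counts(normalize=True),
        },
        axis=1,
    ).fillna(0.0)  # a value missing from one set has share 0 there
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


# Toy data: event types only, made up for the example.
train_demo = pd.DataFrame({"event_type": ["VIEW", "VIEW", "BUY", "ADD_CART"]})
test_demo = pd.DataFrame({"event_type": ["VIEW", "VIEW", "VIEW", "BUY"]})

shares = compare_shares(train_demo, test_demo, "event_type")
print(shares)
```

Large values in `abs_diff` would point at columns whose distribution shifts between the two files.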

2. Feature Engineering

  • Temporal Features: Hour, day of week, month, weekend indicators
  • Cyclical Encoding: Sine/cosine transformations for time periodicity
  • Session Aggregations: Event counts, unique products/categories per session
  • Behavioral Metrics: Event type ratios and conversion indicators
  • User Patterns: Historical user behavior and engagement metrics
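
The temporal, cyclical, and session-level features listed above can be sketched as follows. The function names, output column names, and demo rows are assumptions made for illustration; the notebook may compute these differently.

```python
import numpy as np
import pandas as pd


def add_time_features(df):
    """Add calendar columns and sine/cosine encodings of hour and weekday."""
    out = df.copy()
    ts = out["event_time"]
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month
    out["is_weekend"] = (out["dayofweek"] >= 5).astype(int)
    # Cyclical encoding keeps 23:00 close to 00:00 and Sunday close to Monday.
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    out["dow_sin"] = np.sin(2 * np.pi * out["dayofweek"] / 7)
    out["dow_cos"] = np.cos(2 * np.pi * out["dayofweek"] / 7)
    return out


def session_features(df):
    """Collapse event rows into one row per user_session."""
    agg = df.groupby("user_session").agg(
        n_events=("event_type", "size"),
        n_products=("product_id", "nunique"),
        n_categories=("category_id", "nunique"),
        start_time=("event_time", "min"),
    )
    # Share of each event type within the session, e.g. ratio_BUY.
    counts = pd.crosstab(df["user_session"], df["event_type"])
    ratios = counts.div(counts.sum(axis=1), axis=0).add_prefix("ratio_")
    return agg.join(ratios).reset_index()


# Made-up events: one four-event session on a Sunday, one single view on a Monday.
demo = pd.DataFrame({
    "user_session": ["s1", "s1", "s1", "s1", "s2"],
    "event_type": ["VIEW", "VIEW", "ADD_CART", "BUY", "VIEW"],
    "product_id": ["p1", "p2", "p2", "p2", "p3"],
    "category_id": ["c1", "c1", "c1", "c1", "c2"],
    "event_time": pd.to_datetime([
        "2025-06-01 06:00", "2025-06-01 06:01", "2025-06-01 06:03",
        "2025-06-01 06:05", "2025-06-02 18:00",
    ]),
})

timed = add_time_features(demo)
per_session = session_features(demo)
print(per_session)
```

Aggregating to one row per session matches the target, which is defined per session rather than per event.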

3. Machine Learning Pipeline

  • Model Selection: CatBoost Regressor (optimized for categorical features)
  • Data Splitting: Time-based validation (80/20 split)
  • Target Transformation: Log1p transformation to handle skewed distribution
  • Hyperparameter Optimization: Optuna-based automated tuning
  • Validation Strategy: Early stopping and cross-validation
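
A rough sketch of the time-based 80/20 split and the log1p target transformation is shown below. It assumes a session-level table with a `start_time` column (a hypothetical name) and leaves the model out; a regressor trained on the log-scale target would have its predictions mapped back with `expm1` before scoring.

```python
import numpy as np
import pandas as pd


def time_based_split(sessions, time_col="start_time", train_frac=0.8):
    """Put the earliest `train_frac` of sessions in train, the rest in validation."""
    ordered = sessions.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


def rmse_original_scale(y_true, pred_log):
    """RMSE in monetary units for predictions made on the log1p scale."""
    y_true = np.asarray(y_true, dtype=float)
    pred = np.expm1(np.asarray(pred_log, dtype=float))  # inverse of log1p
    return float(np.sqrt(np.mean((y_true - pred) ** 2)))


# Made-up session table, deliberately stored in reverse chronological order.
sessions = pd.DataFrame({
    "user_session": [f"s{i}" for i in range(10)],
    "start_time": pd.date_range("2025-06-01", periods=10, freq="D")[::-1],
    "session_value": [5.0, 0.0, 12.5, 300.0, 7.0, 1.0, 45.0, 9.0, 2.0, 80.0],
})

train, valid = time_based_split(sessions)
y_train_log = np.log1p(train["session_value"])  # compresses the long right tail
y_valid_log = np.log1p(valid["session_value"])
```

Splitting by time rather than at random keeps every validation session later than every training session, which is closer to how the test set is likely to relate to the training data.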

4. Model Optimization

  • Automated Hyperparameter Tuning: 20 trials with Optuna optimization
  • Parameters Optimized: iterations, learning_rate, depth, regularization
  • Categorical Handling: Native CatBoost categorical feature processing
  • Overfitting Prevention: Early stopping and validation monitoring
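
The tuning loop described above might look roughly like this. The search ranges, the mapping of "regularization" to CatBoost's `l2_leaf_reg`, the early-stopping patience of 50 rounds, and the `suggest_params` and `tune` helpers are all assumptions; the README does not state the actual values used.

```python
import numpy as np


def suggest_params(trial):
    """Draw one CatBoost configuration; the ranges are illustrative guesses."""
    return {
        "iterations": trial.suggest_int("iterations", 300, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
    }


def tune(X_train, y_train_log, X_valid, y_valid_log, cat_features, n_trials=20):
    """Run an Optuna study that minimizes validation RMSE on the log scale."""
    # Imported here so the search-space helper above has no heavy dependencies.
    import optuna
    from catboost import CatBoostRegressor

    def objective(trial):
        model = CatBoostRegressor(
            loss_function="RMSE",
            random_seed=42,
            verbose=0,
            **suggest_params(trial),
        )
        model.fit(
            X_train,
            y_train_log,
            cat_features=cat_features,       # handled natively by CatBoost
            eval_set=(X_valid, y_valid_log),
            early_stopping_rounds=50,        # stop when validation stops improving
        )
        pred_log = model.predict(X_valid)
        errors = np.asarray(y_valid_log, dtype=float) - pred_log
        return float(np.sqrt(np.mean(errors ** 2)))

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Calling `tune(...)` with the training and validation frames from a time-based split would return the best parameter dictionary, which can then be used to refit a final model.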

πŸ“ Project Structure

datahon25/
├── README.md                          # Project documentation
├── notebook9db22db4c8.ipynb           # Main analysis notebook
├── train (1).csv                      # Training dataset
├── test (1).csv                       # Test dataset
├── catboost_submission.csv            # Final predictions
└── 520.csv                            # CSV that scored 523 on the private leaderboard

🚀 Getting Started

Prerequisites

# Required Python packages
pip install numpy pandas matplotlib seaborn missingno scikit-learn catboost optuna

Running the Analysis

  1. Clone the repository

    git clone https://github.com/YsK-dev/datahon25.git
    cd datahon25
  2. Open the Jupyter Notebook

    jupyter notebook notebook9db22db4c8.ipynb
  3. Run the complete pipeline

    • Execute cells sequentially for full analysis
    • EDA section provides comprehensive data insights
    • ML pipeline generates final predictions

Key Notebook Sections

  • 📦 Package Installation: Required dependencies
  • 🔍 Data Loading & Exploration: Initial data analysis
  • 🎯 Target Variable Analysis: Session value distribution
  • 🔄 Event Type Analysis: User behavior patterns
  • 🛍️ Product & Category Analysis: Market insights
  • 👥 User Behavior Analysis: Engagement patterns
  • ⏰ Temporal Analysis: Time-based patterns
  • 🔧 Feature Engineering: Advanced feature creation
  • 🤖 Machine Learning Pipeline: Model training and optimization
  • 📊 Results & Submission: Final predictions and evaluation

📈 Results & Performance

Model Performance

  • Algorithm: CatBoost Regressor with Optuna optimization
  • Validation Strategy: Time-based split (chronological)
  • Target Transformation: Log1p (handles skewed distribution)
  • Feature Count: 15+ engineered features
  • Categorical Features: Native CatBoost handling

Key Achievements

  • ✅ Comprehensive EDA with actionable insights
  • ✅ Robust feature engineering pipeline
  • ✅ Automated hyperparameter optimization
  • ✅ Time-aware validation strategy
  • ✅ Production-ready prediction pipeline

Output Files

  • catboost_submission.csv: Final predictions for test data
  • Session-level predictions with quality validation
  • Duplicate detection and data integrity checks
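
The duplicate detection and integrity checks on the submission file could be sketched like this. The submission column names (`user_session`, `session_value`) and the `validate_submission` helper are assumptions; the competition's required format is not stated here.

```python
import pandas as pd


def validate_submission(submission, test_events,
                        id_col="user_session", target_col="session_value"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if submission[id_col].duplicated().any():
        problems.append("duplicate session ids")
    if submission[target_col].isna().any():
        problems.append("missing predictions")
    if (submission[target_col] < 0).any():
        problems.append("negative predictions")
    # Every test session should appear in the submission, and nothing else.
    if set(submission[id_col]) != set(test_events[id_col].unique()):
        problems.append("session ids do not match the test set")
    return problems


# Made-up test events and two candidate submissions.
test_demo = pd.DataFrame({"user_session": ["s1", "s1", "s2", "s3"]})
good = pd.DataFrame({"user_session": ["s1", "s2", "s3"],
                     "session_value": [10.0, 0.0, 42.5]})
bad = pd.DataFrame({"user_session": ["s1", "s1", "s2"],
                    "session_value": [10.0, -1.0, None]})

print(validate_submission(good, test_demo))
print(validate_submission(bad, test_demo))
```

Running such a check before writing `catboost_submission.csv` catches format problems before the file is uploaded.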

πŸ” Key Insights

Data Patterns

  • Event Distribution: Majority of events are product views
  • Session Behavior: Longer sessions tend to have higher values
  • Temporal Patterns: Clear time-of-day and day-of-week effects
  • User Segments: Distinct high-value and casual user groups

Feature Importance

  • Session event counts and engagement metrics
  • Event type ratios (especially purchase/cart ratios)
  • Temporal features (hour, day patterns)
  • Product category preferences
  • User historical behavior patterns

πŸ› οΈ Technical Details

Model Architecture

  • Base Model: CatBoost Regressor
  • Categorical Features: 5 native categorical features
  • Numerical Features: 10+ engineered numerical features
  • Target Processing: Log1p transformation of the skewed, roughly log-normal target

Optimization Strategy

  • Hyperparameter Search: Bayesian optimization with Optuna
  • Search Space: 7 key parameters optimized
  • Evaluation Metric: RMSE on validation set
  • Early Stopping: Prevents overfitting

Data Processing

  • Missing Value Handling: Comprehensive imputation strategy
  • Categorical Encoding: Native CatBoost processing
  • Feature Scaling: Log transformation for target variable
  • Temporal Features: Cyclical encoding for time variables

📚 Dependencies

Core Libraries

  • pandas (1.5+): Data manipulation and analysis
  • numpy (1.21+): Numerical computations
  • scikit-learn (1.1+): Machine learning utilities
  • catboost (1.0+): Gradient boosting framework

Visualization

  • matplotlib (3.5+): Basic plotting
  • seaborn (0.11+): Statistical visualizations
  • missingno: Missing data visualization

Optimization

  • optuna (3.0+): Hyperparameter optimization

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/improvement)
  5. Create a Pull Request

📄 License

This project is part of Datathon 2025 competition. Please respect competition rules and academic integrity guidelines.

πŸ‘¨β€πŸ’» Author

YsK-dev

  • GitHub: @YsK-dev
  • Project: Datathon 2025 - E-commerce Session Value Prediction

Last Updated: September 2025

Competition: Datathon 2025
