This project focuses on predicting the monetary value of user sessions in an e-commerce platform using machine learning techniques. The goal is to analyze user behavior patterns and build a robust predictive model that can estimate session values based on user interactions.
Predict the session_value for e-commerce user sessions using behavioral event data including views, cart additions, and purchases.
- Event: Datathon 2025
- Task Type: Regression (Session Value Prediction)
- Evaluation Metric: MSE (Mean Squared Error) and RMSE
- Data Format: Time-series user event data
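The two evaluation metrics differ only by a square root. A minimal sketch, assuming targets and predictions are plain NumPy-compatible arrays (the function names `mse` and `rmse` are illustrative and not taken from the notebook):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(MSE), expressed in the target's own units."""
    return float(np.sqrt(mse(y_true, y_pred)))
```

Because RMSE is in the same units as session_value, it is usually the easier of the two to interpret.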
- User Events: Individual user interactions (views, cart additions, purchases)
- Session Information: User session identifiers and timestamps
- Product Details: Product and category identifiers
- Target Variable: session_value (monetary value of each session)
- Same structure as training data
- Missing session_value (target for prediction)
- Used for final model evaluation and submission
- user_id: Unique user identifier
- user_session: Unique session identifier
- event_type: Type of user action (VIEW, ADD_CART, BUY)
- product_id: Unique product identifier
- category_id: Product category identifier
- event_time: Timestamp of the event
- session_value: Target variable (monetary value)
- Data Quality Assessment: Missing value analysis and data integrity checks
- Target Distribution Analysis: Understanding session value patterns and skewness
- Feature Distribution: Analyzing event types, product categories, and user behaviors
- Temporal Patterns: Event sequence analysis and time-based insights
- Cross-validation: Train/test distribution comparison for model generalization
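The checks listed above could be sketched roughly as follows. This is an illustration only: the helper name `quick_eda` is hypothetical, and it assumes session_value is repeated on every event row of a session, so the skewness is computed on one row per session.

```python
import pandas as pd

def quick_eda(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarise data quality, target skew, and train/test similarity."""
    # Assumption: session_value repeats per event row, so keep one row per session
    per_session = train.drop_duplicates("user_session")
    return {
        # Fraction of missing values in each training column
        "missing_frac": train.isna().mean().to_dict(),
        # Positive skewness means a long right tail of high-value sessions
        "target_skew": float(per_session["session_value"].skew()),
        # Share of each event type, to compare train against test
        "train_event_share": train["event_type"].value_counts(normalize=True).to_dict(),
        "test_event_share": test["event_type"].value_counts(normalize=True).to_dict(),
        # Fraction of test rows whose user also appears in train
        "test_users_seen_in_train": float(test["user_id"].isin(train["user_id"]).mean()),
    }
```

Large gaps between the train and test event shares, or a low share of known users, would suggest that validation scores may not transfer to the test set.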
- Temporal Features: Hour, day of week, month, weekend indicators
- Cyclical Encoding: Sine/cosine transformations for time periodicity
- Session Aggregations: Event counts, unique products/categories per session
- Behavioral Metrics: Event type ratios and conversion indicators
- User Patterns: Historical user behavior and engagement metrics
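As a rough illustration of the temporal, cyclical, and session-level features listed above, here is a minimal pandas sketch. Column names follow the dataset schema; the function names and output feature names are assumptions for illustration and may differ from the notebook. Historical user features are not covered here.

```python
import numpy as np
import pandas as pd

def add_time_features(events: pd.DataFrame) -> pd.DataFrame:
    """Add hour/day/month features plus sine/cosine encodings of hour and weekday."""
    out = events.copy()
    ts = pd.to_datetime(out["event_time"])
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month
    out["is_weekend"] = (out["dayofweek"] >= 5).astype(int)
    # Cyclical encoding keeps 23:00 and 00:00 close together in feature space
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    out["dow_sin"] = np.sin(2 * np.pi * out["dayofweek"] / 7)
    out["dow_cos"] = np.cos(2 * np.pi * out["dayofweek"] / 7)
    return out

def session_aggregates(events: pd.DataFrame) -> pd.DataFrame:
    """One row per session: event counts, unique items, and event-type ratios."""
    agg = events.groupby("user_session").agg(
        n_events=("event_type", "size"),
        n_products=("product_id", "nunique"),
        n_categories=("category_id", "nunique"),
    )
    # Count each event type per session, then convert counts to ratios
    counts = pd.crosstab(events["user_session"], events["event_type"])
    for event_type in ("VIEW", "ADD_CART", "BUY"):
        col = counts[event_type] if event_type in counts else 0
        agg[f"{event_type.lower()}_ratio"] = col / agg["n_events"]
    return agg.reset_index()
```

The sine/cosine pair is used instead of the raw hour so that a tree or linear model does not treat midnight and 23:00 as maximally distant.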
- Model Selection: CatBoost Regressor (optimized for categorical features)
- Data Splitting: Time-based validation (80/20 split)
- Target Transformation: Log1p transformation to handle skewed distribution
- Hyperparameter Optimization: Optuna-based automated tuning
- Validation Strategy: Early stopping and cross-validation
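The time-based 80/20 split and the log1p target transformation might look like the sketch below. It assumes a session-level table with a timestamp column; the column name `session_start` and the function names are hypothetical.

```python
import numpy as np
import pandas as pd

def time_based_split(sessions: pd.DataFrame, time_col: str = "session_start",
                     train_frac: float = 0.8):
    """Split chronologically: the earliest train_frac of rows form the training set."""
    ordered = sessions.sort_values(time_col).reset_index(drop=True)
    cutoff = int(len(ordered) * train_frac)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

def to_log_target(y):
    """log1p compresses the long right tail of a skewed, non-negative target."""
    return np.log1p(np.asarray(y, dtype=float))

def from_log_target(y_log):
    """Invert log1p so predictions return to monetary units."""
    return np.expm1(np.asarray(y_log, dtype=float))
```

Validating on the most recent 20% mimics predicting future sessions, which a random split would not. Predictions made on the log scale need `from_log_target` applied before scoring or submission.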
- Automated Hyperparameter Tuning: 20 trials with Optuna optimization
- Parameters Optimized: iterations, learning_rate, depth, regularization
- Categorical Handling: Native CatBoost categorical feature processing
- Overfitting Prevention: Early stopping and validation monitoring
datahon25/
├── README.md                   # Project documentation
├── notebook9db22db4c8.ipynb    # Main analysis notebook
├── train (1).csv               # Training dataset
├── test (1).csv                # Test dataset
├── catboost_submission.csv     # Final predictions
└── 520.csv                     # CSV that scored 523 on the private leaderboard
# Required Python packages
pip install numpy pandas matplotlib seaborn missingno scikit-learn catboost optuna
- Clone the repository
git clone https://github.com/YsK-dev/datahon25.git
cd datahon25
- Open the Jupyter Notebook
jupyter notebook notebook9db22db4c8.ipynb
- Run the complete pipeline
- Execute cells sequentially for full analysis
- EDA section provides comprehensive data insights
- ML pipeline generates final predictions
- Package Installation: Required dependencies
- Data Loading & Exploration: Initial data analysis
- Target Variable Analysis: Session value distribution
- Event Type Analysis: User behavior patterns
- Product & Category Analysis: Market insights
- User Behavior Analysis: Engagement patterns
- Temporal Analysis: Time-based patterns
- Feature Engineering: Advanced feature creation
- Machine Learning Pipeline: Model training and optimization
- Results & Submission: Final predictions and evaluation
- Algorithm: CatBoost Regressor with Optuna optimization
- Validation Strategy: Time-based split (chronological)
- Target Transformation: Log1p (handles skewed distribution)
- Feature Count: 15+ engineered features
- Categorical Features: Native CatBoost handling
- Comprehensive EDA with actionable insights
- Robust feature engineering pipeline
- Automated hyperparameter optimization
- Time-aware validation strategy
- Production-ready prediction pipeline
- catboost_submission.csv: Final predictions for test data
- Session-level predictions with quality validation
- Duplicate detection and data integrity checks
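A submission step with these integrity checks might be sketched as below. The helper name `build_submission` and the two output column names are assumptions, as is clipping predictions at zero on the grounds that a monetary session value should not be negative.

```python
import numpy as np
import pandas as pd

def build_submission(session_ids, log_predictions) -> pd.DataFrame:
    """Return one non-negative prediction per session, failing on duplicates."""
    # Undo the log1p target transform; clip at zero (assumption: values are non-negative)
    values = np.clip(np.expm1(np.asarray(log_predictions, dtype=float)), 0, None)
    submission = pd.DataFrame({
        "user_session": list(session_ids),
        "session_value": values,
    })
    if submission["user_session"].duplicated().any():
        raise ValueError("duplicate user_session rows in submission")
    if submission["session_value"].isna().any():
        raise ValueError("missing predictions in submission")
    return submission
```

The result could then be written with `submission.to_csv("catboost_submission.csv", index=False)`.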
- Event Distribution: Majority of events are product views
- Session Behavior: Longer sessions tend to have higher values
- Temporal Patterns: Clear time-of-day and day-of-week effects
- User Segments: Distinct high-value and casual user groups
- Session event counts and engagement metrics
- Event type ratios (especially purchase/cart ratios)
- Temporal features (hour, day patterns)
- Product category preferences
- User historical behavior patterns
- Base Model: CatBoost Regressor
- Categorical Features: 5 native categorical features
- Numerical Features: 10+ engineered numerical features
- Target Processing: Log-normal distribution handling
- Hyperparameter Search: Bayesian optimization with Optuna
- Search Space: 7 key parameters optimized
- Evaluation Metric: RMSE on validation set
- Early Stopping: Prevents overfitting
- Missing Value Handling: Comprehensive imputation strategy
- Categorical Encoding: Native CatBoost processing
- Feature Scaling: Log transformation for target variable
- Temporal Features: Cyclical encoding for time variables
- pandas (1.5+): Data manipulation and analysis
- numpy (1.21+): Numerical computations
- scikit-learn (1.1+): Machine learning utilities
- catboost (1.0+): Gradient boosting framework
- matplotlib (3.5+): Basic plotting
- seaborn (0.11+): Statistical visualizations
- missingno: Missing data visualization
- optuna (3.0+): Hyperparameter optimization
- Fork the repository
- Create a feature branch (git checkout -b feature/improvement)
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/improvement)
- Create a Pull Request
This project is part of the Datathon 2025 competition. Please respect the competition rules and academic integrity guidelines.
YsK-dev
- GitHub: @YsK-dev
- Project: Datathon 2025 - E-commerce Session Value Prediction
Last Updated: September 2025
Competition: Datathon 2025