A comprehensive machine learning pipeline for predicting student success outcomes at Bishop State Community College.
This project implements six machine learning models to predict various aspects of student success (authoritative names and data flows: codebenders-dashboard/content/ai-transparency.ts):
- Retention Prediction - Will the student be retained?
- Time-to-Credential - How long until credential completion?
- Credential Type - What credential will they earn?
- Gateway Math Success - Will the student succeed in gateway math?
- Gateway English Success - Will the student succeed in gateway English?
- First-Semester Low-GPA Prediction - Is the student at risk of a low first-semester GPA?
The models use demographic, academic preparation, enrollment, and course performance data to generate actionable predictions for student support services.
The Next.js dashboard adds natural language query (NLQ) features: three OpenAI gpt-4o-mini API routes (codebenders-dashboard/app/api/analyze/route.ts, codebenders-dashboard/app/api/query-summary/route.ts, codebenders-dashboard/app/api/courses/explain-pairing/route.ts), a rule-based fallback in codebenders-dashboard/lib/prompt-analyzer.ts, and (when not using direct database mode) an external data API at schools.syntex-ai.com. See the same ai-transparency.ts file for the full inventory.
```
codebenders-datathon/
├── ai_model/                          # Machine learning models and scripts
│   ├── __init__.py                    # Package initialization
│   ├── complete_ml_pipeline.py        # Main ML pipeline (6 models)
│   ├── generate_bishop_state_data.py  # Synthetic data generation
│   └── merge_bishop_state_data.py     # Data merging script
│
├── data/                              # Data files (CSV and Excel)
│   ├── ar_bscc_with_zip.csv                             # AR data with zip codes
│   ├── bishop_state_cohorts_with_zip.csv                # Student cohort data
│   ├── bishop_state_courses.csv                         # Course enrollment data
│   ├── bishop_state_student_level_with_zip.csv          # Student-level aggregated data
│   ├── bishop_state_student_level_with_predictions.csv  # Student-level with predictions
│   ├── bishop_state_merged_with_predictions.csv         # Course-level with predictions
│   └── De-identified PDP AR Files.xlsx                  # Original Excel data
│
├── codebenders-dashboard/             # Next.js web application
├── operations/                        # Database utilities and configuration
├── DATA_DICTIONARY.md                 # Detailed data field descriptions
├── ML_MODELS_GUIDE.md                 # Machine learning models guide
├── requirements.txt                   # Python dependencies
├── LICENSE                            # MIT License
└── README.md                          # This file
```
- Retention Risk Assessment: Retention probability and risk categories; dashboard alert views (URGENT / HIGH / MODERATE / LOW) are driven by these signals
- Graduation Timeline: Predict time to credential completion
- Credential Path: Forecast credential type (Certificate, Associate's, Bachelor's)
- Gateway Success: Predict gateway math and English completion outcomes
- Early Academic Risk: First-semester low-GPA risk prediction
- XGBoost & Random Forest: State-of-the-art ensemble methods
- Feature Engineering: 40+ engineered features from raw data
- Comprehensive Evaluation: Multiple metrics for each model
- Production-Ready: Generates predictions for all students
- Detailed Reporting: Automated summary reports with model performance
- Python 3.8 or higher
- pip package manager
- Postgres database access via Supabase (for saving predictions)
1. Clone the repository:

   ```bash
   git clone https://github.com/devcolor/codebenders-datathon.git
   cd codebenders-datathon
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure the database (optional; the pipeline falls back to CSV if not configured). Copy `codebenders-dashboard/env.example` to `.env` and update:

   ```
   DB_HOST=127.0.0.1
   DB_USER=postgres
   DB_PASSWORD=postgres
   DB_PORT=54332
   DB_NAME=postgres
   DB_SSL=false
   ```

5. Start local Supabase (for local development):

   ```bash
   supabase start
   ```

6. Test the database connection:

   ```bash
   python -m operations.test_db_connection
   ```

7. Verify data files: ensure all required CSV files are in the `data/` folder.
Run the complete ML pipeline:
```bash
cd ai_model
python complete_ml_pipeline.py
```

This will:
- Test database connection
- Load and preprocess data
- Train all 6 models
- Generate predictions for all students
- Save results to Postgres database (or CSV files as fallback)
- Save model performance metrics to database
- Create a summary report
If you need to re-merge the source data files:
```bash
cd ai_model
python merge_bishop_state_data.py
```

Approximate runtimes:

- Data loading: ~30 seconds
- Model training: ~5-10 minutes
- Prediction generation: ~1 minute
- Total: ~10-15 minutes
The pipeline uses an efficient batch upload system to save predictions to Postgres:
Features:
- Automatic batching: Large datasets are split into manageable chunks (1,000 records per batch)
- Progress tracking: Real-time progress updates during upload
- Connection pooling: SQLAlchemy engine with connection pooling for reliability
- Error handling: Automatic fallback to CSV if database connection fails
- Verification: Automatic record count verification after upload
Example Output:
```
Saving 99,559 records to table 'course_predictions'...
✓ Successfully saved to 'course_predictions'
  - Records: 99,559
  - Columns: 45
  - Verified: 99,559 records in database
```
Configuration:
- Default batch size: 1,000 records per chunk
- Adjustable via the `chunksize` parameter in `save_dataframe_to_db()`
- Located in `operations/db_utils.py`
Tables Created:
- `student_predictions`: Student-level predictions (~4,000 records)
- `course_predictions`: Course-level predictions (~99,559 records)
- `ml_model_performance`: Model metrics and training history
For more details, see operations/README.md.
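The batching-with-fallback behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual helper: the real implementation lives in `operations/db_utils.py` and its signature may differ.

```python
import pandas as pd
from sqlalchemy import create_engine, text

def save_dataframe_to_db(df: pd.DataFrame, table: str, db_url: str,
                         csv_fallback: str, chunksize: int = 1000) -> None:
    """Upload a DataFrame in batches; fall back to CSV on failure.

    Illustrative sketch of the pipeline's save step: pandas splits the
    upload into `chunksize`-row batches, and a count query verifies the
    result. Any database error triggers the CSV fallback path.
    """
    try:
        engine = create_engine(db_url, pool_pre_ping=True)
        df.to_sql(table, engine, if_exists="replace",
                  index=False, chunksize=chunksize, method="multi")
        # Verify the record count after upload (table name is trusted here).
        with engine.connect() as conn:
            count = conn.execute(text(f"SELECT COUNT(*) FROM {table}")).scalar()
        print(f"Verified: {count} records in database")
    except Exception as exc:
        print(f"Database unavailable ({exc}); falling back to CSV")
        df.to_csv(csv_fallback, index=False)
```

With an unreachable database URL, the call simply writes the CSV fallback instead of raising, which matches the pipeline's "CSV if not configured" behavior.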
Authoritative descriptions (inputs, algorithms, data flow): codebenders-dashboard/content/ai-transparency.ts. The summaries below match that inventory.
Algorithm: XGBoost classifier (model family selected in ai_model/complete_ml_pipeline.py)
Target: Binary (Retained / Not Retained)
Features: Demographic, enrollment, year-one performance, and program signals
Output (examples):
- Retention probability and binary prediction
- Retention risk category (Critical / High / Moderate / Low)
- Dashboard risk alerts combine retention and related metrics
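The mapping from retention probability to risk category might look like the sketch below. The cutoff values here are purely illustrative; the pipeline's actual thresholds live in `ai_model/complete_ml_pipeline.py`.

```python
def retention_risk_category(p_retained: float) -> str:
    """Bucket a retention probability into the four risk categories.

    Thresholds (0.25 / 0.50 / 0.75) are hypothetical examples, not the
    values used by the real pipeline.
    """
    if p_retained < 0.25:
        return "Critical"
    if p_retained < 0.50:
        return "High"
    if p_retained < 0.75:
        return "Moderate"
    return "Low"
```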
Algorithm: Random Forest regressor
Target: Continuous (years to credential)
Output:
- `predicted_time_to_credential`: Years to completion
- `predicted_graduation_year`: Expected graduation year
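A minimal sketch of this regression step on synthetic data (the real model trains on 40+ engineered features; the feature matrix, target, and cohort year below are all invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and a fake "years to credential" target.
rng = np.random.default_rng(42)
X = rng.random((400, 5))
y = 1.5 + 2.0 * X[:, 0] + rng.normal(0, 0.1, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, max_depth=6,
                              n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

predicted_time = model.predict(X_test)
# Derive an expected graduation year from a hypothetical cohort start year.
predicted_grad_year = 2025 + np.round(predicted_time).astype(int)
```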
Algorithm: Random Forest multi-class classifier
Target: Multi-class (No Credential / Certificate / Associate's / Bachelor's)
Output:
- `predicted_credential_type`: Numeric code (0-3)
- `predicted_credential_label`: Text label
- `prob_no_credential`, `prob_certificate`, `prob_associate`, `prob_bachelor`: Class probabilities
Algorithm: XGBoost classifier
Target: Binary (success in gateway math)
Output: Probability and prediction fields written to student_predictions (see data dictionary and pipeline outputs).
Algorithm: XGBoost classifier
Target: Binary (success in gateway English)
Output: Probability and prediction fields written to student_predictions.
Algorithm: XGBoost classifier
Target: Binary (low first-semester GPA risk)
Output: Probability and prediction fields written to student_predictions.
| File | Description | Records |
|---|---|---|
| `ar_bscc_with_zip.csv` | AR data with zip codes | ~4K |
| `bishop_state_cohorts_with_zip.csv` | Student cohort information | ~4K |
| `bishop_state_courses.csv` | Course enrollment records | ~100K |
| `bishop_state_student_level_with_zip.csv` | Aggregated student-level data | ~4K |
- Demographics: Age, race, ethnicity, gender, first-generation status
- Academic Preparation: Math/English/Reading placement levels
- Enrollment: Type, intensity, attendance status, cohort term
- Course Performance: Credits, grades, completion rates, gateway courses
- Financial: Pell grant status
- Geographic: Zip code information
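Raw columns in these categories are combined into engineered model features. A hypothetical example with pandas (column names are invented; the real fields are documented in DATA_DICTIONARY.md):

```python
import pandas as pd

# Hypothetical raw student-level columns.
students = pd.DataFrame({
    "credits_attempted_y1": [24, 12, 30],
    "credits_earned_y1":    [24,  6, 27],
    "gpa_term1":            [3.2, 1.8, 2.9],
    "pell_recipient":       ["Y", "N", "Y"],
})

# Typical engineered features: a completion ratio and binary flags.
students["completion_rate_y1"] = (
    students["credits_earned_y1"] / students["credits_attempted_y1"]
)
students["is_pell"] = (students["pell_recipient"] == "Y").astype(int)
students["low_gpa_flag"] = (students["gpa_term1"] < 2.0).astype(int)
```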
Predictions are saved to Postgres (Supabase):
- `student_predictions` (table): Student-level data with all predictions; one row per student (~4,000 records)
- `course_predictions` (table): Course-level data with predictions; one row per course enrollment (~99,559 records)
- `ml_model_performance` (table): Model performance metrics for each training run
If database connection fails, predictions are saved to CSV:
- `bishop_state_student_level_with_predictions.csv`
- `bishop_state_merged_with_predictions.csv`
- `ML_PIPELINE_REPORT.txt`
- codebenders-dashboard/content/ai-transparency.ts: Authoritative inventory of ML models, OpenAI/NLQ routes, rule-based fallback, and external data API surfaces
- DATA_DICTIONARY.md: Detailed descriptions of all data fields
- ML_MODELS_GUIDE.md: In-depth guide to machine learning models
- DOCKER_SETUP.md: Docker Compose setup for local Postgres
- Model Code: Extensively commented Python scripts in `ai_model/`
Edit `complete_ml_pipeline.py` to adjust:

- XGBoost parameters: `n_estimators`, `max_depth`, `learning_rate`
- Random Forest parameters: `n_estimators`, `max_depth`, `n_jobs`
- Train-test split: `test_size`, `random_state`
- Risk thresholds: alert levels in `assign_alert_level()`
This project was developed for the Bishop State Datathon. Contributions are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
CodeBenders Team, Bishop State Datathon 2025
- Bishop State Community College
- Datathon organizers and mentors
- Open-source ML community (scikit-learn, XGBoost, pandas)
For questions or support, please open an issue on GitHub or contact the team.
Built with ❤️ for student success