Realistic, large-scale practice datasets for development economics — 36 generators, 840,000+ rows.
Built for researchers, students, and practitioners who need real-feeling data modelled on DHS, NFHS, ASER, and other major development survey frameworks.
Full documentation: varnasr.github.io/devdata-practice
DevData Practice generates synthetic datasets that closely mirror the structure, variable distributions, and statistical properties of real development sector surveys. The data is designed for:
- Learning — practice data analysis, MEL, and econometrics without needing access to restricted datasets
- Teaching — ready-made datasets for classroom exercises, workshops, and tutorials
- Prototyping — build and test tools against realistic data before connecting to real sources
- Demonstration — showcase analysis workflows without sharing confidential programme data
All datasets are synthetic — no real individuals are represented.
# Clone the repository
git clone https://github.com/Varnasr/devdata-practice.git
cd devdata-practice
# Install dependencies
pip install -r requirements.txt
# Generate all 36 datasets
python generate.py
# List available datasets
python generate.py --list
# Generate specific datasets
python generate.py rct_experiment labor_market household_survey

Generated files are saved to the data/ directory as CSV files.
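Once generation finishes, the CSVs can be inspected with nothing beyond the standard library. The following is a minimal sketch, assuming the files sit directly under data/ and end in .csv. The helper name `summarise_csvs` is hypothetical and is not part of the repository.

```python
import csv
from pathlib import Path


def summarise_csvs(data_dir="data"):
    """Return {file stem: (row count, column names)} for each CSV in data_dir.

    Hypothetical helper for illustration; assumes each file has a header row.
    """
    summary = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])        # first line holds the column names
            n_rows = sum(1 for _ in reader)  # remaining lines are data rows
        summary[path.stem] = (n_rows, header)
    return summary


if __name__ == "__main__":
    for name, (n_rows, columns) in summarise_csvs().items():
        print(f"{name}: {n_rows} rows, {len(columns)} columns")
```

For actual analysis, `pandas.read_csv` is the more usual next step, since pandas is already listed in the requirements.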
| Category | Generators |
|---|---|
| Health & Nutrition | health_nutrition, public_health, wash |
| Education | education, girls_education, irt_assessment |
| Livelihoods & Labour | livelihoods, labor_market, decent_work, microfinance |
| Gender & Social | gender_programme, care_economy, intersectionality, social_emotional_learning |
| Agriculture & Environment | agriculture, agri_value_chain, climate_resilience, environmental_justice |
| Governance & Policy | governance, social_protection, ngo_finance |
| Impact Evaluation | rct_experiment, cost_effectiveness, targeting, panel_data |
| Surveys & Field Work | household_survey, field_survey_quality |
| Behaviour & Communications | behaviour_change, media_development, bcc |
| Economics & Markets | trade_markets, digital_access, humanitarian |
| Development Architecture | aid_effectiveness, advocacy_rights, community_development |
Each generator produces datasets modelled on real-world survey frameworks:
| Framework | Modelled in |
|---|---|
| NFHS / DHS | health_nutrition, household_survey, gender_programme |
| ASER | education, girls_education |
| IHDS | household_survey, livelihoods |
| J-PAL RCT designs | rct_experiment, targeting |
| IRT (Rasch/2PL) | irt_assessment |
Variable names, distributions, and correlation structures are calibrated to approximate real survey data. Row counts are configurable; the default is about 23,000 rows per dataset.
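To give a sense of what "modelled on" a framework can mean in practice, here is a rough sketch of simulating binary test responses under a two-parameter logistic (2PL) IRT model. It is not the code used by irt_assessment. The function name, parameter names, and the standard-normal ability distribution are all assumptions made for illustration.

```python
import math
import random


def simulate_2pl(n_persons, discriminations, difficulties, seed=0):
    """Simulate binary item responses under a 2PL IRT model (illustrative only).

    P(correct) = 1 / (1 + exp(-a * (theta - b))), with theta drawn from N(0, 1).
    Setting every discrimination a to 1 reduces this to the Rasch form.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    rows = []
    for person_id in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # latent ability (assumed standard normal)
        responses = []
        for a, b in zip(discriminations, difficulties):
            p_correct = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            responses.append(1 if rng.random() < p_correct else 0)
        rows.append({"person_id": person_id, "theta": theta, "responses": responses})
    return rows


if __name__ == "__main__":
    data = simulate_2pl(1000, [1.0, 1.0, 1.0], [-1.0, 0.0, 1.0])
    for item in range(3):
        share = sum(row["responses"][item] for row in data) / len(data)
        print(f"item {item}: share correct = {share:.2f}")
```

With difficulties of -1, 0, and 1, the share of correct answers should fall from the first item to the last, which is the kind of structure an analyst would expect to recover when fitting an IRT model to assessment data.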
devdata-practice/
├── generate.py # Main entry point
├── requirements.txt # Python dependencies
├── generators/ # One file per dataset type (36 generators)
│ ├── __init__.py
│ ├── household_survey.py
│ ├── rct_experiment.py
│ ├── health_nutrition.py
│ └── ... (33 more)
├── docs/ # Documentation source (GitHub Pages)
├── LICENSE
└── README.md
pandas>=1.5.0
numpy>=1.23.0
scipy>=1.9.0
faker>=15.0.0
Python 3.9 or higher.
DevData Practice is an ImpactMojo Professional tier resource, also available as open source for self-hosted use.
Related repositories:
- ImpactMojo — Main platform
- deveconomics-toolkit — R and Python Shiny apps for development econometrics
- InsightStack — MEL tools and calculators
MIT License — see LICENSE for details.
If you use DevData Practice in research or teaching, please cite:
Sri Raman, V. (2025). DevData Practice: Synthetic datasets for development economics [Software].
GitHub. https://github.com/Varnasr/devdata-practice
Or use the CITATION.cff file.