- Overview
- Tech Stack
- Projects
  1. Demographic Data Analyzer
  2. Mean, Variance & Standard Deviation Calculator
  3. Medical Data Visualizer
  4. Page View Time Series Visualizer
  5. Sea Level Predictor
- Getting Started
- Credits & Attribution
Note: Each project includes a Jupyter Notebook (`.ipynb`) for local interactive exploration and is also available on Google Colab, where no local setup is required.
A collection of Python data analysis and visualization projects demonstrating core data science techniques:
- Statistical computation - matrix operations, descriptive statistics, linear regression
- Exploratory data analysis (EDA) - trends, distributions, seasonality, correlations, forecasting
- Data visualization - line charts, bar plots, box plots, heatmaps, scatter plots, time-series plots
- Data cleaning & preprocessing - outlier removal, categorical encoding, feature engineering
- Predictive modeling - trend analysis, extrapolation, linear regression forecasting
Two complementary implementations exist for each project:
- Python scripts (`.py`) - implement the exact logic required to pass the freeCodeCamp certification unit tests defined in each project's `test_module.py`. They are scoped precisely to the certification requirements and serve as the executable entry point via `main.py`.
- Jupyter Notebooks (`.ipynb`) - go well beyond the certification scope, combining the required logic with extended exploratory data analysis (EDA), additional visualizations (heatmaps, rolling averages, distribution plots, correlation rankings), and deeper dataset research. They can be run locally or on Google Colab, where no local setup is required.
| Category | Tools & Libraries |
|---|---|
| Language | Python 3.7+ |
| Data Handling | pandas, NumPy |
| Visualization | Matplotlib, Seaborn, pandas.plotting |
| Statistics | SciPy (linregress, pearsonr, stats) |
| Machine Learning | scikit-learn (sklearn) |
| Testing | Unit tests (unittest) |
| Environment | Jupyter Notebook, Google Colab, IDE (VS Code) |
Analyzes demographic and income patterns from the UCI Adult Census Income dataset - answering key questions on socioeconomic disparities, education premiums, and workforce demographics.
Key Features:
- Race distribution count across the full dataset
- Average age of male respondents
- Percentage of respondents with Bachelor's degree or higher
- Income differential by education level (higher vs lower education)
- Minimum work hours + income rate for minimum-hour workers
- Country with highest high-earner proportion & top occupation for high earners from India
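A few of the aggregations listed above can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row sample, not the project's actual code; the column names (`race`, `sex`, `age`, `education`, `salary`) and the set of education levels counted as "higher" are assumptions.

```python
import pandas as pd

# Made-up sample; column names are assumed, not taken from the project files.
df = pd.DataFrame({
    "race": ["White", "White", "Black", "Asian-Pac-Islander"],
    "sex": ["Male", "Female", "Male", "Male"],
    "age": [40, 30, 50, 36],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "salary": [">50K", "<=50K", ">50K", "<=50K"],
})

# Number of respondents per race.
race_count = df["race"].value_counts()

# Average age of male respondents, rounded to one decimal.
average_age_men = round(df.loc[df["sex"] == "Male", "age"].mean(), 1)

# Share of respondents with a Bachelor's degree or higher (assumed level names).
higher_ed = df["education"].isin(["Bachelors", "Masters", "Doctorate"])
pct_higher_ed = round(higher_ed.mean() * 100, 1)

# Share of >50K earners inside each education group.
rich = df["salary"] == ">50K"
higher_ed_rich = round(rich[higher_ed].mean() * 100, 1)
lower_ed_rich = round(rich[~higher_ed].mean() * 100, 1)
```

Boolean masks keep each question to one or two lines, and taking the mean of a boolean Series gives the proportion directly.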
Key Findings:
- Education premium: Respondents with higher education earn >50K at roughly 4.2× the rate of those with lower education
- Race skew: Dataset heavily skewed toward White respondents (~27,000 of 32,561); Black and Asian populations underrepresented
- Class imbalance: Only ~24% earn >50K - the binary outcome is imbalanced, which matters for classification models
- Age factor: Average age of >50K earners is ~44 years; strong correlation with experience and seniority
- Gender patterns: Male respondents average ~40 years of age; female respondents average ~38 - consistent with historical workforce trends
- Geographic insight: India ranks among the top countries for share of >50K earners despite a small sample size - likely selection bias (skilled professionals)
Computes mean, variance, standard deviation, max, min, and sum on a list of 9 numbers reshaped into a 3×3 matrix - along columns, rows, and the flattened array.
Key Features:
- Input validation - raises `ValueError` if input is not exactly 9 elements
- Results computed along axis 0 (columns), axis 1 (rows), and flattened
- Visualizations: heatmap + bar charts per statistical measure
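The calculation described above can be sketched with NumPy as shown here. This is a minimal version under stated assumptions, not necessarily the project's exact code: the function name `calculate` and the error message are illustrative, and variance and standard deviation use NumPy's default population form (`ddof=0`).

```python
import numpy as np


def calculate(numbers):
    """Return per-axis and flattened statistics for nine numbers as a 3x3 matrix."""
    if len(numbers) != 9:
        # Message text is an assumption; the README only states that ValueError is raised.
        raise ValueError("List must contain nine numbers.")

    matrix = np.array(numbers).reshape(3, 3)
    stats = {
        "mean": np.mean,
        "variance": np.var,            # population variance (ddof=0)
        "standard deviation": np.std,  # population standard deviation (ddof=0)
        "max": np.max,
        "min": np.min,
        "sum": np.sum,
    }
    # For each statistic: [along axis 0 (columns), along axis 1 (rows), flattened].
    return {
        name: [
            fn(matrix, axis=0).tolist(),
            fn(matrix, axis=1).tolist(),
            fn(matrix).tolist(),
        ]
        for name, fn in stats.items()
    }


result = calculate([0, 1, 2, 3, 4, 5, 6, 7, 8])
```

Mapping statistic names to NumPy functions avoids repeating the same three calls six times, and `.tolist()` converts NumPy arrays and scalars into plain Python lists and numbers.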
Output Format:

    {
        'mean': [axis_0_list, axis_1_list, flattened_value],
        'variance': [axis_0_list, axis_1_list, flattened_value],
        'standard deviation': [axis_0_list, axis_1_list, flattened_value],
        'max': [axis_0_list, axis_1_list, flattened_value],
        'min': [axis_0_list, axis_1_list, flattened_value],
        'sum': [axis_0_list, axis_1_list, flattened_value]
    }

Analyzes the Kaggle Cardiovascular Disease dataset (70,000 patients) to explore relationships between lifestyle factors, biomarkers, and cardiovascular disease risk through exploratory data analysis and statistical visualization.
Key Features:
- Categorical plot - count of categorical variables (active, smoke, alcohol, cholesterol, glucose, BMI) stratified by CVD status
- Correlation heatmap - Pearson correlation matrix with upper triangle masked for clarity
- Feature correlation rankings - horizontal bar chart showing which features correlate most strongly with CVD
- Age & BMI group analysis - CVD prevalence by age group and BMI category
- Biomarker distributions - cholesterol and glucose normalization (binary: normal vs elevated)
- Feature engineering - BMI calculation from weight/height, categorical binning for age and BMI groups
- Data cleaning - removes physiologically impossible values (e.g., DBP > SBP) and outliers outside the 2.5–97.5 percentile range
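The feature-engineering and cleaning steps above might look roughly like this in pandas. The rows are made up, the column names (`ap_hi`, `ap_lo`, `gluc`, and so on) are assumed to follow the Kaggle CSV, and the BMI category boundaries are the common WHO cut-offs rather than values confirmed by the project.

```python
import pandas as pd

# Made-up rows; column names are assumptions about the source CSV.
df = pd.DataFrame({
    "height": [170, 160, 180, 165],        # cm
    "weight": [70.0, 80.0, 75.0, 95.0],    # kg
    "ap_hi": [120, 140, 80, 130],          # systolic blood pressure (SBP)
    "ap_lo": [80, 90, 120, 85],            # diastolic blood pressure (DBP)
    "cholesterol": [1, 2, 3, 1],
    "gluc": [1, 1, 2, 3],
})

# BMI = weight in kg divided by the square of height in metres.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Bin BMI into categories (assumed WHO-style boundaries, left-closed intervals).
df["bmi_group"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

# Normalize biomarkers to binary: 1 becomes 0 (normal), 2 or 3 becomes 1 (elevated).
for col in ["cholesterol", "gluc"]:
    df[col] = (df[col] > 1).astype(int)

# Drop physiologically impossible rows where DBP exceeds SBP.
clean = df[df["ap_lo"] <= df["ap_hi"]]
```

Percentile-based outlier removal follows the same boolean-mask pattern, using `Series.quantile(0.025)` and `Series.quantile(0.975)` as the bounds.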
Key Findings:
- Balanced binary outcome: 50/50 split between CVD/non-CVD - well suited to supervised learning without class resampling
- Lifestyle factors paradox: Activity, smoking, and alcohol show ~0 correlation with CVD - may indicate dataset bias or confounding variables
- Real physiological drivers: Systolic BP (r ≈ +0.4), cholesterol (r ≈ +0.3), and age (r ≈ +0.2) are the strongest CVD predictors
- Age-BMI interaction: Overweight prevalence increases with age; BMI categories show age-dependent CVD risk
- Data quality: No impossible values (DBP > SBP) remain after cleaning; outliers removed conservatively (2.5–97.5 percentile)
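The correlation ranking and the masked heatmap rest on two small computations, sketched here on synthetic data. The data is randomly generated so that blood pressure and age drive the outcome and activity does not; it only mimics the pattern described above and is not the real dataset. The column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: the outcome depends on blood pressure and age, not on activity.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(53, 7, n)
ap_hi = rng.normal(125, 15, n)
active = rng.integers(0, 2, n)
score = 0.05 * (ap_hi - 125) + 0.03 * (age - 53) + rng.normal(0, 1, n)

df = pd.DataFrame({
    "age": age,
    "ap_hi": ap_hi,
    "active": active,
    "cardio": (score > 0).astype(int),
})

# Pearson correlation matrix (the pandas default method).
corr = df.corr()

# Boolean mask that is True on and above the diagonal, hiding the redundant half.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Rank features by absolute correlation with the target while keeping the sign.
target = corr["cardio"].drop("cardio")
ranking = target.reindex(target.abs().sort_values(ascending=False).index)
```

The `mask` array can be passed to the `mask` parameter of `seaborn.heatmap`, and `ranking` can be drawn with a horizontal bar plot.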
Visualizes and analyzes 1,238 days of freeCodeCamp.org forum page views - revealing growth trends, seasonal patterns, and distribution shifts across a 3.5-year period. Data is cleaned by removing the top/bottom 2.5% of outliers.
Key Features:
- Line plot: Tracks daily page views with clear visibility of growth trajectory and spike events
- Grouped bar chart: Average monthly views by year - reveals seasonal patterns and year-over-year comparison
- Box plots: Dual side-by-side plots showing year-wise trend (upward distribution shift) and month-wise seasonality (October–November peaks)
- Rolling mean analysis: 30-day centred rolling average isolates the long-term trend from daily noise
- Statistical outlier documentation: Box plots display IQR-based outlier dots with full transparency on data quality
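The quantile-based cleaning and the centred rolling mean can be sketched as below. The series is synthetic (a linear ramp plus noise) and merely stands in for the forum data; the column name `value` is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the page-view data.
dates = pd.date_range("2016-05-09", periods=400, freq="D")
rng = np.random.default_rng(1)
values = np.linspace(20_000, 60_000, 400) + rng.normal(0, 3_000, 400)
df = pd.DataFrame({"value": values}, index=dates)

# Keep only rows between the 2.5th and 97.5th percentiles.
low, high = df["value"].quantile([0.025, 0.975])
clean = df[(df["value"] >= low) & (df["value"] <= high)]

# 30-row centred rolling mean. The window counts rows, so after cleaning it
# spans approximately (not exactly) 30 calendar days.
clean = clean.assign(
    rolling_30=clean["value"].rolling(window=30, center=True).mean()
)
```

With `center=True` the average is aligned to the middle of each window, so the smoothed line does not lag behind the raw series; the first and last rows are `NaN` because a full window is not available there.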
Key Findings:
- Explosive growth: Forum traffic grew ~3.3× from 2016 to 2019 (mean daily views: ~30K → ~100K)
- Acceleration phase: Clear growth acceleration begins in late 2018 - detected via 30-day rolling mean inflection
- Academic seasonality: January–February and October–November show consistently elevated activity; June–August show dips - suggests student/educator-driven traffic
- Improving consistency: 2019 shows narrower gap between mean and max values - traffic became more stable, fewer extreme spikes relative to baseline
- Statistical outliers present: Even after 2.5% quantile removal, box plots reveal IQR-based outliers - genuine exceptional traffic days, especially in peak months
- Partial-year caveat: 2016 data begins in May (not January) - full-year comparison with 2017–2019 requires interpretation care
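The year-by-month averages behind a grouped bar chart, and the partial-year caveat noted above, can be illustrated with a small pandas pivot. The data here is a simple counter, not real page views, and the column name `value` is an assumption.

```python
import pandas as pd

# Placeholder series that starts in May 2016, like the real data.
dates = pd.date_range("2016-05-01", "2017-12-31", freq="D")
df = pd.DataFrame({"value": range(len(dates))}, index=dates)

# Mean value per (year, month), reshaped so rows are years and columns are months.
monthly = (
    df.groupby([df.index.year, df.index.month])["value"]
    .mean()
    .unstack()
)
monthly.index.name = "year"
monthly.columns.name = "month"

# Months with no data for a given year (January to April 2016 here) appear as NaN.
missing_2016 = monthly.loc[2016].isna().sum()
```

Calling `monthly.plot(kind="bar")` on a frame shaped like this yields one group of bars per year, and the `NaN` cells make the missing months of the first year explicit instead of silently skewing its average.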
Analyzes more than a century of global average sea level measurements (1880–present) from NOAA/CSIRO data - predicting sea level rise through 2050 using linear regression across two distinct time windows.
Key Features:
- Scatter plot - Visualizes historical sea level observations with clear temporal trends
- Long-term regression (1880–present) - Establishes baseline 140-year trend slope and projects through 2050
- Recent trend regression (2000–present) - Captures acceleration phase by fitting recent 20+ years separately, revealing steeper rise rate
- Dual forecasts - Compares pessimistic (recent trend) vs. conservative (full history) scenarios for 2050 predictions
- Statistical metrics - Reports slope (inches/year) and y-intercept for both regression models
- Annotated visualization - Line labels display slope values for easy interpretation
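The two-window regression can be sketched with `scipy.stats.linregress` as follows. The series is synthetic: it is built from two straight segments with illustrative slopes of 0.06 and 0.13 in/year and an arbitrary end year, so the numbers it produces are not the project's results. The helper `predict` is a hypothetical name.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic series (not real measurements): slow rise until 2000, faster afterwards.
years = np.arange(1880, 2021)
level = np.where(
    years < 2000,
    0.06 * (years - 1880),
    0.06 * 120 + 0.13 * (years - 2000),
)

# Fit 1: the full record.
full_fit = linregress(years, level)

# Fit 2: only observations from 2000 onward.
recent = years >= 2000
recent_fit = linregress(years[recent], level[recent])


def predict(fit, year):
    """Extrapolate a fitted line to the given year."""
    return fit.slope * year + fit.intercept


full_2050 = predict(full_fit, 2050)
recent_2050 = predict(recent_fit, 2050)
```

Fitting the same function to two subsets keeps the comparison simple: `fit.slope` is the rise rate in inches per year for each window, and the gap between `full_2050` and `recent_2050` shows how strongly the choice of window affects the extrapolation.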
Key Findings:
- Accelerating rise: Recent sea level rise rate (~0.13 in/year) significantly exceeds long-term average (~0.06 in/year) - 2.2× acceleration detected
- 2050 projections: Conservative model predicts ~14.5 inches rise from 1880 baseline; recent-trend model projects ~20+ inches - wide range reflects climate uncertainty
- Inflection point: Acceleration becomes evident after year 2000, aligning with intensified climate change impacts
- Linear extrapolation limits: Assumes constant rates; actual sea level rise may be nonlinear due to ice sheet collapse, thermal expansion acceleration
- Data quality: CSIRO Adjusted Sea Level provides satellite-era continuity with pre-satellite tide gauge records
Ensure Python 3.7 or later is installed: `python --version`
- Clone this repository and enter it: `git clone https://github.com/pedromst2000/Data-Analysis.git`, then `cd Data-Analysis`
- Navigate to a project folder: `cd <project-folder>`
- Install dependencies: `pip install -r requirements.txt`

Dependency Pinning: The `requirements.txt` files pin specific versions to ensure reproducibility across projects. If installation fails, install packages individually: `pip install pandas numpy matplotlib seaborn scipy`

Note: You only need to install dependencies if you want to run the Python scripts. The Jupyter notebooks can use the libraries already provided by the Colab environment or the local kernel without additional setup.
Each project is fully self-contained. Run the main script with `python main.py` to execute the analysis and display unit test results.

Output includes:
- Function results (statistics, aggregations, visualizations)
- Unit test pass/fail status with detailed assertions
Each project folder contains a `.ipynb` notebook for interactive, cell-by-cell exploration. Open it with Jupyter by running `jupyter notebook <notebook-name>.ipynb`, or simply open the `.ipynb` file directly in VS Code (with the Jupyter extension installed).

Note: If Jupyter is not installed, run `pip install notebook` first.
© Built with freeCodeCamp
These projects are derived from the freeCodeCamp Data Analysis with Python certification, leveraging their official boilerplate templates as starting foundations. All solutions have been extended with deeper exploratory data analysis, advanced visualization techniques, and statistical insights beyond certification requirements.
Certification Earned: Data Analysis with Python - v7
Full credit and attribution go to the freeCodeCamp open-source community for the curriculum and boilerplate designs.
