📊 Data-Analysis

📋 Table of Contents

🎯 Overview
🛠️ Tech Stack
📁 Projects
- 1️⃣ Demographic Data Analyzer
- 2️⃣ Mean, Variance & Standard Deviation Calculator
- 3️⃣ Medical Data Visualizer
- 4️⃣ Page View Time Series Visualizer
- 5️⃣ Sea Level Predictor
🚀 Getting Started
❤️ Credits & Attribution

📌 Note: Each project includes a Jupyter Notebook (.ipynb) for local interactive exploration and is also available on Google Colab - no local setup required.

🎯 Overview

A collection of Python data analysis and visualization projects demonstrating core data science techniques:

- Statistical computation - matrix operations, descriptive statistics, linear regression
- Exploratory data analysis (EDA) - trends, distributions, seasonality, correlations, forecasting
- Data visualization - line charts, bar plots, box plots, heatmaps, scatter plots, time-series plots
- Data cleaning & preprocessing - outlier removal, categorical encoding, feature engineering
- Predictive modeling - trend analysis, extrapolation, linear regression forecasting

Two complementary implementations exist for each project:

- Python scripts (.py) - implement the exact logic required to pass the freeCodeCamp certification unit tests defined in each project's test_module.py. They are scoped precisely to the certification requirements and are run through main.py, the executable entry point.
- Jupyter Notebooks (.ipynb) - go well beyond the certification scope, combining the required logic with extended exploratory data analysis (EDA), additional visualizations (heatmaps, rolling averages, distribution plots, correlation rankings), and deeper dataset research. They can be run locally or on Google Colab, which requires no local setup.

🛠️ Tech Stack

| Category | Tools & Libraries |
| --- | --- |
| Language | Python 3.7+ |
| Data Handling | pandas, NumPy |
| Visualization | Matplotlib, Seaborn, pandas.plotting |
| Statistics | SciPy (linregress, pearsonr, stats) |
| Machine Learning | scikit-learn (sklearn) |
| Testing | Unit tests (unittest) |
| Environment | Jupyter Notebook, Google Colab, IDE (VS Code) |

📁 Projects

1️⃣ Demographic Data Analyzer

US Census Income & Demographics Analyzer

Analyzes demographic and income patterns from the UCI Adult Census Income dataset - answering key questions on socioeconomic disparities, education premiums, and workforce demographics.

Key Features:

🌐 Race distribution count across the full dataset
👨 Average age of male respondents
📚 Percentage of respondents with Bachelor's degree or higher
💰 Income differential by education level (higher vs lower education)
⏱️ Minimum work hours + income rate for minimum-hour workers
🌎 Country with highest high-earner proportion & top occupation for high earners from India
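
A few of the metrics above can be sketched with pandas as follows. This is a minimal illustration, not the repository's code: the six rows are made-up sample data, and the column names (`age`, `sex`, `education`, `salary`) are assumed to match the UCI Adult dataset as distributed with the freeCodeCamp boilerplate.

```python
import pandas as pd

# Hypothetical sample rows; the real project reads the UCI Adult CSV instead.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Race-style distribution counts work the same way, e.g. df["sex"].value_counts().

# Average age of male respondents, rounded to one decimal.
average_age_men = round(df.loc[df["sex"] == "Male", "age"].mean(), 1)

# Boolean masks: "higher education" (assumed here to mean Bachelors,
# Masters, or Doctorate) and high earners (>50K).
higher = df["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = df["salary"] == ">50K"

# Percentage of respondents with a Bachelor's degree or higher.
percentage_higher_education = round(higher.mean() * 100, 1)

# Share of >50K earners within each education group.
higher_education_rich = round(rich[higher].mean() * 100, 1)
lower_education_rich = round(rich[~higher].mean() * 100, 1)
```

Taking the mean of a boolean mask gives the fraction of `True` values directly, which avoids separate count-and-divide steps.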

Key Findings:

✅ Education premium: Respondents with higher education earn >50K at a ~4.2× higher rate than those with lower education
✅ Race skew: Dataset heavily skewed toward White respondents (~27,000 of 32,561); Black and Asian populations underrepresented
✅ Class imbalance: Only ~24% earn >50K - binary outcome is imbalanced, important for classification models
✅ Age factor: Average age of >50K earners is ~44 years; strong correlation with experience and seniority
✅ Gender patterns: Male respondents average age ~40; female respondents average ~38 - consistent with historical workforce trends
✅ Geographic insight: India ranks among top countries with >50K earners despite small sample size - likely selection bias (skilled professionals)

2️⃣ Mean, Variance & Standard Deviation Calculator

Statistical Calculator for 3×3 Matrix Operations

Computes mean, variance, standard deviation, max, min, and sum on a list of 9 numbers reshaped into a 3×3 matrix - along columns, rows, and the flattened array.

Key Features:

✅ Input validation - raises ValueError if input is not exactly 9 elements
✅ Results computed along axis 0 (columns), axis 1 (rows), and flattened
✅ Visualizations: heatmap + bar charts per statistical measure

Output Format:

```
{
    'mean':               [axis_0_list, axis_1_list, flattened_value],
    'variance':           [axis_0_list, axis_1_list, flattened_value],
    'standard deviation': [axis_0_list, axis_1_list, flattened_value],
    'max':                [axis_0_list, axis_1_list, flattened_value],
    'min':                [axis_0_list, axis_1_list, flattened_value],
    'sum':                [axis_0_list, axis_1_list, flattened_value]
}
```
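
One way to produce a dictionary of this shape with NumPy is sketched below. The function name `calculate` and the error message are assumptions made for illustration; the repository's own implementation may be organized differently.

```python
import numpy as np

def calculate(numbers):
    """Return summary statistics of a 3x3 matrix built from nine numbers."""
    if len(numbers) != 9:
        # Assumed message; the requirement is only that a ValueError is raised.
        raise ValueError("List must contain nine numbers.")
    matrix = np.array(numbers).reshape(3, 3)

    # Map each output key to the NumPy reduction that produces it.
    reducers = {
        "mean": np.mean,
        "variance": np.var,
        "standard deviation": np.std,
        "max": np.max,
        "min": np.min,
        "sum": np.sum,
    }

    # For each measure: columns (axis 0), rows (axis 1), then the whole matrix.
    return {
        name: [
            fn(matrix, axis=0).tolist(),
            fn(matrix, axis=1).tolist(),
            fn(matrix).tolist(),
        ]
        for name, fn in reducers.items()
    }

result = calculate([0, 1, 2, 3, 4, 5, 6, 7, 8])
print(result["mean"])  # [[3.0, 4.0, 5.0], [1.0, 4.0, 7.0], 4.0]
```

Calling `.tolist()` converts NumPy arrays and scalars to plain Python lists and numbers, so the result compares cleanly against expected values in unit tests.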

3️⃣ Medical Data Visualizer

Cardiovascular Disease Risk Analysis & Data Visualization

Analyzes the Kaggle Cardiovascular Disease dataset (70,000 patients) to explore relationships between lifestyle factors, biomarkers, and cardiovascular disease risk through exploratory data analysis and statistical visualization.

Key Features:

📊 Categorical plot - count of categorical variables (active, smoke, alcohol, cholesterol, glucose, BMI) stratified by CVD status
🔥 Correlation heatmap - Pearson correlation matrix with upper triangle masked for clarity
📈 Feature correlation rankings - horizontal bar chart showing which features correlate strongest with CVD
🎯 Age & BMI group analysis - CVD prevalence by age group and BMI category
💉 Biomarker distributions - cholesterol and glucose normalization (binary: normal vs elevated)
🧮 Feature engineering - BMI calculation from weight/height, categorical binning for age and BMI groups
⚕️ Data cleaning - removes physiologically impossible values (e.g., DBP > SBP) and outliers outside the 2.5th–97.5th percentile range
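
The feature engineering and cleaning steps listed above might look roughly like this in pandas. The four patient rows are invented for illustration, and the column names (`height` in cm, `weight` in kg, `ap_hi` for systolic and `ap_lo` for diastolic pressure) are assumed to follow the Kaggle dataset; the BMI threshold of 25 for "overweight" is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical patient rows, not taken from the real dataset.
df = pd.DataFrame({
    "height": [170, 160, 180, 165],
    "weight": [70.0, 80.0, 75.0, 60.0],
    "ap_hi": [120, 130, 80, 110],
    "ap_lo": [80, 90, 120, 70],
    "cholesterol": [1, 3, 2, 1],
})

# BMI = weight (kg) / height (m)^2, then a binary overweight flag.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df["overweight"] = (df["bmi"] > 25).astype(int)

# Binary normalization: 1 (normal) becomes 0, anything above becomes 1.
df["cholesterol"] = (df["cholesterol"] > 1).astype(int)

# Drop physiologically impossible rows where diastolic exceeds systolic.
clean = df[df["ap_lo"] <= df["ap_hi"]]

# Boolean mask covering the upper triangle (diagonal included), so each
# correlation pair appears only once when passed to a heatmap.
corr = clean.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
```

The mask can then be handed to a plotting call such as `seaborn.heatmap(corr, mask=mask)`.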

Key Findings:

✅ Balanced binary outcome: 50/50 split between CVD/non-CVD - excellent for supervised learning without class resampling
✅ Lifestyle factors paradox: Activity, smoking, alcohol show ~0 correlation with CVD - may indicate dataset bias or confounding variables
✅ Real physiological drivers: Systolic BP (r≈+0.4), cholesterol (r≈+0.3), age (r≈+0.2) are strongest CVD predictors
✅ Age-BMI interaction: Overweight prevalence increases with age; BMI categories show age-dependent CVD risk
✅ Data quality: No impossible values (DBP > SBP) remain after cleaning; outliers are trimmed conservatively (2.5th–97.5th percentile range)

4️⃣ Page View Time Series Visualizer

freeCodeCamp Forum Daily Traffic Analysis (2016–2019)

Visualizes and analyzes 1,238 days of freeCodeCamp.org forum page views - revealing growth trends, seasonal patterns, and distribution shifts across a 3.5-year period. Data is cleaned by removing the top/bottom 2.5% of outliers.

Key Features:

📈 Line plot: Tracks daily page views with clear visibility of growth trajectory and spike events
📊 Grouped bar chart: Average monthly views by year - reveals seasonal patterns and year-over-year comparison
📦 Box plots: Dual side-by-side plots showing year-wise trend (upward distribution shift) and month-wise seasonality (October–November peaks)
🔁 Rolling mean analysis: 30-day centred rolling average isolates the long-term trend from daily noise
📋 Statistical outlier documentation: Box plots display IQR-based outlier dots with full transparency on data quality
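
The cleaning and smoothing steps can be sketched as below. The series is synthetic random data standing in for the forum CSV, and the column name `value` is an assumption; only the quantile filter, the rolling mean, and the year/month grouping reflect the techniques described above.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the forum page-view CSV.
dates = pd.date_range("2016-05-09", periods=200, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.integers(1_000, 100_000, size=200)}, index=dates)

# Keep only days inside the 2.5th-97.5th percentile band.
low, high = df["value"].quantile([0.025, 0.975])
clean = df[(df["value"] >= low) & (df["value"] <= high)]

# Centred rolling mean over 30 observations to smooth out daily noise.
trend = clean["value"].rolling(window=30, center=True).mean()

# Average page views per (year, month): rows are years, columns are months,
# which is the layout a grouped bar chart needs.
monthly = (
    clean.groupby([clean.index.year, clean.index.month])["value"]
    .mean()
    .unstack()
)
```

Note that `rolling(window=30)` counts rows rather than calendar days, so once outlier days are dropped the window spans roughly, not exactly, 30 days.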

Key Findings:

✅ Explosive growth: Forum traffic grew ~3.3× from 2016 to 2019 (mean daily views: ~30K → ~100K)
✅ Acceleration phase: Clear growth acceleration begins in late 2018 - detected via 30-day rolling mean inflection
✅ Academic seasonality: January–February and October–November show consistently elevated activity; June–August show dips - suggests student/educator-driven traffic
✅ Improving consistency: 2019 shows narrower gap between mean and max values - traffic became more stable, fewer extreme spikes relative to baseline
✅ Statistical outliers present: Even after 2.5% quantile removal, box plots reveal IQR-based outliers - genuine exceptional traffic days, especially in peak months
✅ Partial-year caveat: 2016 data begins in May (not January), so full-year comparisons with 2017–2019 should be interpreted with care

5️⃣ Sea Level Predictor

Global Sea Level Rise Analysis & 2050 Forecasting

Analyzes 172 years of global average sea level measurements (1880–present) from NOAA/CSIRO data - predicting sea level rise through 2050 using linear regression across two distinct time windows.

Key Features:

📊 Scatter plot - Visualizes historical sea level observations with clear temporal trends
📈 Long-term regression (1880–present) - Establishes baseline 140-year trend slope and projects through 2050
🚀 Recent trend regression (2000–present) - Captures acceleration phase by fitting recent 20+ years separately, revealing steeper rise rate
🔮 Dual forecasts - Compares pessimistic (recent trend) vs. conservative (full history) scenarios for 2050 predictions
📐 Statistical metrics - Reports slope (inches/year) and y-intercept for both regression models
🎯 Annotated visualization - Line labels display slope values for easy interpretation
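
The two-window regression can be sketched with `scipy.stats.linregress` as follows. The data here is synthetic and noiseless: a rise of 0.06 in/year that steepens to 0.13 in/year after 2000, chosen only to make the example easy to follow. The helper `forecast` and the year range are hypothetical and not taken from the repository.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in for the sea level CSV (arbitrary year range).
years = np.arange(1880, 2021)
levels = np.where(
    years < 2000,
    0.06 * (years - 1880),
    0.06 * 120 + 0.13 * (years - 2000),
)

def forecast(x, y, target_year):
    """Fit y = slope * x + intercept and evaluate the line at target_year."""
    fit = linregress(x, y)
    return fit.slope, fit.intercept, fit.slope * target_year + fit.intercept

# Full-history fit versus a fit restricted to the year 2000 onward.
slope_all, intercept_all, level_2050_all = forecast(years, levels, 2050)
recent = years >= 2000
slope_recent, intercept_recent, level_2050_recent = forecast(
    years[recent], levels[recent], 2050
)

print(f"full history: {slope_all:.3f} in/year, recent: {slope_recent:.3f} in/year")
```

Because the recent window sees only the steeper segment, its slope and 2050 projection come out higher than the full-history fit, which averages over the slower early decades.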

Key Findings:

✅ Accelerating rise: Recent sea level rise rate (~0.13 in/year) significantly exceeds long-term average (~0.06 in/year) - 2.2× acceleration detected
✅ 2050 projections: Conservative model predicts ~14.5 inches rise from 1880 baseline; recent-trend model projects ~20+ inches - wide range reflects climate uncertainty
✅ Inflection point: Acceleration becomes evident after year 2000, aligning with intensified climate change impacts
✅ Linear extrapolation limits: Assumes constant rates; actual sea level rise may be nonlinear due to ice sheet collapse, thermal expansion acceleration
✅ Data quality: CSIRO Adjusted Sea Level provides satellite-era continuity with pre-satellite tide gauge records

🚀 Getting Started

Prerequisites

Ensure Python 3.7 or later is installed:

```
python --version
```

Installation

Clone this repository:
```
git clone https://github.com/pedromst2000/Data-Analysis.git
cd Data-Analysis
```

Navigate to a project folder:
```
cd <project-folder>
```
Install dependencies:
```
pip install -r requirements.txt
```
📌 Dependency Pinning: requirements.txt files pin specific versions to ensure reproducibility across projects. If installation fails, install packages individually:
```
pip install pandas numpy matplotlib seaborn scipy
```
📌 Note: You only need to install dependencies if you want to run the Python scripts. The Jupyter notebooks use the libraries already available in the Colab environment or in your Jupyter kernel, so they need no additional setup.

Running Projects

Option A - Python Script

Each project is fully self-contained. Run the main script to execute the analysis and display unit test results:

```
python main.py
```

Output includes:

✅ Function results (statistics, aggregations, visualizations)
✅ Unit test pass/fail status with detailed assertions

Option B - Jupyter Notebook (Local)

Each project folder contains a .ipynb notebook for interactive, cell-by-cell exploration. Open it with Jupyter:

```
jupyter notebook <notebook-name>.ipynb
```

Or simply open the .ipynb file directly in VS Code (with the Jupyter extension installed).

📌 Note: If Jupyter is not installed, run pip install notebook first.

Option C - Google Colab (Cloud)

❤️ Credits & Attribution

These projects are derived from the freeCodeCamp Data Analysis with Python certification and use its official boilerplate templates as a starting point. All solutions have been extended with deeper exploratory data analysis, advanced visualization techniques, and statistical insights beyond the certification requirements.

Certification Earned: Data Analysis with Python - v7

Full credit and attribution go to the freeCodeCamp open-source community for the curriculum and boilerplate designs.

