pedromst2000/Data-Analysis

Data-Analysis


Note: Each project includes a Jupyter Notebook (.ipynb) for local interactive exploration and is also available on Google Colab, with no local setup required.


Overview

A collection of Python data analysis and visualization projects demonstrating core data science techniques:

  • Statistical computation - matrix operations, descriptive statistics, linear regression
  • Exploratory data analysis (EDA) - trends, distributions, seasonality, correlations, forecasting
  • Data visualization - line charts, bar plots, box plots, heatmaps, scatter plots, time-series plots
  • Data cleaning & preprocessing - outlier removal, categorical encoding, feature engineering
  • Predictive modeling - trend analysis, extrapolation, linear regression forecasting

Two complementary implementations exist for each project:

  • Python scripts (.py) - implement the exact logic required to pass the freeCodeCamp certification unit tests defined in each project's test_module.py. They are scoped precisely to the certification requirements and serve as the executable entry point via main.py.
  • Jupyter Notebooks (.ipynb) - go significantly further than the certification scope, combining the required logic with extended exploratory data analysis (EDA), additional visualizations (heatmaps, rolling averages, distribution plots, correlation rankings), and deeper dataset research. Available both locally and on Google Colab for a cloud-based experience with no local setup required.

Tech Stack

Category            Tools & Libraries
Language            Python 3.7+
Data Handling       pandas, NumPy
Visualization       Matplotlib, Seaborn, pandas.plotting
Statistics          SciPy (linregress, pearsonr, stats)
Machine Learning    scikit-learn (sklearn)
Testing             Unit tests (unittest)
Environment         Jupyter Notebook, Google Colab, IDE (VS Code)

Projects

1. Demographic Data Analyzer

US Census Income & Demographics Analyzer

Analyzes demographic and income patterns from the UCI Adult Census Income dataset - answering key questions on socioeconomic disparities, education premiums, and workforce demographics.

Key Features:

  • Race distribution count across the full dataset
  • Average age of male respondents
  • Percentage of respondents with a Bachelor's degree or higher
  • Income differential by education level (higher vs. lower education)
  • Minimum weekly work hours and the share of minimum-hour workers earning >50K
  • Country with the highest proportion of high earners, and the top occupation for high earners from India
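
The questions above map onto a handful of pandas operations. The following is a minimal sketch rather than the repository's code: the inline sample is made up, and the column names (sex, age, education, hours-per-week, salary) are assumed to follow the UCI Adult dataset.

```python
import pandas as pd

# Made-up sample; column names are assumed to follow the UCI Adult dataset.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "age": [30, 40, 50, 40, 35, 28],
    "education": ["Bachelors", "HS-grad", "Masters", "Doctorate", "Bachelors", "HS-grad"],
    "hours-per-week": [40, 20, 60, 20, 40, 45],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Average age of male respondents.
average_age_men = round(df.loc[df["sex"] == "Male", "age"].mean(), 1)

# Share of >50K earners with and without higher education.
higher = df["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = df["salary"] == ">50K"
higher_rich_pct = round(100 * rich[higher].mean(), 1)
lower_rich_pct = round(100 * rich[~higher].mean(), 1)

# Minimum weekly hours and the >50K share among people working that minimum.
min_hours = df["hours-per-week"].min()
rich_min_pct = round(100 * rich[df["hours-per-week"] == min_hours].mean(), 1)

print(average_age_men, higher_rich_pct, lower_rich_pct, min_hours, rich_min_pct)
```

Boolean masks such as `rich[higher]` keep each question to one line, and taking the mean of a boolean Series gives the proportion of True values directly.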

Key Findings:

  • Education premium: Respondents with higher education earn >50K at a ~4.2× higher rate than those with lower education
  • Race skew: Dataset is heavily skewed toward White respondents (~27,000 of 32,561); Black and Asian populations are underrepresented
  • Class imbalance: Only ~24% earn >50K - the binary outcome is imbalanced, which matters for classification models
  • Age factor: Average age of >50K earners is ~44 years, consistent with income rising with experience and seniority
  • Gender patterns: Male respondents average age ~40; female respondents average ~38 - consistent with historical workforce trends
  • Geographic insight: India ranks among the top countries for >50K earners despite its small sample size - likely selection bias (skilled professionals)

2. Mean, Variance & Standard Deviation Calculator

Statistical Calculator for 3×3 Matrix Operations

Computes mean, variance, standard deviation, max, min, and sum on a list of 9 numbers reshaped into a 3×3 matrix - along columns, rows, and the flattened array.

Key Features:

  • Input validation - raises ValueError if input is not exactly 9 elements
  • Results computed along axis 0 (columns), axis 1 (rows), and flattened
  • Visualizations: heatmap + bar charts per statistical measure

Output Format:

{
    'mean':               [axis_0_list, axis_1_list, flattened_value],
    'variance':           [axis_0_list, axis_1_list, flattened_value],
    'standard deviation': [axis_0_list, axis_1_list, flattened_value],
    'max':                [axis_0_list, axis_1_list, flattened_value],
    'min':                [axis_0_list, axis_1_list, flattened_value],
    'sum':                [axis_0_list, axis_1_list, flattened_value]
}
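
A function producing this structure can be sketched with NumPy as follows. This is an illustration under assumptions, not necessarily the repository's implementation: the function name `calculate` and the exact error message are assumed here.

```python
import numpy as np

def calculate(numbers):
    """Return statistics of a 3x3 matrix built from a list of 9 numbers."""
    if len(numbers) != 9:
        # The message text is an assumption; the README only specifies ValueError.
        raise ValueError("List must contain nine numbers.")

    matrix = np.array(numbers).reshape(3, 3)
    operations = {
        "mean": np.mean,
        "variance": np.var,
        "standard deviation": np.std,
        "max": np.max,
        "min": np.min,
        "sum": np.sum,
    }
    # For each measure: [per column (axis 0), per row (axis 1), whole matrix].
    return {
        name: [
            func(matrix, axis=0).tolist(),
            func(matrix, axis=1).tolist(),
            func(matrix).tolist(),
        ]
        for name, func in operations.items()
    }

result = calculate([0, 1, 2, 3, 4, 5, 6, 7, 8])
print(result["mean"])  # [[3.0, 4.0, 5.0], [1.0, 4.0, 7.0], 4.0]
```

Calling `.tolist()` converts NumPy arrays and scalars to plain Python lists and numbers, which keeps the returned dictionary easy to compare in unit tests.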

3. Medical Data Visualizer

Cardiovascular Disease Risk Analysis & Data Visualization

Analyzes the Kaggle Cardiovascular Disease dataset (70,000 patients) to explore relationships between lifestyle factors, biomarkers, and cardiovascular disease risk through exploratory data analysis and statistical visualization.

Key Features:

  • Categorical plot - counts of categorical variables (active, smoke, alcohol, cholesterol, glucose, BMI) stratified by CVD status
  • Correlation heatmap - Pearson correlation matrix with upper triangle masked for clarity
  • Feature correlation rankings - horizontal bar chart showing which features correlate most strongly with CVD
  • Age & BMI group analysis - CVD prevalence by age group and BMI category
  • Biomarker distributions - cholesterol and glucose normalized to binary values (normal vs. elevated)
  • Feature engineering - BMI calculation from weight/height, categorical binning for age and BMI groups
  • Data cleaning - removes physiologically impossible values (e.g., DBP > SBP) and outliers outside the 2.5–97.5 percentile range
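
The feature engineering, normalization, and cleaning steps listed above can be sketched in pandas. The rows below are made up, the column names (height, weight, ap_hi, ap_lo, cholesterol, gluc) are assumed to follow the Kaggle dataset, and applying the percentile filter to height and weight is an assumption, since the list does not say which columns are trimmed.

```python
import pandas as pd

# Made-up rows; column names are assumed to follow the Kaggle dataset
# (height in cm, weight in kg, ap_hi/ap_lo = systolic/diastolic pressure).
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 200],
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 120.0],
    "ap_hi": [120, 130, 80, 140, 125, 135, 150],
    "ap_lo": [80, 85, 120, 90, 80, 85, 95],
    "cholesterol": [1, 2, 1, 3, 1, 1, 2],
    "gluc": [1, 1, 2, 1, 3, 1, 1],
})

# Feature engineering: BMI = weight (kg) / height (m) squared.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Normalization: 0 = normal (original value 1), 1 = elevated (original value > 1).
for column in ["cholesterol", "gluc"]:
    df[column] = (df[column] > 1).astype(int)

def within_percentiles(series, low=0.025, high=0.975):
    """Boolean mask of values inside the [low, high] quantile range."""
    return series.between(series.quantile(low), series.quantile(high))

# Cleaning: drop rows where diastolic exceeds systolic pressure, and rows
# whose height or weight (assumed columns) fall outside the percentile range.
clean = df[
    (df["ap_lo"] <= df["ap_hi"])
    & within_percentiles(df["height"])
    & within_percentiles(df["weight"])
]

print(clean[["height", "weight", "bmi", "cholesterol", "gluc"]])
```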

Key Findings:

  • Balanced binary outcome: 50/50 split between CVD and non-CVD - well suited to supervised learning without class resampling
  • Lifestyle factors paradox: Activity, smoking, and alcohol show ~0 correlation with CVD - may indicate dataset bias or confounding variables
  • Real physiological drivers: Systolic BP (r≈+0.4), cholesterol (r≈+0.3), and age (r≈+0.2) show the strongest correlations with CVD
  • Age-BMI interaction: Overweight prevalence increases with age; BMI categories show age-dependent CVD risk
  • Data quality: No impossible values (DBP > SBP) remain after cleaning; outliers removed conservatively (2.5–97.5 percentile)
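
The masked correlation matrix and the correlation ranking behind these findings can be sketched as follows. The data is made up and the target column name `cardio` is an assumption; plotting is left out so the example stays self-contained.

```python
import numpy as np
import pandas as pd

# Made-up values; "cardio" is assumed to be the CVD target column.
df = pd.DataFrame({
    "cardio": [0, 0, 0, 1, 1, 1],
    "ap_hi": [110, 120, 115, 140, 150, 145],
    "active": [1, 0, 1, 0, 1, 0],
})

corr = df.corr()  # Pearson correlation by default

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower_triangle = corr.mask(mask)

# Rank features by the absolute value of their correlation with the target.
ranking = corr["cardio"].drop("cardio").abs().sort_values(ascending=False)

print(lower_triangle.round(2))
print(ranking.round(2))
```

In a plot, the same boolean array can be passed to `seaborn.heatmap` through its `mask` parameter, so each pair of features appears only once.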

4. Page View Time Series Visualizer

freeCodeCamp Forum Daily Traffic Analysis (2016–2019)

Visualizes and analyzes 1,238 days of freeCodeCamp.org forum page views - revealing growth trends, seasonal patterns, and distribution shifts across a 3.5-year period. Data is cleaned by removing the top/bottom 2.5% of outliers.

Key Features:

  • Line plot: Tracks daily page views with clear visibility of growth trajectory and spike events
  • Grouped bar chart: Average monthly views by year - reveals seasonal patterns and year-over-year comparison
  • Box plots: Dual side-by-side plots showing year-wise trend (upward distribution shift) and month-wise seasonality (October–November peaks)
  • Rolling mean analysis: 30-day centred rolling average isolates the long-term trend from daily noise
  • Statistical outlier documentation: Box plots display IQR-based outlier dots with full transparency on data quality
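
The quantile-based cleaning, the centred rolling mean, and the monthly averages behind the grouped bar chart can be sketched in pandas. The series below is made up (views grow linearly), and the column name `value` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up daily counts that grow linearly; "value" is an assumed column name.
dates = pd.date_range("2016-05-09", periods=200, freq="D")
df = pd.DataFrame({"value": np.arange(1, 201) * 100}, index=dates)

# Drop days in the bottom or top 2.5% of page views.
low, high = df["value"].quantile(0.025), df["value"].quantile(0.975)
clean = df[df["value"].between(low, high)].copy()

# Centred rolling mean over 30 rows (30 days where no day was removed).
clean["rolling_30"] = clean["value"].rolling(window=30, center=True).mean()

# Average daily views per month, one row per year (input for a grouped bar chart).
clean["year"] = clean.index.year
clean["month"] = clean.index.month
monthly = clean.groupby(["year", "month"])["value"].mean().unstack()

print(len(df), "->", len(clean), "rows after cleaning")
print(monthly.round(1))
```

The rolling column is NaN near both ends of the series because a centred window needs a full 30 observations around each point.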

Key Findings:

  • Explosive growth: Forum traffic grew ~3.3× from 2016 to 2019 (mean daily views: ~30K → ~100K)
  • Acceleration phase: Clear growth acceleration begins in late 2018 - detected via 30-day rolling mean inflection
  • Academic seasonality: January–February and October–November show consistently elevated activity; June–August show dips - suggests student/educator-driven traffic
  • Improving consistency: 2019 shows a narrower gap between mean and max values - traffic became more stable, with fewer extreme spikes relative to baseline
  • Statistical outliers present: Even after 2.5% quantile removal, box plots reveal IQR-based outliers - genuine exceptional traffic days, especially in peak months
  • Partial-year caveat: 2016 data begins in May (not January) - full-year comparison with 2017–2019 requires interpretation care
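
Outlier dots can survive the quantile trim because box plots use a different rule: by default, Matplotlib and Seaborn flag points lying more than 1.5×IQR beyond the quartiles. A minimal sketch of that rule on made-up numbers:

```python
import numpy as np

views = np.array([10, 12, 13, 14, 15, 16, 17, 18, 50])  # made-up values

# Quartiles and interquartile range.
q1, q3 = np.percentile(views, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = views[(views < lower_fence) | (views > upper_fence)]

print(q1, q3, iqr, outliers)
```

Because the fences depend on the spread of each group (each year or month in a box plot), a day can be unremarkable in the full dataset and still be an outlier within its own group.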

5. Sea Level Predictor

Global Sea Level Rise Analysis & 2050 Forecasting

Analyzes global average sea level measurements (1880–present) from NOAA/CSIRO data - predicting sea level rise through 2050 using linear regression across two distinct time windows.

Key Features:

  • Scatter plot - Visualizes historical sea level observations with clear temporal trends
  • Long-term regression (1880–present) - Establishes baseline 140-year trend slope and projects through 2050
  • Recent trend regression (2000–present) - Captures acceleration phase by fitting recent 20+ years separately, revealing steeper rise rate
  • Dual forecasts - Compares pessimistic (recent trend) vs. conservative (full history) scenarios for 2050 predictions
  • Statistical metrics - Reports slope (inches/year) and y-intercept for both regression models
  • Annotated visualization - Line labels display slope values for easy interpretation
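
The two-window regression described above can be sketched with `scipy.stats.linregress`. The series here is synthetic (0.05 in/year before 2000 and 0.15 in/year afterwards, values chosen only for illustration), so the printed slopes are not the project's results.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic series: 0.05 in/year before 2000, 0.15 in/year afterwards.
years = np.arange(1880, 2014)
levels = np.where(years < 2000, 0.05 * (years - 1880), 6.0 + 0.15 * (years - 2000))

# Fit 1: the full record. Fit 2: only the years from 2000 onward.
full_fit = linregress(years, levels)
recent = years >= 2000
recent_fit = linregress(years[recent], levels[recent])

def predict(fit, year):
    """Extrapolate a fitted line to the given year."""
    return fit.slope * year + fit.intercept

for label, fit in [("full record", full_fit), ("since 2000", recent_fit)]:
    print(f"{label}: slope={fit.slope:.3f} in/year, 2050 level={predict(fit, 2050):.1f} in")
```

Fitting the recent window separately lets the steeper recent slope drive that forecast, which is why the two 2050 projections diverge.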

Key Findings:

  • Accelerating rise: Recent sea level rise rate (~0.13 in/year) significantly exceeds long-term average (~0.06 in/year) - roughly 2.2× faster
  • 2050 projections: Conservative model predicts ~14.5 inches rise from 1880 baseline; recent-trend model projects ~20+ inches - wide range reflects climate uncertainty
  • Inflection point: Acceleration becomes evident after year 2000, aligning with intensified climate change impacts
  • Linear extrapolation limits: Assumes constant rates; actual sea level rise may be nonlinear due to ice sheet collapse and accelerating thermal expansion
  • Data quality: CSIRO Adjusted Sea Level provides satellite-era continuity with pre-satellite tide gauge records

Getting Started

Prerequisites

Ensure Python 3.7 or later is installed:

python --version

Installation

  1. Clone this repository:

    git clone https://github.com/pedromst2000/Data-Analysis.git
    cd Data-Analysis
  2. Navigate to a project folder:

    cd <project-folder>
  3. Install dependencies:

    pip install -r requirements.txt

    Dependency Pinning: requirements.txt files pin specific versions to ensure reproducibility across projects. If installation fails, install packages individually:

    pip install pandas numpy matplotlib seaborn scipy

    Note: You only need to install dependencies if you want to run the Python scripts. The Jupyter notebooks use the libraries already available in the Colab environment or your Jupyter kernel, without additional setup.
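
To confirm the installation before running a project, a short standard-library check like the one below can help. It is an optional sketch, not part of the repository; the package list is taken from the pip command above.

```python
import importlib.util

# Import names taken from the pip command above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "scipy"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All dependencies found.")
```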

Running Projects

Option A - Python Script

Each project is fully self-contained. Run the main script to execute the analysis and display unit test results:

python main.py

Output includes:

  • Function results (statistics, aggregations, visualizations)
  • Unit test pass/fail status with detailed assertions
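
For readers unfamiliar with unittest, the pass/fail report comes from test cases like the ones below. This is an illustrative sketch only: `mean_of_nine` and the test class are hypothetical stand-ins, not the contents of any project's test_module.py.

```python
import unittest

def mean_of_nine(numbers):
    """Hypothetical stand-in for a project function under test."""
    if len(numbers) != 9:
        raise ValueError("List must contain nine numbers.")
    return sum(numbers) / 9

class MeanOfNineTests(unittest.TestCase):
    def test_mean(self):
        self.assertAlmostEqual(mean_of_nine(list(range(9))), 4.0)

    def test_rejects_wrong_length(self):
        with self.assertRaises(ValueError):
            mean_of_nine([1, 2, 3])

# Run the tests programmatically and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanOfNineTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("All tests passed:", result.wasSuccessful())
```

With `verbosity=2`, the runner prints one line per test with its status, followed by a summary of how many tests ran.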

Option B - Jupyter Notebook (Local)

Each project folder contains a .ipynb notebook for interactive, cell-by-cell exploration. Open it with Jupyter:

jupyter notebook <notebook-name>.ipynb

Or simply open the .ipynb file directly in VS Code (with the Jupyter extension installed).

Note: If Jupyter is not installed, run pip install notebook first.

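Option C - Google Colab (Cloud)

Each project's notebook is also available on Google Colab, so it can be run in the browser with no local setup.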


Credits & Attribution

© Built with freeCodeCamp

These projects are derived from the freeCodeCamp Data Analysis with Python certification, leveraging their official boilerplate templates as starting foundations. All solutions have been extended with deeper exploratory data analysis, advanced visualization techniques, and statistical insights beyond certification requirements.

Certification Earned: Data Analysis with Python - v7

Full credit and attribution go to the freeCodeCamp open-source community for the curriculum and boilerplate designs.

