Skip to content

Suresh-28/-Layoffs-Data-Cleaning-Project-SQL-

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

Layoffs Data Cleaning Project

Cleaned and standardized a raw layoffs dataset using SQL for analysis and reporting purposes

Tools & Technologies Used:

  • MySQL 8.0+
  • SQL Queries (CTEs, Window Functions, JOINs)
  • STR_TO_DATE(), TRIM(), LIKE

Dataset Source:

Step-by-Step Cleaning Process:

Step 1: Remove Duplicates

  • Created layoffs_staging as a copy of original table.
  • Added row_num using ROW_NUMBER() PARTITION BY company, industry, total_laid_off, percentage_laid_off, date.
  • Created layoffs_staging2 with row_num column.
  • Deleted all rows where row_num > 1.

Step 2: Standardize Data

  • Trimmed whitespace from company names: UPDATE layoffs_staging2 SET company = TRIM(company);
  • Standardized crypto-related industries to 'Crypto': UPDATE layoffs_staging2 SET industry = 'Crypto' WHERE industry LIKE 'cry%';
  • Fixed country names: Updated "United States." → "United States" using TRIM(TRAILING '.' FROM country).
  • Converted date from TEXT ('MM/DD/YYYY') to DATE type: UPDATE layoffs_staging2 SET date = STR_TO_DATE(date, '%m/%d/%Y'); ALTER TABLE layoffs_staging2 MODIFY COLUMN date DATE;

Step 3: Handle Null / Blank Values

  • Changed empty strings in industry to NULL: UPDATE layoffs_staging2 SET industry = NULL WHERE industry = '';
  • Filled missing industry values using self-JOIN on company: UPDATE layoffs_staging2 lay2 JOIN layoffs_staging2 lay1 ON lay1.company = lay2.company SET lay1.industry = lay2.industry WHERE (lay1.industry IS NULL OR lay1.industry = '') AND lay2.industry IS NOT NULL;

Step 4: Remove Unnecessary Columns

  • Kept all columns (none removed); row_num was dropped after deduplication.

Final Table Structure (layoffs_staging2):

  • company (TEXT)
  • location (TEXT)
  • industry (TEXT) — standardized
  • total_laid_off (INT)
  • percentage_laid_off (TEXT)
  • date (DATE)
  • stage (TEXT)
  • country (TEXT) — cleaned
  • funds_raised_millions (INT)

Sample Queries:

Top 5 Industries by Total Laid Off: SELECT industry, SUM(total_laid_off) AS total_laid_off FROM layoffs_staging2 WHERE total_laid_off IS NOT NULL GROUP BY industry ORDER BY total_laid_off DESC LIMIT 5;

Monthly Layoff Trends: SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(total_laid_off) AS monthly_layoffs FROM layoffs_staging2 WHERE date IS NOT NULL GROUP BY year, month ORDER BY year, month;

Highest % Laid Off by Company: SELECT company, percentage_laid_off, total_laid_off, country FROM layoffs_staging2 WHERE percentage_laid_off IS NOT NULL ORDER BY CAST(percentage_laid_off AS DECIMAL(5,2)) DESC LIMIT 10;

Key Insights:

  • Crypto had the highest total layoffs (>100k).
  • US, India, Germany were top countries.
  • Peak layoffs occurred Q1–Q2 2022.

Next Steps:

  • Build dashboard in Tableau/Power BI.
  • Join with economic data (interest rates, GDP).
  • Add Python EDA notebook.

Acknowledgments:

  • Kaggle for dataset
  • Public reports from TechCrunch, Bloomberg, LinkedIn

License: Educational use only.

Contact: Email: your.email@example.com LinkedIn: linkedin.com/in/yourprofile GitHub: github.com/yourgithubusername

Star ⭐ this repo if you found it helpful!

About

Cleaned and standardized a raw layoffs dataset using SQL for analysis and reporting purposes.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published