A structured, hands-on Pandas practice repository covering core data manipulation concepts, best practices, and exploratory data analysis using real-world datasets.

aarogyaojha/pandas-python

Pandas Practice & Learning Repository

1. Project Overview

This repository is a structured guide to Pandas for data manipulation and analysis. It is laid out as a "Zero to Hero" roadmap, covering everything from basic Series creation to memory optimization and Exploratory Data Analysis (EDA) on a real-world dataset.

The core of this project is the notebooks/pandas_practice.ipynb Jupyter Notebook, which is organized into logical, progressive modules.

2. Why This Repository Exists

Data Engineering and Data Science interviews often focus heavily on data manipulation skills. While many tutorials exist, few focus on industry best practices, such as:

  • Vectorized operations over loops.
  • Explicit indexing (.loc vs .iloc).
  • Proper handling of missing data (NaN).
  • Memory optimization strategies.

This repository bridges the gap between basic syntax and professional application.
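As a minimal sketch of three of the practices listed above (vectorized operations, explicit handling of NaN, and memory optimization), consider the snippet below. The DataFrame and its column names are hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 20.0, np.nan, 40.0],
    "qty": [1, 2, 3, 4],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
})

# Vectorized arithmetic: one expression over whole columns, no Python loop.
df["total"] = df["price"] * df["qty"]

# Missing data: count NaNs first, then pick an explicit fill strategy.
n_missing = int(df["price"].isna().sum())
df["price_filled"] = df["price"].fillna(df["price"].median())

# Memory: low-cardinality strings can be stored as a category dtype,
# and small integers can be downcast to a narrower integer type.
df["city"] = df["city"].astype("category")
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")

print(df.dtypes)
print(df.memory_usage(deep=True))
```

Whether the category conversion actually saves memory depends on the number of rows and distinct values, so it is worth comparing `memory_usage(deep=True)` before and after on real data.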

3. What You Will Learn

By working through this repository, you will master:

  • Core Structures: Deep dive into Series and DataFrames.
  • Data Cleaning: Handling missing values, duplicates, and string manipulation.
  • Advanced Selection: Boolean masking, querying, and conditional logic.
  • Aggregation: GroupBy, pivoting, and statistical summaries.
  • Performance: Writing efficient, vectorized Pandas code.
  • EDA: Applying these skills to analyze a real Spam/Ham dataset.
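To give a flavour of the cleaning, selection, and aggregation topics listed above, here is a short sketch on a hypothetical sales table. The data and column names are assumptions made for illustration; the notebook may use different examples.

```python
import pandas as pd

# Hypothetical sales records; names and values are illustrative only.
raw = pd.DataFrame({
    "region": [" north", "South ", "south", " north", "North"],
    "product": ["A", "A", "B", "A", "B"],
    "units": [5, 3, None, 5, 7],
})

clean = (
    raw.assign(region=raw["region"].str.strip().str.lower())  # normalize strings
       .drop_duplicates()                                     # drop exact repeats
       .dropna(subset=["units"])                              # drop rows missing units
)

# Boolean mask and the equivalent query() expression.
big_mask = clean[clean["units"] > 4]
big_query = clean.query("units > 4")

# GroupBy aggregation and a pivot table over the same data.
per_region = clean.groupby("region")["units"].sum()
pivot = clean.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

print(per_region)
print(pivot)
```

Note that the duplicate is only detected after the `region` strings are normalized, which is why the cleanup step comes first in the chain.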

4. Repository Structure

pandas-practice/
│
├── notebooks/
│   └── pandas_practice.ipynb   # Main learning notebook
│
├── data/
│   └── spam.csv                # Real-world dataset for EDA
│
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
└── .gitignore                  # Git configuration

5. Installation and Setup

Prerequisites

  • Python 3.8+
  • Git

6. How to Clone and Run

  1. Clone the repository

    git clone https://github.com/<username>/pandas-practice.git
    cd pandas-practice
  2. Install dependencies. Using a virtual environment is recommended.

    pip install -r requirements.txt
  3. Launch Jupyter Notebook

    jupyter notebook

    Open notebooks/pandas_practice.ipynb to begin.

7. Dataset Information

The project includes data/spam.csv, a classic dataset for text classification.

  • Content: SMS messages labelled as 'spam' or 'ham' (legitimate).
  • Usage: Used in Sections 12 and 13 of the notebook to demonstrate file reading, cleaning, and Exploratory Data Analysis.
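The kind of first-pass EDA this dataset supports can be sketched as follows. To keep the example self-contained, it builds a tiny in-memory stand-in instead of reading data/spam.csv; the column names `label` and `message` are assumptions, so check the real file's header and encoding before adapting it.

```python
import pandas as pd

# Stand-in for data/spam.csv. The column names "label" and "message" are
# assumptions; the real file's header may differ.
sms = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam", "ham"],
    "message": [
        "See you at lunch",
        "WIN a FREE prize now!!!",
        "Call me later",
        "FREE entry, text WIN to claim",
        "See you at lunch",
    ],
})

# Cleaning: drop exact duplicate rows.
sms = sms.drop_duplicates().reset_index(drop=True)

# Class balance.
label_counts = sms["label"].value_counts()

# Simple text features, computed with vectorized string methods.
sms["length"] = sms["message"].str.len()
sms["mentions_free"] = sms["message"].str.contains("free", case=False)

# Compare the features across classes.
mean_length = sms.groupby("label")["length"].mean()
free_rate = sms.groupby("label")["mentions_free"].mean()

print(label_counts)
print(mean_length)
print(free_rate)
```

With the real file, the stand-in DataFrame would be replaced by a `pd.read_csv("data/spam.csv", ...)` call whose arguments depend on how the CSV is actually encoded and laid out.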

8. Notes on Pandas Best Practices

  • Avoid Loops: Always look for a vectorized solution first.
  • Be Explicit: Use .loc and .iloc instead of relying on ambiguous [] indexing.
  • Chain Methods: Use method chaining (e.g., df.query().groupby().mean()) for readable code, but don't overdo it.
  • Copy vs View: Be aware of SettingWithCopyWarning. Use .copy() when creating a new DataFrame from a subset that you intend to modify.
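The indexing, chaining, and copy notes above can be illustrated with a small sketch. The DataFrame is hypothetical, and its index deliberately starts at 100 so that labels and positions differ.

```python
import pandas as pd

# Hypothetical data with a non-default index so labels and positions differ.
df = pd.DataFrame(
    {"team": ["a", "b", "a", "b"], "score": [10, 20, 30, 40]},
    index=[100, 101, 102, 103],
)

# .loc selects by label, .iloc selects by integer position.
by_label = df.loc[101, "score"]   # row labelled 101
by_position = df.iloc[1, 1]       # second row, second column

# Method chaining: filter, group, and aggregate in one readable expression.
mean_by_team = df.query("score >= 20").groupby("team")["score"].mean()

# Take an explicit copy before modifying a subset, so the intent is clear
# and the original DataFrame is left untouched.
subset = df[df["team"] == "a"].copy()
subset["score"] = subset["score"] + 1

print(by_label, by_position)
print(mean_by_team)
print(df)
```

Here `by_label` and `by_position` refer to the same cell, reached once through the index label and once through the row position.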
