This repository serves as a comprehensive, structured guide to mastering Pandas for data manipulation and analysis. It is designed as a "Zero to Hero" roadmap, covering everything from basic Series creation to memory optimization and Exploratory Data Analysis (EDA) on real-world datasets.
The core of this project is the `notebooks/pandas_practice.ipynb` Jupyter Notebook, which is organized into logical, progressive modules.
Data Engineering and Data Science interviews often focus heavily on data manipulation skills. While many tutorials exist, few focus on industry best practices, such as:
- Vectorized operations over loops.
- Explicit indexing (`.loc` vs `.iloc`).
- Proper handling of missing data (`NaN`).
- Memory optimization strategies (see the sketch below).
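As one example, here is a minimal sketch of the memory-optimization idea; the DataFrame, column names, and values are invented purely for demonstration:

```python
import pandas as pd

# Illustrative DataFrame; columns and values are made up for this sketch
df = pd.DataFrame({
    "user_id": range(1_000),
    "country": ["IN", "US", "DE", "FR"] * 250,
    "score": [1.5] * 1_000,
})

print(df.memory_usage(deep=True).sum())  # baseline footprint in bytes

# Downcast numeric columns and store low-cardinality strings as 'category'
df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")
df["score"] = pd.to_numeric(df["score"], downcast="float")
df["country"] = df["country"].astype("category")

print(df.memory_usage(deep=True).sum())  # noticeably smaller footprint
```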
This repository bridges the gap between basic syntax and professional application.
By working through this repository, you will master:
- Core Structures: Deep dive into Series and DataFrames.
- Data Cleaning: Handling missing values, duplicates, and string manipulation.
- Advanced Selection: Boolean masking, querying, and conditional logic.
- Aggregation: GroupBy, pivoting, and statistical summaries.
- Performance: Writing efficient, vectorized Pandas code.
- EDA: Applying these skills to analyze a real Spam/Ham dataset.
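As a quick taste of the topics above, here is a minimal, hedged sketch; the data is invented purely for illustration:

```python
import pandas as pd
import numpy as np

# Core structures: a Series and a DataFrame (toy data for illustration only)
scores = pd.Series([10, 20, 30], name="scores")
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara"],
    "age": [25, np.nan, 32, 41],
})

# Data cleaning: drop duplicate names and fill missing ages with the median
clean = (
    df.drop_duplicates(subset="name")
      .assign(age=lambda d: d["age"].fillna(d["age"].median()))
)

# Advanced selection: boolean masking
over_30 = clean[clean["age"] > 30]

# Aggregation: a simple statistical summary
print(clean["age"].describe())
```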
The repository is laid out as follows:

```
pandas-practice/
│
├── notebooks/
│   └── pandas_practice.ipynb   # Main learning notebook
│
├── data/
│   └── spam.csv                # Real-world dataset for EDA
│
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
└── .gitignore                  # Git configuration
```
To get started, you will need:
- Python 3.8+
- Git
1. Clone the repository

   ```bash
   git clone https://github.com/<username>/pandas-practice.git
   cd pandas-practice
   ```

2. Install dependencies (it is recommended to use a virtual environment)

   ```bash
   pip install -r requirements.txt
   ```

3. Launch Jupyter Notebook

   ```bash
   jupyter notebook
   ```

   Open `notebooks/pandas_practice.ipynb` to begin.
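For the virtual environment recommended in step 2, a minimal setup might look like the following sketch (assuming Python's built-in `venv` module and a POSIX shell):

```bash
# Create and activate an isolated environment, then install the dependencies into it
python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```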
The project includes `data/spam.csv`, a classic dataset for text classification.
- Content: SMS messages labelled as 'spam' or 'ham' (legitimate).
- Usage: Used in Sections 12 and 13 to demonstrate file reading, cleaning, and Exploratory Data Analysis.
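A minimal first look at the file might resemble the sketch below. The `latin-1` encoding and the `v1` (label) / `v2` (message text) column names are assumptions based on common copies of this dataset, so adjust them to whatever `data/spam.csv` actually contains:

```python
import pandas as pd

# Load the SMS dataset; encoding and column names are assumptions, not guarantees
df = pd.read_csv("data/spam.csv", encoding="latin-1")

# First rows and the spam/ham class balance
print(df.head())
print(df["v1"].value_counts(normalize=True))
```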
- Avoid Loops: Always look for a vectorized solution first.
- Be Explicit: Use `.loc` and `.iloc` instead of relying on ambiguous `[]` indexing.
- Chain Methods: Use method chaining (e.g., `df.query().groupby().mean()`) for readable code, but don't overdo it.
- Copy vs View: Be aware of `SettingWithCopyWarning`. Use `.copy()` when creating a new DataFrame from a subset. (All four points are illustrated in the sketch below.)
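A minimal sketch tying these practices together, using data invented purely for illustration:

```python
import pandas as pd

# Toy data invented for this sketch
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "sales": [120, 300, 150, 90],
})

# Be explicit: label-based selection with .loc instead of ambiguous [] chaining
pune_sales = df.loc[df["city"] == "Pune", "sales"]

# Avoid loops: vectorized arithmetic over the whole column
df["sales_with_tax"] = df["sales"] * 1.18

# Chain methods (in moderation) for a readable aggregation
summary = (
    df.query("sales > 100")
      .groupby("city")["sales"]
      .mean()
)

# Copy vs view: take an explicit copy of a subset before modifying it
pune_df = df[df["city"] == "Pune"].copy()
pune_df["sales"] = pune_df["sales"] + 10
```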