This repository contains an end-to-end analysis of loan applications using PySpark. It includes data manipulation, feature engineering, and binary classification models.
The data for this project comes from the Kaggle competition Home Credit Default Risk. The goal of the competition is to predict each applicant's ability to repay a loan.
- Number of Entries: 307,511
- Number of Columns: 122
- Column Types: Float64(65), Int64(41), Object(16)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | ... |
|---|---|---|---|---|---|
| 100002 | 1 | Cash loans | M | N | ... |
| 100003 | 0 | Cash loans | F | N | ... |
| 100004 | 0 | Revolving loans | M | Y | ... |
| 100006 | 0 | Cash loans | F | N | ... |
| 100007 | 0 | Cash loans | M | N | ... |
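The summary figures above (row count, column count, column types) can be checked with a few pandas calls. The sketch below is illustrative only: it rebuilds the five preview rows and five visible columns from the table in memory, so the numbers it produces describe that small sample rather than the full dataset. The file name in the commented-out line is an assumption and is not taken from this repository.

```python
import pandas as pd

# To inspect the full dataset instead, load the competition CSV.
# The file name is an assumption; adjust it to match your download.
# df = pd.read_csv("application_train.csv")

# Five preview rows and the five visible columns from the table above.
sample = pd.DataFrame(
    {
        "SK_ID_CURR": [100002, 100003, 100004, 100006, 100007],
        "TARGET": [1, 0, 0, 0, 0],
        "NAME_CONTRACT_TYPE": [
            "Cash loans",
            "Cash loans",
            "Revolving loans",
            "Cash loans",
            "Cash loans",
        ],
        "CODE_GENDER": ["M", "F", "M", "F", "M"],
        "FLAG_OWN_CAR": ["N", "N", "Y", "N", "N"],
    }
)

# Number of entries and number of columns.
n_rows, n_cols = sample.shape

# Count of columns per dtype (the "Column Types" line in the summary).
dtype_counts = sample.dtypes.astype(str).value_counts().to_dict()

# Share of rows with TARGET == 1 in this sample.
target_rate = sample["TARGET"].mean()

print(n_rows, n_cols)
print(dtype_counts)
print(target_rate)
```

Running the same three calls on the full DataFrame should reproduce the entry, column, and dtype counts listed above.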
Income-Spark.ipynb: Main Jupyter Notebook for the project.
- Import Essential Libraries: Libraries like `os` and `pandas` are imported.
- Initialize PySpark Configuration: The Spark Configuration and Context are initialized.
- Import PySpark and Initialize: The PySpark library is imported and the Spark Session is initialized.
- Importing essential libraries

  ```python
  import os
  import pandas as pd
  ```

- Initializing PySpark Configuration

  ```python
  from pyspark import SparkConf, SparkContext
  ```

- Initializing Spark Session

  ```python
  import pyspark
  from pyspark.sql import SparkSession
  ```
To run the Jupyter Notebook, execute:

```bash
jupyter notebook Loan-Application-PySpark.ipynb
```