- Problem Statement
- Dataset
- Workflow
- Setup
- Testing
- Dockerization
- Deployment
- Application
- Model Training & Evaluation
- Challenges & Solutions
- Impact
- Folder Structure
- License
- In the used car market, buyers and sellers often struggle to determine a fair price for their vehicle.
- This project aims to provide accurate and transparent pricing for used cars by analyzing real-world data.
- To train the model, I collected real-world used car listings data directly from the Cars24 website.
- Since Cars24 uses dynamically loaded content, a static scraper would not capture all the data.
- Instead, I implemented an automated Selenium + BeautifulSoup Python Script.
Input : URL of a Cars24 listing page to scrape.
- The script uses `ChromeDriverManager` to install and manage the drivers without manual setup.
- Loads the given URL in a real browser session.
- Scrolls down the page in increments, with short random pauses (2-4 seconds) between scrolls.
- This ensures all dynamically loaded listings are fetched.
- Stops scrolling when the bottom of the page is reached or no new content loads.
- Once fully loaded, it retrieves the complete DOM (including dynamically injected elements).
- Returns a BeautifulSoup object containing the entire page's HTML for later parsing and data extraction.
Note
At this stage, no data is extracted, the output is just the complete HTML source.
It will be passed to a separate script that extracts features like price, model, year, transmission, etc.
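A minimal sketch of this scroll-and-scrape step, assuming Selenium 4 with `webdriver_manager`; the function name and scroll logic below are illustrative, not the exact code from `scrape_code.ipynb` :

```python
# Minimal sketch of scrape_car_listing (illustrative; the notebook's actual
# implementation may differ in naming and waits).
import random
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def scrape_car_listing(url):
    # ChromeDriverManager downloads and manages a matching driver automatically
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(url)

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom and pause so dynamically loaded cards can render
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(2, 4))
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no new content loaded -> stop
            break
        last_height = new_height

    # Full DOM, including dynamically injected elements
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()
    return soup
```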
Input : BeautifulSoup object (soup) containing the fully-rendered HTML of a Cars24 listing page.
- Looks for `<span>` elements with class `sc-braxZu kjFjan`.
- Extracts the text using `.text` into a list called `model_name`.
- The code only keeps those models that start with `2` and stores them in `clean_model_name`.
Important
Inspect the HTML Element : <span id class="sc-braxZu kjFjan">2016 Maruti Wagon R 1.0</span>
Tag : <span> → id (empty) → class : sc-braxZu kjFjan (two classes, separated by space)
However, when you hover over it in the browser, it shows : span.sc-braxZu.kjFjan
CSS uses a dot . to indicate a class, so the dots are not part of the class names themselves.
This can be confusing for someone with little HTML/CSS knowledge, so it is worth clarifying.
- Looks for `<p>` elements with class `sc-braxZu kvfdZL` (each holds one specification value).
- These are appended to the `specs` list.
['69.95k km',
'Petrol',
'Manual',
'1st owner',
'DL-1C',
'70.72k km',
'Diesel',
'Manual',
'2nd owner',
'UP-14',
'15.96k km',
'CNG',
'Manual',
'1st owner',
'UP-16',...]
- The flat `specs` list is split into consecutive groups of 5 (`clean_specs.append(specs[i:i+5])`).
- Each group corresponds to one listing's set of specification values.
[['69.95k km', 'Petrol', 'Manual', '1st owner', 'DL-1C'],
['70.72k km', 'Diesel', 'Manual', '2nd owner', 'UP-14'],
['15.96k km', 'CNG', 'Manual', '1st owner', 'UP-16'],...]
- From each 5-item group, the script extracts :
- `clean_specs[0]` → `km_driven`
- `clean_specs[1]` → `fuel_type`
- `clean_specs[2]` → `transmission`
- `clean_specs[3]` → `owner`
- `clean_specs[4]` → `number_plate` (exists but is of no use).
- `soup.find_all('p', 'sc-braxZu cyPhJl')` collects price elements into the `price` list.
- The script then slices `price = price[2:]`, removing the first two entries (non-listing elements on the page).
- So the remaining prices align with the listings.
['₹3.09 lakh',
'₹5.71 lakh',
'₹7.37 lakh',...]
- `soup.find_all('a', 'styles_carCardWrapper__sXLIp')` collects the anchor tag for each card and extracts its `href`.
['https://www.cars24.com/buy-used-honda-amaze-2018-cars-noida-11068642783/',
'https://www.cars24.com/buy-used-ford-ecosport-2020-cars-noida-11234948707/',
'https://www.cars24.com/buy-used-tata-altroz-2024-cars-noida-10563348767/',...]
- All lists are assembled into a `pandas.DataFrame`.
- The column names are `model_name`, `km_driven`, `fuel_type`, `transmission`, `owner`, `price`, `link`.
- Finally, the function returns the above DataFrame for further cleaning, analysis and modelling (a condensed sketch of this function is shown below).
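Putting these steps together, a condensed sketch of the parsing function might look like this (illustrative; it assumes the class names above are unchanged, and the exact helper code lives in `scrape_code.ipynb`) :

```python
# Illustrative sketch of get_car_details; helper/variable names follow the
# description above, not necessarily the notebook's verbatim code.
import pandas as pd
from bs4 import BeautifulSoup


def get_car_details(soup: BeautifulSoup) -> pd.DataFrame:
    # Model names, keeping only entries that start with a year (e.g. "2016 ...")
    model_name = [span.text for span in soup.find_all('span', 'sc-braxZu kjFjan')]
    clean_model_name = [m for m in model_name if m.startswith('2')]

    # Flat list of specification values, grouped 5 per listing
    specs = [p.text for p in soup.find_all('p', 'sc-braxZu kvfdZL')]
    clean_specs = [specs[i:i + 5] for i in range(0, len(specs), 5)]

    # Prices (the first two matches on the page are not listings)
    price = [p.text for p in soup.find_all('p', 'sc-braxZu cyPhJl')][2:]

    # Link to each individual listing
    link = [a.get('href') for a in soup.find_all('a', 'styles_carCardWrapper__sXLIp')]

    return pd.DataFrame({
        'model_name': clean_model_name,
        'km_driven': [s[0] for s in clean_specs],
        'fuel_type': [s[1] for s in clean_specs],
        'transmission': [s[2] for s in clean_specs],
        'owner': [s[3] for s in clean_specs],
        'price': price,
        'link': link,
    })
```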
Input : List of URLs for individual car listings (link from the previous DataFrame).
- Loops over the list of individual car listing page URL.
- Uses the `requests` library to retrieve each page's HTML content.
- Adds a User-Agent header to simulate a real browser and reduce blocking risk.
- Applies a random timeout (4-8 seconds) between requests to avoid overloading the server.
- Converts the response into a BeautifulSoup object using the `lxml` parser for fast, reliable parsing.
- Searches for all `<p>` tags with the class `sc-braxZu jjIUAi`.
- Checks if the text exactly matches "Engine capacity".
- If the label is found, grabs the value from the next sibling element (e.g. `1197 cc`) and marks the entry as successfully found.
- If no engine capacity value is found, stores `None` to maintain positional consistency.
- Outputs a list of engine capacities in the same order as the input URLs.
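A condensed sketch of this helper (illustrative; the notebook's actual implementation may differ slightly in naming and error handling) :

```python
# Illustrative sketch of get_engine_capacity based on the steps above.
import random
import time
from typing import List, Optional

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # simulate a real browser


def get_engine_capacity(links: List[str]) -> List[Optional[str]]:
    capacities = []
    for url in links:
        response = requests.get(url, headers=HEADERS, timeout=30)
        soup = BeautifulSoup(response.text, 'lxml')

        value = None
        for label in soup.find_all('p', 'sc-braxZu jjIUAi'):
            if label.text.strip() == 'Engine capacity':
                sibling = label.find_next_sibling()  # e.g. "1197 cc"
                if sibling:
                    value = sibling.text.strip()
                break
        capacities.append(value)  # None keeps positional consistency

        time.sleep(random.uniform(4, 8))  # random pause to avoid overloading the server
    return capacities
```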
Click Here to view Example Function
# Parsing HTML Content of Hyderabad City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-hyderabad/')
# Extracting Car Details of Hyderabad City
hyderabad = get_car_details(soup)

# Parsing HTML Content of Bangalore City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-bangalore/')
# Extracting Car Details of Bangalore City
bangalore = get_car_details(soup)

# Parsing HTML Content of Mumbai City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-mumbai/')
# Extracting Car Details of Mumbai City
mumbai = get_car_details(soup)

# Parsing HTML Content of Delhi City from Cars24 Website
soup = scrape_car_listing('https://www.cars24.com/buy-used-cars-delhi-ncr/')
# Extracting Car Details of Delhi City
delhi = get_car_details(soup)

# Concatenating Car Details of Different Cities into a Single DataFrame
df = pd.concat([hyderabad, bangalore, mumbai, delhi], ignore_index=True)
df.head()

# Extracting engine capacity of each car using its car listing link from Cars24 Website
engine_capacity = get_engine_capacity(df['link'])
# Adding "engine_capacity" column in the DataFrame
df['engine_capacity'] = engine_capacity
# Final DataFrame after Web Scraping
df.head()

The final dataset consists of 2,800+ unique car listings, with each record containing :

- `model_name` : Model name of the car (2014 Hyundai Grand i10, etc).
- `fuel_type` : Type of fuel the car uses (Petrol, Diesel, CNG, Electric).
- `transmission` : Type of transmission the car has (Automatic or Manual).
- `owner` : Number of previous owners (1st owner, 2nd owner, 3rd owner, etc).
- `engine_capacity` : Size of the engine (in cc).
- `km_driven` : Total distance traveled by the car (in km).
- `price` : Selling price of the car (target variable).
Tip
Scraping code in the repository depends on the current structure of the target website.
Websites often update their HTML, element IDs or class names which can break the scraping logic.
So before running the scraper, inspect the website to ensure the HTML structure matches the code.
Update any selectors or parsing logic if the website has changed.
flowchart TB
%% =========================
%% 1) DATA COLLECTION
%% =========================
subgraph S1["1) Data Collection"]
direction TB
A1["Cars24 Website"] --> A2["Scraping Notebook<br/>scrape_code.ipynb"]
A2 --> A3["Raw Dataset<br/>scraped_data.csv"]
end
%% =========================
%% 2) DATA PREPROCESSING + EDA
%% =========================
subgraph S2["2) Data Processing"]
direction TB
B1["Load Raw CSV<br/>pd.read_csv()"]
B2["Cleaning & Preprocessing<br/>Handle Nulls, Format Columns, Fix Types"]
B3["Save Cleaned Dataset<br/>clean_data.parquet"]
B4["Exploratory Data Analysis<br/>Feature Understanding, Trends, Distributions"]
B5["Save Dataset after EDA<br/>clean_data_after_eda.parquet"]
B6["Outlier Removal<br/>IQR/Rules-based Filtering"]
B7["Final Training Dataset<br/>clean_data_with_no_outlier.parquet"]
A3 --> B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7
end
%% =========================
%% 3) MODEL TRAINING
%% =========================
subgraph S3["3) Model Training"]
direction TB
C1["Train Multiple Regression Models"]
C2["Model Optimization with<br/>Hyperparameter Tuning"]
C3["Select Best Model<br/>based on MAE & R2 score"]
C4["Serialize Model Artifact<br/>as model.pkl"]
B7 --> C1 --> C2 --> C3 --> C4
end
%% =========================
%% 4) MODEL SERVING API
%% =========================
subgraph S4["4) Model Serving"]
direction TB
D1["FastAPI App<br/>loads model.pkl"]
D2["Endpoints"]
D2a["GET /<br/>Root"]
D2b["GET /health<br/>Health Check"]
D2c["POST /predict<br/>Returns Price Prediction"]
C4 --> D1 --> D2
D2 --> D2a
D2 --> D2b
D2 --> D2c
D3["Deployment<br/>Render Cloud"]
D1 --> D3
end
%% =========================
%% 5) CONTAINERIZATION
%% =========================
subgraph S5["5) Containerization"]
direction TB
E1["Dockerfile with<br/>Multi-stage Build"]
E2["Build Docker Image<br/>API + Model + Dependencies"]
E3["Push Image to Docker Hub<br/>for reuse & deployment"]
D1 --> E1 --> E2 --> E3
end
%% =========================
%% 6) FRONTEND INTEGRATION
%% =========================
subgraph S6["6) Frontend Integration"]
direction TB
F1["Frontend UI<br/>HTML/CSS/JS"]
F2["User enters car details<br/>as JSON Payload"]
F3["POST request to /predict<br/>endpoint deployed on Render"]
F4["Prediction displayed<br/>on the Website"]
F1 --> F2 --> F3 --> F4
D3 --> F3
end
%% =========================
%% Minimal dark styling (GitHub-safe)
%% =========================
classDef block fill:#0d1117,stroke:#30363d,stroke-width:1px,color:#ffffff;
classDef step fill:#161b22,stroke:#8b949e,stroke-width:1px,color:#ffffff;
class S1,S2,S3,S4,S5,S6 block;
class A1,A2,A3,B1,B2,B3,B4,B5,B6,B7,C1,C2,C3,C4,D1,D2,D2a,D2b,D2c,D3,E1,E2,E3,F1,F2,F3,F4 step;
Follow these steps carefully to setup and run the project on your local machine :
First, you need to clone the project from GitHub to your local system.
git clone https://github.com/themrityunjaypathak/AutoIQ.git

Docker allows you to package the application with all its dependencies.

docker build -t your_image_name .

Tip
Make sure Docker is installed and running on your machine before executing this command.
This project uses a .env file to store configuration settings like model paths, allowed origins, etc.
- Stores environment variables in plain text.
# .env
ENV=environment_name
MAE=mean_absolute_error
PIPE_PATH=pipeline_path
MODEL_FREQ_PATH=model_freq_path
ALLOWED_ORIGINS=list_of_URLs_that_are_allowed_to_access_the_API

Important
Never commit .env to GitHub / Docker.
Add .env to .gitignore and .dockerignore to keep it private.
- Loads and validates environment variables from `.env`.
- Uses Pydantic `BaseSettings` to read environment variables, validate types and provide easy access.
Click Here to view Example Python File
# api/config.py
import os
from pathlib import Path
from typing import List
from pydantic_settings import BaseSettings
# Required Environment Variables
class Settings(BaseSettings):
ENV: str = "dev"
MAE: int
PIPE_PATH: Path
MODEL_FREQ_PATH: Path
ALLOWED_ORIGINS: str # Comma-separated
# Convert ALLOWED_ORIGINS string into a list
@property
def cors_origins(self) -> List[str]:
return [origin.strip() for origin in self.ALLOWED_ORIGINS.split(",")]
# Load .env locally (development), but skips in Render (deployment)
class Config:
env_file = ".env" if not os.getenv("RENDER") else None
# Create an object of Settings class
settings = Settings()

- Uses `settings` from `config.py` in FastAPI.
- Imports the `settings` object to provide the API's metadata dynamically from `.env`.
Click Here to view Example Python File
# api/main.py
import pickle
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from api.config import settings
app = FastAPI(title="AutoIQ by Motor.co")
app.add_middleware(
CORSMiddleware,
allow_origins=settings.cors_origins,
allow_credentials=True,
allow_methods=["GET", "POST"],
allow_headers=["*"],
)
with open(settings.PIPE_PATH, "rb") as f:
pipe = pickle.load(f)
with open(settings.MODEL_FREQ_PATH, "rb") as f:
model_freq = pickle.load(f)

Start the application using Docker. This will run the FastAPI server and handle all the dependencies automatically.

docker run --env-file .env -p 8000:8000 your_image_name \
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

Note
api.main → Refers to the main.py file inside the api folder.
app → The FastAPI instance defined in your code.
--reload → Automatically reloads when code changes (development only).
Once the container is running, open your browser and navigate to :
http://localhost:8000/docs
or
http://127.0.0.1:8000/docs

This opens the Swagger UI for testing the API endpoints.
Access the live API here or Click on the Image below.
When you're done using the application, stop the running container.
docker stop container_id

Once the FastAPI server is running, you can test the API endpoints in Postman or any similar software.
- Launch the Postman application on your computer.
- Click on the "New" button, then select "HTTP" requests.
- Retrieve information from the server without modifying any data.
- Open Postman and create a new request.
- Set the HTTP method to "GET" from the dropdown menu.
- Enter the endpoint URL you want to query.
http://127.0.0.1:8000
- Click the "Send" button to submit the request.
- Status Code : It indicates that the request was successful and the server responded with the requested data.
200 OK
- Response Body (JSON) : This confirms that the API is running and returns the result of your API call.
{
"message":"Pipeline is live"
}
- Send data to a server to create/update a resource.
- Open Postman and create a new request.
- Set the HTTP method to "POST" from the dropdown menu.
- Enter the endpoint URL you want to query.
http://127.0.0.1:8000/predict
- Navigate to the "Headers" tab and add the following : Key → `Content-Type`, Value → `application/json`
- Go to the "Body" tab, select "raw", then choose "JSON" from the format dropdown menu.
- Enter the request payload in JSON format.
{
"brand": "MG",
"model": "HECTOR",
"km_driven": 80000,
"engine_capacity": 1498,
"fuel_type": "Petrol",
"transmission": "Manual",
"year": 2022,
"owner": "1st owner"
}

- Click the "Send" button to submit the request.
- Status Code : It indicates that the server successfully processed the request and generated a prediction.
200 OK
- Response Body (JSON) : This confirms that the API is running and returns the result of your API call.
{
"output": "₹9,69,000 to ₹11,50,000"
}
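If you prefer to test from code instead of Postman, here is a small Python snippet (an illustrative example, assuming the API is running locally on port 8000; the payload mirrors the one above) :

```python
# Programmatic test of the /predict endpoint (assumes the API is running
# locally on port 8000; swap in the Render URL for the deployed service).
import requests

payload = {
    "brand": "MG",
    "model": "HECTOR",
    "km_driven": 80000,
    "engine_capacity": 1498,
    "fuel_type": "Petrol",
    "transmission": "Manual",
    "year": 2022,
    "owner": "1st owner",
}

response = requests.post("http://127.0.0.1:8000/predict", json=payload, timeout=30)
print(response.status_code)  # 200 on success
print(response.json())       # e.g. {"output": "₹9,69,000 to ₹11,50,000"}
```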
Follow these steps carefully to containerize your project with Docker :
- Before starting, make sure Docker is installed on your system.
- Visit Docker → Click on Download Docker Desktop → Choose Windows / Mac / Linux
- Open Docker Desktop → Make sure Docker Engine is Running
- Create a `Dockerfile` and place it in the root folder of your repository.
Click Here to view Example Dockerfile
# Start with the official Python 3.11 image.
# -slim means this is a smaller Debian-based image with fewer preinstalled packages, which makes it lighter.
FROM python:3.11-slim
# Install required system packages for Python libraries.
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
python3-dev \
libopenblas-dev \
liblapack-dev \
gfortran \
&& rm -rf /var/lib/apt/lists/*
# Set the working directory to /app inside the container.
# All future commands (COPY, RUN, CMD) will be executed from here.
WORKDIR /app
# Copies your local requirements.txt into the container's /app folder.
COPY requirements.txt .
# Install all the dependencies from requirements.txt.
# --no-cache-dir prevents pip from keeping installation caches, making the image smaller.
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Copies all the remaining project files (FastAPI app, HTML, CSS, JS, etc.) into /app.
COPY . .
# Expose FastAPI port, so it can be accessed from outside the container.
EXPOSE 8000
# Default command to run the FastAPI app with Uvicorn in production mode.
# --host 0.0.0.0 allows external connections (necessary in Docker).
# --port 8000 specifies the port.
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]- This file tells Docker which files and folders to exclude from the image.
- This keeps the image small and prevents unnecessary files from being copied.
- A
.dockerignorefile is used to exclude all files and folders that are not required to run your application.
Click Here to view Example Dockerignore File
# Virtual Environment
.venv/
# Jupyter Notebooks
*.ipynb
# Jupyter Notebook Checkpoints
.ipynb_checkpoints/
# Python Cache
__pycache__/
*.pyc
*.pyo
*.pyd
# Environment File
.env
*.env
# Dataset (Parquet & CSV Files)
*.parquet
*.csv
# Python Package (utils)
utils/
# Local/Temporary Files
*.log
*.tmp
*.bak
# Version Control Files
.git/
.gitignore
# IDE/Editor Configs
.vscode/
.idea/
.DS_Store
# Python Package Build Artifacts
*.egg-info/
build/
dist/

- A Docker image is essentially a read-only template that contains everything needed to run an application.
- You can think of a Docker image as a blueprint or snapshot of an environment. It doesn't run anything.
docker build -t your_image_name .

- When you run a Docker image, it becomes a Docker container.
- It is a live instance of that image, running your application in an isolated environment.
docker run --env-file .env -p 8000:8000 your_image_name \
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

docker run --env-file .env -p 8000:8000 your_image_name

- After the container starts, you can access your API.
http://localhost:8000
or
http://127.0.0.1:8000

- Once your Docker image is ready, you can push it to Docker Hub.
- It allows anyone to pull and run it without building it themselves.
Access the Docker Hub here or Click on the Image below.
- Prompts you to enter your Docker Hub username and password.
- This authenticates your local Docker client with your Docker Hub account.
docker login

- Tagging prepares the image for upload to Docker Hub.

docker tag your_image_name your-dockerhub-username/your_image_name:latest

- Uploads your image to your Docker Hub repository.
- Once pushed, your image is publicly available.
- Anyone can now pull and run the image without building it locally.
docker push your-dockerhub-username/your_image_name:latest

- Once pushed, anyone can pull your image from Docker Hub and run it.
- This ensures that the application behaves the same way across all systems.
docker pull your-dockerhub-username/your_image_name:latest

- After pulling the Docker image, you can run it to create a Docker container from it.

docker run --env-file .env -p 8000:8000 your-dockerhub-username/your_image_name:latest

- Lists all the running containers with their `container_id`.

docker ps

- Stops the running container safely.
- The `container_id` can be obtained from the `docker ps` output.

docker stop container_id

Follow these steps carefully to deploy your FastAPI application on Render :
- Link your GitHub Repository / Existing Docker Image
- Add details about your API
- Add Environment Variables in the Render Dashboard (same as `.env`)
- Deploy the Web Service
The frontend application files are in the project root :
- `index.html` → This file defines the structure and layout of the web page.
- `style.css` → This file handles the visual appearance of the web page.
- `script.js` → This file communicates between the web page and the API.
You can open index.html directly in your browser or serve it via a local HTTP server (like VS Code Live Server).
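One optional alternative to Live Server is Python's built-in static file server, run from the project root (the port here is arbitrary, just keep it different from the API's port 8000) :

```python
# Run from a terminal in the project root, then open http://localhost:5500
# python -m http.server 5500
```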
Note
Remember to update the API URL in script.js when deploying on GitHub Pages to get real-time predictions.
Change from :
const fetchPromise = fetch("http://127.0.0.1:8000/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(data),
});

To :
const fetchPromise = fetch("https://your_api_name.onrender.com/predict", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(data),
});

Access the live Website here or Click on the Image below.
Important
The API for this project is deployed using the free tier on Render.
As a result, it may go to sleep after periods of inactivity.
Please start the API first by visiting the API URL. Then, navigate to the website to make predictions.
If the API was inactive, the first prediction may take a few seconds while the server spins back up.
Click Here to view Code Snippet
# Importing load_parquet function from read_data module
from read_data import load_parquet
cars = load_parquet('clean_data', 'clean_data_after_eda.parquet')
cars.head()

Click Here to view Code Snippet
# Creating Features and Target Variable
X = cars.drop('price', axis=1)
y = cars['price']
# Splitting Data into Training and Testing Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Click Here to view Code Snippet
# Pipeline for Nominal Column
nominal_cols = ['fuel_type','transmission','brand']
nominal_trf = Pipeline(steps=[
('ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

# Pipeline for Ordinal Column
ordinal_cols = ['owner']
ordinal_categories = [['Others','3rd owner','2nd owner','1st owner']]
ordinal_trf = Pipeline(steps=[
('oe', OrdinalEncoder(categories=ordinal_categories))
])

# Pipeline for Numerical Column
numerical_cols = ['km_driven','year','engine_capacity']
numerical_trf = Pipeline(steps=[
('scaler', RobustScaler())
])

# Adding Everything into ColumnTransformer
ctf = ColumnTransformer(transformers=[
('nominal', nominal_trf, nominal_cols),
('ordinal', ordinal_trf, ordinal_cols),
('scaling', numerical_trf, numerical_cols)
], remainder='passthrough', n_jobs=-1)

Click Here to view Code Snippet
# Models Dictionary
models = {
'LR' : LinearRegression(n_jobs=-1),
'KNN' : KNeighborsRegressor(n_jobs=-1),
'DT' : DecisionTreeRegressor(random_state=42),
'RF' : RandomForestRegressor(random_state=42, n_jobs=-1),
'GB' : GradientBoostingRegressor(random_state=42),
'XGB' : XGBRegressor(random_state=42, n_jobs=-1)
}

# Computing Average Error and R2-Score through Cross-Validation
results = {}
for name, model in models.items():
pipe = Pipeline(steps=[
('preprocessor', ctf),
('model', model)
])
k = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(estimator=pipe, X=X_train, y=y_train, cv=k, scoring={'mae':'neg_mean_absolute_error','r2':'r2'}, n_jobs=-1, return_train_score=False)
results[name] = {'avg_error': -cv_results['test_mae'].mean(),'avg_score': cv_results['test_r2'].mean()}
print()
print(f'Model : {name}')
print('-'*40)
print(f"Average Error : {-cv_results['test_mae'].mean():.2f}")
print(f"Standard Deviation of Error : {cv_results['test_mae'].std():.2f}")
print(f"Average R2-Score : {cv_results['test_r2'].mean():.2f}")
print(f"Standard Deviation of R2-Score : {cv_results['test_r2'].std():.2f}")

Model : LR
----------------------------------------
Average Error : 123190.02
Standard Deviation of Error : 6445.18
Average R2-Score : 0.77
Standard Deviation of R2-Score : 0.01
Model : KNN
----------------------------------------
Average Error : 115572.16
Standard Deviation of Error : 3883.19
Average R2-Score : 0.79
Standard Deviation of R2-Score : 0.00
Model : DT
----------------------------------------
Average Error : 118466.64
Standard Deviation of Error : 4490.62
Average R2-Score : 0.76
Standard Deviation of R2-Score : 0.03
Model : RF
----------------------------------------
Average Error : 90811.20
Standard Deviation of Error : 2335.09
Average R2-Score : 0.86
Standard Deviation of R2-Score : 0.01
Model : GB
----------------------------------------
Average Error : 98056.52
Standard Deviation of Error : 3001.29
Average R2-Score : 0.85
Standard Deviation of R2-Score : 0.01
Model : XGB
----------------------------------------
Average Error : 91595.94
Standard Deviation of Error : 2640.02
Average R2-Score : 0.86
Standard Deviation of R2-Score : 0.02
Click Here to view Code Snippet
# Plotting Metric Comparison Graph
results_df = pd.DataFrame(results)
fig, ax = plt.subplots(ncols=1, nrows=2, figsize=(12,8))
sns.barplot(x=results_df.iloc[0,:].sort_values().index.to_list(), y=results_df.iloc[0,:].sort_values().values, ax=ax[0])
ax[0].set_title('Average Error Comparison (Lower is Better)')
ax[0].set_ylabel('Error')
for container in ax[0].containers:
ax[0].bar_label(container, fmt='%.0f')
sns.barplot(x=results_df.iloc[1,:].sort_values().index.to_list(), y=results_df.iloc[1,:].sort_values().values, ax=ax[1])
ax[1].set_title('Average R2-Score Comparison (Higher is Better)')
ax[1].set_ylabel('R2-Score')
for container in ax[1].containers:
ax[1].bar_label(container, fmt='%.2f')
plt.tight_layout()
plt.show()
Click Here to view Code Snippet
# Assigning Base Model for StackingRegressor
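# (rf, xgb, gb, meta_model and k are assumed to be defined in earlier notebook cells)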
base_model = [('rf', rf),('xgb', xgb),('gb', gb)]
# Structure of StackingRegressor
stack = StackingRegressor(
estimators=base_model,
final_estimator=meta_model,
passthrough=False,
cv=k, n_jobs=-1
)
# Final Pipeline with StackingRegressor
pipe = Pipeline(steps=[
('preprocessor', ctf),
('model', stack)
])

# Average Error and R2-Score through Cross-Validation
cv_results = cross_validate(estimator=pipe, X=X_train, y=y_train, cv=k, scoring={'mae':'neg_mean_absolute_error','r2':'r2'}, n_jobs=-1)
print(f"Average Error : {-cv_results['test_mae'].mean():.2f}")
print(f"Standard Deviatacion of Error : {cv_results['test_mae'].std():.2f}")
print(f"Average R2-Score : {cv_results['test_r2'].mean():.2f}")
print(f"Standard Deviation of R2-Score : {cv_results['test_r2'].std():.2f}")Average Error : 87885.34
Standard Deviatacion of Error : 1279.54
Average R2-Score : 0.87
Standard Deviation of R2-Score : 0.01
| ![]() | ![]() |
|---|---|

| R2-Score Curve | Error Curve |
|---|---|
| ![]() | ![]() |
Click Here to view Code Snippet
# Parameter Distribution
param_dist = {
'model__rf__n_estimators': [200, 300],
'model__rf__max_depth': [10, 20],
'model__rf__min_samples_leaf': [3, 5],
'model__rf__min_samples_split': [5, 7],
'model__xgb__n_estimators': [200, 300],
'model__xgb__learning_rate': [0.05, 0.1],
'model__xgb__max_depth': [2, 4],
'model__xgb__subsample': [0.5, 0.75],
'model__xgb__colsample_bytree': [0.5, 0.75],
'model__gb__n_estimators': [100, 200],
'model__gb__learning_rate': [0.05, 0.1],
'model__gb__max_depth': [2, 4],
'model__gb__subsample': [0.5, 0.75],
'model__final_estimator__alpha': [0.1, 10.0],
'model__final_estimator__l1_ratio': [0.0, 1.0]
}
# RandomizedSearch Object with Cross-Validation
rcv = RandomizedSearchCV(estimator=pipe, param_distributions=param_dist, cv=k, scoring='neg_mean_absolute_error', n_iter=30, n_jobs=-1, random_state=42)
# Fitting the RandomizedSearch Object
rcv.fit(X_train, y_train)
# Best Estimator
best_model = rcv.best_estimator_

| Before Tuning | After Tuning |
|---|---|
| ![]() | ![]() |

| R2-Score Curve (Before Tuning) | R2-Score Curve (After Tuning) |
|---|---|
| ![]() | ![]() |

| Error Curve (Before Tuning) | Error Curve (After Tuning) |
|---|---|
| ![]() | ![]() |
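The best estimator is then serialized so the FastAPI app can load it at startup (see `models/pipe.pkl` in the folder structure). A minimal sketch, assuming `best_model` from the tuning step above :

```python
# Serializing the tuned pipeline so the API can load it at startup.
# The artifact name pipe.pkl follows the folder structure shown later;
# this is an illustrative sketch, not the notebook's verbatim code.
import pickle

with open('models/pipe.pkl', 'wb') as f:
    pickle.dump(best_model, f)
```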
- I wanted to use real-world data instead of a toy dataset, as it better represents messy, real-life scenarios.
- However, Cars24 loads its content dynamically using JavaScript, meaning a simple HTTP request is not enough.
- I used Selenium to simulate a real browser, ensuring the page was fully loaded before scraping.
- Once the content was rendered, I used BeautifulSoup to efficiently parse the HTML.
- This approach allowed me to reliably capture complete car details.
- The raw scraped dataset was large and consumed unnecessary memory.
- Loading it repeatedly during experimentation became slow and inefficient.
- I optimized memory usage by downcasting data types.
- I stored the dataset in Parquet format, which compresses data without losing information.
- This enabled much faster read/write performance compared to CSV (a small sketch of this optimization follows below).
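A small sketch of that optimization (illustrative; the exact column names and dtype choices in the data-cleaning notebook may differ) :

```python
# Illustrative sketch of the memory optimization: downcast dtypes after
# cleaning, then store the result as Parquet instead of CSV.
import pandas as pd

cars = pd.read_csv('scrape_data/scrape_data.csv')
# ... cleaning steps (parse km/price strings into numbers, fix types, handle nulls) ...

# Downcast wide numeric dtypes (int64/float64) to the smallest safe ones
for col in ['km_driven', 'engine_capacity', 'year', 'price']:
    cars[col] = pd.to_numeric(cars[col], downcast='integer')

# Low-cardinality text columns are much cheaper stored as 'category'
for col in ['brand', 'fuel_type', 'transmission', 'owner']:
    cars[col] = cars[col].astype('category')

print(f"Memory usage : {cars.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")

# Parquet is compressed and preserves dtypes, so reloads are fast
cars.to_parquet('clean_data/clean_data.parquet', index=False)
```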
- If preprocessing is applied to the entire dataset, test data can leak into the training process.
- This creates overly optimistic results and reduces the model's ability to generalize.
- I implemented Scikit-learn Pipeline and ColumnTransformer to apply preprocessing only on training data.
- This kept the test data completely unseen during preprocessing, preventing leakage.
- Even after building the ML pipeline, it remained offline and could only be used locally.
- There was no way to send inputs and get predictions over the web or from other applications.
- The model depended on the local system and could not serve predictions to external users or services.
- I deployed the ML model as an API using FastAPI.
- This allowed users and applications to send inputs and receive predictions in real time.
- I added a `/predict` endpoint for predictions and a `/health` endpoint for monitoring API status.
- I implemented input validation and rate limiting to prevent misuse and ensure stability under load.
- These improvements made the API accessible, reliable, and production-ready (a minimal endpoint sketch follows below).
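Continuing the `api/main.py` example above (where `pipe` and `settings` are already loaded), a minimal sketch of these endpoints; the request schema and the ±MAE price-range logic are assumptions, not the project's exact code :

```python
# api/main.py (continued) — illustrative sketch; the request schema and the
# ±MAE price-range logic are assumptions, not the project's verbatim code.
import pandas as pd
from pydantic import BaseModel


class CarInput(BaseModel):
    brand: str
    model: str
    km_driven: int
    engine_capacity: int
    fuel_type: str
    transmission: str
    year: int
    owner: str


@app.get("/")
def root():
    return {"message": "Pipeline is live"}


@app.get("/health")
def health():
    # Lightweight check used for monitoring / keeping the Render service warm
    return {"status": "ok"}


@app.post("/predict")
def predict(car: CarInput):
    # One-row DataFrame from the validated payload, fed to the trained pipeline
    X = pd.DataFrame([car.model_dump()])
    point = pipe.predict(X)[0]
    # Report a range of roughly ± MAE around the point prediction
    return {"output": f"₹{point - settings.MAE:,.0f} to ₹{point + settings.MAE:,.0f}"}
```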
- Even if the API works correctly, non-technical users may still find it difficult to test and use.
- This limits adoption and accessibility.
- I created an HTML/CSS/JS frontend that sends requests to the API and displays predictions instantly.
- I also included an example payload in Swagger UI so users can test the API with minimal effort.
- Installing dependencies and setting up the environment manually is time-consuming and error-prone.
- This becomes worse across different machines and operating systems.
- Sharing the project also becomes difficult, since others must replicate the exact setup.
- I created a multi-stage Dockerfile.
- It builds the FastAPI application, installs dependencies, and copies only required files into the final image.
- I used a `.dockerignore` file to exclude unnecessary files and keep the image lightweight.
- This allows the project to run consistently on any system with Docker installed.
- It eliminates dependency mismatches and OS-specific issues.
- Same Docker image can be used to deploy on Render, Docker Hub or run locally with a single docker command.
- Built and deployed an end-to-end ML pipeline using FastAPI to serve real-time used car price predictions.
- Reduced dataset memory usage by 90%, improving pipeline efficiency and lowering storage needs.
- Converted data to Parquet, reducing load time and speeding up processing.
- Evaluated multiple regression models using cross-validation for fair comparison.
- Selected the best-performing model(s) to improve prediction accuracy and consistency.
- Achieved 30% lower MAE and 12% higher R2 score, improving prediction quality.
- Better accuracy helped support more competitive and confident pricing decisions.
- Reduced prediction error variance by 70%, ensuring stable and reliable predictions.
- More consistent predictions reduced pricing mistakes and built customer trust.
AutoIQ/
|
├── api/ # FastAPI Code for making Predictions
│ ├── main.py
│ └── config.py
│
├── clean_data/ # Cleaned Dataset (Parquet Format)
│ └── clean_data.parquet
| └── ...
│
├── images/ # Images for Frontend Interface
│ ├── favicon.png
│ └── hero_image.png
│
├── models/ # Serialized Components for Prediction
│ ├── pipe.pkl
│ └── model_freq.pkl
│
├── notebooks/ # Jupyter Notebooks for Project Development
│ └── data_cleaning.ipynb
| └── ...
│
├── scrape_code/ # Web Scraping Notebook
│ └── scrape_code.ipynb
│
├── scrape_data/ # Scraped Dataset (CSV Format)
│ └── scrape_data.csv
│
├── utils/ # Reusable Python Functions (utils Package)
│ ├── __init__.py
│ ├── web_scraping.py
│ └── helpers.py
| └── ...
│
├── .dockerignore # All files and folders ignored by Docker while building Docker Image
├── .gitignore # All files and folders ignored by Git while pushing code to GitHub
├── Dockerfile # Instructions for building the Docker Image
├── index.html # Frontend HTML File
├── style.css # Frontend CSS File
├── script.js # Frontend JS File
├── requirements.txt # List of required libraries for the Project
├── LICENSE # License specifying permissions and usage rights
└── README.md # Detailed documentation of the Project
This project is licensed under the MIT License. You are free to use and modify the code as needed.


















