- Problem Statement
- Overview
- How it Works
- Workflow
- Features
- Setup
- Folder Structure
- Challenges & Solutions
- Impact
- Future Improvements
- License
- With the rise of streaming services, viewers now have access to thousands of movies across platforms.
- As a result, many viewers spend more time browsing than actually watching.
- This problem can lead to frustration, lower satisfaction, and reduced watch time on the platform.
- Ultimately, this impacts both user experience and business performance.
- Built a content-based movie recommender system with a modular design and proper version control.
- It processes metadata from 5,000+ movies to recommend the top 5 similar movies based on a user-selected title.
- The system uses `CountVectorizer` for text vectorization and `cosine_similarity` to compute movie similarity.
- The dataset contains metadata for each movie, including title, keywords, genres, cast, crew, and overview.
- All the features are combined into a new column called `tags` to create a unified representation for each movie.
- Text preprocessing is applied to the `tags` column :
  - All text is converted to lowercase (e.g., `"Action, Thriller"` becomes `"action, thriller"`).
  - Spaces between words are removed (e.g., `"action movie"` becomes `"actionmovie"`).
  - Stemming is performed to reduce words to their root form (e.g., `"running"` becomes `"run"`).
- `CountVectorizer` is used to convert the `tags` column into numerical feature vectors.
- `cosine_similarity` is used to calculate the similarity between the vector representations of all the movies.
- The resulting similarity matrix is serialized and saved as a `.pkl` file for efficient loading during recommendation.
- A Streamlit web application is built to provide an interactive interface for movie selection and recommendation :
  - The user selects a movie from the dropdown list.
  - The system recommends the top 5 most similar movies based on the similarity scores.
  - Movie posters are fetched using the TMDB API to enhance the visual appeal of the recommendations.
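The end-to-end logic described above can be sketched as follows. This is a simplified illustration rather than the project's exact code: the input path, `max_features` value, and function names are assumptions, while the core steps (stemming, `CountVectorizer`, `cosine_similarity`, pickling the matrix, and a top-5 lookup) mirror the description above.

```python
import pickle
import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed: a preprocessed DataFrame with 'title' and 'tags' columns (path is illustrative)
movies = pd.read_csv("clean_data/movies.csv")

# Stem every word in the tags so related forms map to the same token
stemmer = PorterStemmer()
movies["tags"] = movies["tags"].apply(
    lambda text: " ".join(stemmer.stem(word) for word in text.split())
)

# Convert tags into count vectors and compute pairwise cosine similarity
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
vectors = vectorizer.fit_transform(movies["tags"]).toarray()
similarity = cosine_similarity(vectors)

# Serialize the similarity matrix for fast loading in the Streamlit app
with open("similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)

def recommend(title, top_n=5):
    """Return the top_n most similar movie titles for a given title."""
    idx = movies[movies["title"] == title].index[0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [movies.iloc[i]["title"] for i, _ in scores[1 : top_n + 1]]
```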
flowchart LR
subgraph DP["Data Preparation"]
direction TB
A["Movie metadata dataset"] --> B["Clean and preprocess data"] --> C["Extract features"] --> D["Generate tags"]
end
subgraph FE["Feature Engineering"]
direction TB
E["Text preprocessing"] --> F["Vectorize tags (CountVectorizer)"]
end
subgraph RE["Recommendation Engine"]
direction TB
G["Compute similarity (cosine_similarity)"] --> H["Save similarity matrix (.pkl)"]
end
subgraph APP["Application Layer"]
direction TB
I["Streamlit app loads .pkl"] --> J["User selects a movie"] --> K["Recommend Top 5 movies"] --> L["Fetch posters (TMDB API)"] --> M["Display recommendations"]
end
DP --> FE --> RE --> APP
%% Minimal dark mode styling (GitHub-safe)
classDef block fill:#0d1117,stroke:#30363d,stroke-width:1px,color:#ffffff;
classDef step fill:#161b22,stroke:#8b949e,stroke-width:1px,color:#ffffff;
class A,B,C,D,E,F,G,H,I,J,K,L,M step;
class DP,FE,RE,APP block;
- The project follows a modular approach by organizing modules into a dedicated `utils/` directory.
- Each module in the `utils/` directory is responsible for a specific task and includes :
  - Clear docstrings explaining functionality, expected inputs/outputs, returns, and raises.
  - Robust exception handling for better debugging and maintainability.
- Following the DRY (Don't Repeat Yourself) principle, this design :
  - Reuses functions across notebooks and scripts without rewriting code.
  - Saves development time and reduces redundancy.
- The `utils/` directory also includes an `__init__.py` file, which serves a few important purposes in Python :
  - The `__init__.py` file tells Python to treat the directory as a package, not just a regular folder.
  - Without it, Python won't recognize the folder as a package.
- To access these utility modules anywhere in the project, add the following snippet at the top of your script :
import sys, os
sys.path.append(os.path.abspath("../utils"))

- This is one of the functions I added to my project as the `export_data.py` module in the `utils/` directory.
View Example Function
import os
import pandas as pd

def export_as_csv(dataframe, folder_name, file_name):
    """
    Exports a pandas DataFrame as a CSV file to a specified folder.

    Parameters:
        dataframe (pd.DataFrame): The DataFrame to export.
        folder_name (str): Name of the folder where CSV file will be saved.
        file_name (str): Name of the CSV file. Must end with ".csv" extension.

    Returns:
        None

    Raises:
        TypeError: If input is not a pandas DataFrame.
        ValueError: If file_name does not end with ".csv" extension.
        FileNotFoundError: If folder does not exist.
    """
    try:
        if not isinstance(dataframe, pd.DataFrame):
            raise TypeError("Input must be a pandas DataFrame.")
        if not file_name.lower().endswith(".csv"):
            raise ValueError("File name must end with '.csv' extension.")
        current_dir = os.getcwd()
        parent_dir = os.path.dirname(current_dir)
        folder_path = os.path.join(parent_dir, folder_name)
        file_path = os.path.join(folder_path, file_name)
        if not os.path.isdir(folder_path):
            raise FileNotFoundError(f"Folder '{folder_name}' does not exist.")
        dataframe.to_csv(file_path, index=False)
        print(f"Successfully exported the DataFrame as '{file_name}'")
    except TypeError as e:
        print(e)
    except ValueError as e:
        print(e)
    except FileNotFoundError as e:
        print(e)

- Instead of hardcoding file paths, the project uses Python's built-in `os` module to handle paths dynamically.
- This improves code flexibility, ensuring it runs smoothly across different systems and environments.
- Automatically adapts to the system's directory structure.
- Prevents `FileNotFoundError` caused by rigid, hardcoded paths.
- Makes deployment and collaboration easier without manual path updates.
View Example Function
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
folder_path = os.path.join(parent_dir, folder_name)
file_path = os.path.join(folder_path, file_name)

- Integrated `nbstripout` with Git to automatically remove Jupyter notebook outputs before committing.
- It helps maintain a clean and readable commit history by :
  - Avoiding large, unreadable diffs caused by cell outputs.
  - Keeping only code and markdown content under version control.
- Especially useful when pushing to remote repositories, as it reduces clutter and improves readability.
- Install it in your virtual environment using `pip`.

pip install nbstripout

- This sets up a Git filter to strip notebook output automatically on commits.

nbstripout --install

- Commit a notebook and confirm that outputs are removed from the committed version.
- The project uses Streamlit's `st.secrets` feature to securely manage the TMDB API key during development.
- A `secrets.toml` file is placed inside the `.streamlit/` directory, storing the API key in the following format :

[tmdb]
api_key = "your_api_key_here"

- The API key is accessed in code using :

api_key = st.secrets["tmdb"]["api_key"]

Caution
The secrets.toml file should not be pushed to a public repository to avoid exposing sensitive credentials.
You can add it to .gitignore to ensure it's excluded from version control.
When deploying to Streamlit, the API key must be added via the GUI, not through the secrets.toml file.
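As an illustration of how the stored key is then used, a minimal poster-fetching sketch is shown below. The function name and parameters are illustrative; the endpoint and image base URL follow TMDB's public v3 API conventions.

```python
import requests
import streamlit as st

def fetch_poster(movie_id):
    """Fetch a full poster URL for a TMDB movie ID using the API key from st.secrets."""
    api_key = st.secrets["tmdb"]["api_key"]
    url = f"https://api.themoviedb.org/3/movie/{movie_id}"
    response = requests.get(url, params={"api_key": api_key}, timeout=10)
    response.raise_for_status()
    poster_path = response.json().get("poster_path")
    # TMDB serves images from a separate CDN base URL; w500 is one of its standard sizes
    return f"https://image.tmdb.org/t/p/w500{poster_path}" if poster_path else None
```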
- In the project, a similarity matrix is computed to recommend movies.
- However, due to its high dimensionality, the matrix becomes very large and exceeds GitHub's size limitations.
- GitHub restricts uploads larger than 100MB in public repositories, making it unsuitable for storing large files.
- While Git LFS (Large File Storage) is one option, it can be complex to configure and manage.
- To address this issue, the matrix file is :
  - Uploaded to Google Drive.
  - Downloaded at runtime using the `gdown` library.
  - Stored locally on the Streamlit server when the app runs.
- This approach ensures :
  - Compatibility with GitHub without needing Git LFS.
  - Hassle-free experience for cloning the repository or running the app across different environments.
View Example Function
import os
import gdown
import pickle
# Step 1 : Define the Google Drive file ID
file_id = "your_file_id_here"
# Step 2 : Set the desired file name for the downloaded file
output = "similarity.pkl"
# Step 3 : Construct the direct download URL from the file ID
url = f"https://drive.google.com/uc?id={file_id}"
# Step 4 : Check if the file already exists locally
# If not, download it from Google Drive using gdown
if not os.path.exists(output):
    gdown.download(url, output, quiet=False)

# Step 5 : Open the downloaded file in read binary mode
# and load the similarity matrix using pickle
with open("similarity.pkl", "rb") as f:
    similarity = pickle.load(f)

Follow these steps carefully to set up and run the project on your local machine :
First, you need to clone the project from GitHub to your local system.
git clone https://github.com/themrityunjaypathak/Pickify.git

To avoid version conflicts and keep your project isolated, create a virtual environment.

On Windows :

python -m venv .venv

On macOS/Linux :

python3 -m venv .venv

After setting up the virtual environment, activate it to begin installing dependencies.

On Windows :

.\.venv\Scripts\activate

On macOS/Linux :

source .venv/bin/activate

Now, install all the required libraries inside your virtual environment using the requirements.txt file.

pip install -r requirements.txt

Tip
It's a good idea to upgrade pip before installing dependencies to avoid compatibility issues.

pip install --upgrade pip

Note
The .streamlit/ folder contains Streamlit configuration settings.
However, you only need to include it in your project if your app requires custom configuration or secrets.
- The `config.toml` file contains configuration settings such as server settings, theme preferences, etc.
[theme]
base="dark"
primaryColor="#FD3A84"
backgroundColor="#020200"

- The `secrets.toml` file contains sensitive information like API keys, database credentials, etc.

[tmdb]
api_key = "your_tmdb_api_key_here"

Important
Make sure not to commit your secrets.toml to GitHub or any public repositories.
You can add it to .gitignore to ensure it's excluded from version control.
After everything is set up, you can run the Streamlit application :
streamlit run app.py

Once you're done working, you can deactivate your virtual environment :

deactivate

Pickify/
|
├── .streamlit/ # Streamlit Configuration Files
├── raw_data/ # Original Dataset
├── clean_data/ # Preprocessed and Cleaned Dataset
├── notebooks/ # Jupyter Notebooks for Preprocessing and Vectorization
├── images/ # Images used in the Streamlit Application
├── utils/ # Modular Python Scripts
├── app.py # Main Streamlit Application
├── requirements.txt # List of required libraries for the Project
├── README.md # Detailed documentation of the Project
├── LICENSE # License specifying permissions and usage rights
└── .gitignore # All files and folders excluded from Git Tracking
- Challenge : Jupyter notebook outputs created large, noisy diffs and cluttered the commit history.
  - Solution : Integrated `nbstripout` with Git to strip Jupyter notebook outputs automatically before commits.
- Challenge : The serialized similarity matrix exceeded GitHub's 100MB file size limit.
  - Solution : Used Google Drive to host the serialized similarity matrix and downloaded it with `gdown` at runtime.
- Challenge : The TMDB API key could not be committed to a public repository.
  - Solution : Used Streamlit's `st.secrets` to securely store and access TMDB API credentials.
- Challenge : Preprocessing code needed to be reused across notebooks and scripts without duplication.
  - Solution : Structured the project with modular scripts inside the `utils/` package.
- Challenge : Hardcoded file paths broke when running the project on different systems.
  - Solution : Used Python's `os` module for dynamic and platform-independent path handling.
- Deployed the system as a Streamlit web app, used by 100+ users to discover personalized recommendations.
- Increased user engagement and watch time by delivering recommendations in under 3 seconds.
- Reduced the time users spend choosing what to watch by instantly recommending the top 5 similar movies.
- Currently, tags are generated equally from cast, crew, keywords, genres, and overview.
- We can improve this by applying feature importance or weighting certain features.
- This can be done by repeating or up-weighting tokens from the more important columns before vectorization, as in the sketch below.
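A minimal sketch of this weighting idea, using made-up movie data and an arbitrary weight; repeating a column's tokens before vectorization effectively multiplies their counts in the `CountVectorizer` output.

```python
import pandas as pd

# Made-up example row; in the project these values would come from the cleaned dataset
movies = pd.DataFrame(
    {
        "title": ["Inception"],
        "overview": ["a thief steals secrets through dreams"],
        "genres": ["sciencefiction action"],
        "cast": ["leonardodicaprio"],
    }
)

GENRE_WEIGHT = 3  # illustrative weight

# Repeat the genres tokens GENRE_WEIGHT times so they count more after vectorization
movies["tags"] = (
    movies["overview"] + " " + (movies["genres"] + " ") * GENRE_WEIGHT + movies["cast"]
)

print(movies.loc[0, "tags"])
```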
- Add user-based data to provide more personalized recommendations.
- Collaborative filtering can suggest movies based on similar user behaviour.
- This will make the recommender system more user-centric, as illustrated by the sketch below.
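A toy item-based collaborative filtering sketch with made-up user ratings; none of this data or code is part of the current project, it only illustrates the idea.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings matrix (rows = users, columns = movies); 0 means "not rated"
ratings = pd.DataFrame(
    {"Inception": [5, 4, 0], "Interstellar": [4, 5, 2], "Titanic": [0, 1, 5]},
    index=["user_1", "user_2", "user_3"],
)

# Item-item collaborative filtering: two movies are similar if users rate them similarly
item_similarity = pd.DataFrame(
    cosine_similarity(ratings.T), index=ratings.columns, columns=ratings.columns
)

def recommend_for_user(user, top_n=2):
    """Score unseen movies by similarity-weighted ratings of the movies the user has seen."""
    user_ratings = ratings.loc[user]
    scores = item_similarity.mul(user_ratings, axis=0).sum(axis=0)
    scores = scores[user_ratings == 0]  # only recommend movies the user has not rated
    return scores.sort_values(ascending=False).head(top_n)

print(recommend_for_user("user_1"))
```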
- Fetch movie data from external sources to ensure the movie database is always up to date.
- This would allow recommending the latest releases and removing outdated movies automatically.
- Instead of relying only on cosine similarity over raw counts, we can experiment with other similarity measures and text representations, such as Jaccard similarity, TF-IDF weighting, or Word2Vec embeddings, to capture more of the semantic meaning in movie descriptions (a small TF-IDF sketch follows).
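For example, swapping in TF-IDF is a small change on top of the existing pipeline. A minimal sketch with illustrative tags :

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative tags for three movies; in the project these would come from the tags column
tags = [
    "dream heist sciencefiction thriller",
    "space travel sciencefiction drama",
    "romance ship disaster drama",
]

# TF-IDF down-weights tokens that appear in almost every movie's tags,
# so similarity is driven more by distinctive terms than by very common ones
vectors = TfidfVectorizer().fit_transform(tags)
similarity = cosine_similarity(vectors)

print(similarity.round(2))
```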
- Enhance the user experience by providing filters to choose movies based on genres, actors, or directors.
This project is licensed under the MIT License. You are free to use and modify the code as needed.
