IMDB Sentiment Analysis: Classic ML vs Deep Learning

Binary sentiment classification on 50,000 IMDB movie reviews.
Compares four classic ML models (TF-IDF features) against an LSTM deep learning model and a pre-trained Transformer baseline.

Course project: AI/ML Compulsory Activity
Dataset: IMDB 50K Movie Reviews


Project Overview

We build a binary sentiment classifier that labels IMDB movie reviews as positive or negative.
The project progresses from interpretable classic ML baselines to a sequence-aware deep learning model, and finishes with a zero-training pre-trained Transformer comparison.

Key questions we answer:

  • Can a simple TF-IDF + Logistic Regression baseline match a neural network?
  • What does an LSTM capture that bag-of-words models cannot?
  • How does a purpose-trained model compare to a pre-trained Transformer out of the box?

Repository Structure

imdb-sentiment-analysis/
├── README.md
├── requirements.txt              # All dependencies (see Setup)
├── .gitignore
│
├── data/                              # Not committed (see Dataset section below)
│   ├── IMDB Dataset.csv               # Raw 50K reviews (download from Kaggle)
│   └── imdb_preprocessed.csv          # Generated by notebook 02
│
├── notebooks/
│   ├── 01_data_exploration.ipynb      # EDA: class balance, review lengths, word frequencies
│   ├── 02_preprocessing.ipynb         # Text cleaning pipeline walkthrough
│   ├── 03_classic_ml_models.ipynb     # TF-IDF + LogReg, Naive Bayes, SVM, Random Forest
│   ├── 04_deep_learning_model.ipynb   # Word embeddings + LSTM (Keras/TensorFlow)
│   └── 05_model_comparison.ipynb      # Side-by-side evaluation + pre-trained baseline
│
├── src/
│   ├── __init__.py
│   ├── config.py                 # Shared paths and constants
│   ├── data_loader.py            # Load CSV, apply preprocessing, train/val/test split
│   ├── preprocessing.py          # HTML removal, lowercasing, stopwords, lemmatisation
│   ├── features.py               # TF-IDF vectoriser + Keras tokeniser/padding
│   ├── classic_models.py         # Train and evaluate LogReg, NB, SVM, Random Forest
│   ├── dl_model.py               # Build, train, and evaluate LSTM model
│   └── evaluation.py             # Metrics, confusion matrices, comparison tables
│
├── models/                       # Saved model files (git-ignored, generated by notebooks)
│   └── .gitkeep                  # (tfidf_vectorizer, logistic_regression, linear_svm,
│                                 #  naive_bayes, random_forest, lstm_final, tokenizer_lstm)
│
├── results/
│   ├── figures/                  # Generated plots (14 PNG files)
│   └── metrics/                  # Saved metric CSVs (6 files)
│
├── report/
│   └── report.pdf                # Final written report
│
├── presentation/
│   └── slides.pdf                # Presentation slides
│
├── reflections/
│   ├── ls_reflection.pdf
│   └── cws_reflection.pdf
│
├── app/                          # Streamlit presentation app
│   ├── app.py                    # Entry point: streamlit run app/app.py
│   ├── _shared.py                # Shared loaders and formatting helpers
│   └── pages/
│       ├── 1_Preprocessing.py    # Interactive preprocessing demo
│       ├── 2_Classic_ML.py       # Feature engineering and classic model results
│       ├── 3_Deep_Learning.py    # LSTM architecture, training curves, results
│       ├── 4_Model_Comparison.py # Full comparison, error analysis, DistilBERT mismatch
│       └── 5_Live_Demo.py        # Real-time prediction from all 5 models
│
├── run_pipeline.py               # Runs all notebooks sequentially (with --only / --from flags)
├── download_nltk_data.py         # One-time NLTK corpora download
└── COLAB_SETUP.md                # Google Colab setup guide (GPU acceleration)

Setup

Step 1 - Install Python 3.11

This project requires Python 3.11 exactly.

Why 3.11 specifically?
tensorflow-cpu (used for the LSTM model) does not have a release for Python 3.12 or 3.13 yet.
Python 3.11 has pre-built wheels for every package in this project, meaning nothing needs to compile from source - installation is fast and requires no C compiler.

Download: python-3.11.9-amd64.exe (Windows 64-bit)

During installation - important settings:

If you are running the installer for the first time (no other Python installed):

  • Check "Add Python to PATH"
  • Click Install Now

If you already have another Python version installed (e.g. 3.12 or 3.13):

  • Verify that "Add Python to PATH" is unchecked

  • Click Customize installation

  • On the Advanced Options screen, uncheck "Add Python to environment variables"

  • Leave the install location as the default (C:\Users\<you>\AppData\Local\Programs\Python\Python311)

  • Click Install

    This keeps your existing Python version as the system default and lets you target 3.11 explicitly using the py launcher (see Step 3 below).

Verify the installation by opening a new terminal and running:

py --list

You should see both versions listed, for example:

-V:3.13 *  Python 3.13.x    <- your existing default (untouched)
-V:3.11    Python 3.11.9    <- newly installed

Step 2 - Clone the repository

git clone https://github.com/Lidizz/imdb-sentiment-analysis.git
cd imdb-sentiment-analysis

Step 3 - Create a virtual environment using Python 3.11

A virtual environment isolates this project's packages from the rest of your system.

If Python 3.11 is your only / default Python:

python -m venv .venv

If you have multiple Python versions installed (e.g. also 3.12 or 3.13):

Use the py launcher to target 3.11 explicitly; otherwise the venv will be created with the wrong Python version:

py -3.11 -m venv .venv

Verify the venv is using 3.11 before continuing:

# Activate first
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # Mac / Linux

# Then check
python --version
# Expected output: Python 3.11.9

If the output shows any other version, delete the .venv folder and repeat this step using py -3.11 -m venv .venv.


Step 4 - Install dependencies

pip install -r requirements.txt

This will download and install all packages. Expected install time is 5–15 minutes depending on your connection.

Install size: roughly 1–1.5 GB total.
The two largest packages are tensorflow-cpu (~500 MB) and the Hugging Face model weights (~250 MB).


Step 5 - Download NLTK data (one-time)

Run the following script from the project root:

python download_nltk_data.py

This downloads the stopword list and WordNet lemmatizer data used in the text preprocessing pipeline (~20 MB total, saved to your home directory).
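The script presumably boils down to a couple of `nltk.download` calls, along these lines (an assumed sketch, not the script's verbatim contents):

```python
# Assumed sketch of download_nltk_data.py: fetch the corpora used by the
# preprocessing pipeline. Call download_all() once; data lands in ~/nltk_data.
CORPORA = ["stopwords", "wordnet"]

def download_all():
    import nltk  # imported lazily so the sketch can be read without NLTK installed
    for name in CORPORA:
        nltk.download(name)
```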


Dataset

This project uses the IMDB Dataset of 50K Movie Reviews.

Download instructions:

  1. Go to https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
  2. Click Download and save the dataset as a zip archive (27 MB)
  3. Extract the file IMDB Dataset.csv from the archive
  4. Place it in the data/ folder

The dataset is not committed to this repository.

Dataset summary:

| Property | Value |
| --- | --- |
| Total reviews | 50,000 |
| Columns | review (text), sentiment (positive/negative) |
| Source | Kaggle - Lakshmi Srinivas |
| Original paper | Maas et al. (2011), ACL |
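Once IMDB Dataset.csv is in place, a quick pandas sanity check might look like this (the path and column names follow the summary above; `summarize` is an illustrative helper, not project code):

```python
# Sanity-check the downloaded CSV: row count and class balance.
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return the row count and per-class counts for the sentiment column."""
    return {"rows": len(df), "classes": df["sentiment"].value_counts().to_dict()}

# df = pd.read_csv("data/IMDB Dataset.csv")   # uncomment once downloaded
df = pd.DataFrame({"review": ["good", "bad"],  # tiny stand-in for the demo
                   "sentiment": ["positive", "negative"]})
print(summarize(df))
```

With the real file you should see 50,000 rows split evenly between the two classes.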

Presentation App

An interactive Streamlit app that walks through the full project: preprocessing demo, model results, comparison charts, and a live sentiment prediction demo.

streamlit run app/app.py

Pages: Overview → Preprocessing → Classic ML → Deep Learning → Model Comparison → Live Demo

Requires the notebooks to have been run first so models and result CSVs exist in models/ and results/.


Usage - Run Notebooks in Order

(Optional) Start Jupyter:
NOTE: if you run notebooks inside an IDE such as PyCharm, or an editor such as Visual Studio Code with the Jupyter extension, you do not need to start the Jupyter server manually; the IDE handles it. Otherwise, the command below opens the Jupyter interface in your browser.

jupyter notebook

Then run the notebooks in sequence:

| # | Notebook | What it does |
| --- | --- | --- |
| 1 | 01_data_exploration.ipynb | Load data, check class balance, plot review length distributions, word frequency analysis |
| 2 | 02_preprocessing.ipynb | Step-by-step text cleaning: HTML removal → lowercasing → stopwords → lemmatisation |
| 3 | 03_classic_ml_models.ipynb | TF-IDF feature extraction, train and evaluate LogReg / Naive Bayes / SVM / Random Forest |
| 4 | 04_deep_learning_model.ipynb | Word embeddings, LSTM architecture, training curves, evaluation on test set |
| 5 | 05_model_comparison.ipynb | All models side-by-side, ROC curves, error analysis, pre-trained Transformer baseline |
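To give a flavour of notebook 04, the LSTM might be assembled along these lines (layer sizes and vocabulary size are assumptions; see the notebook and src/dl_model.py for the actual architecture):

```python
# Illustrative LSTM sentiment model in Keras: trainable embeddings feeding a
# single LSTM layer and a sigmoid output for binary classification.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=64),  # trainable word embeddings
    layers.LSTM(64),                                   # sequence-aware encoder
    layers.Dense(1, activation="sigmoid"),             # positive/negative probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```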

Each notebook imports reusable logic from src/, so no code is duplicated between notebooks.

  • NOTE: The EDA notebook (01_data_exploration.ipynb) is a lightweight standalone.

Workflow: notebooks vs src/

  • src/ = implementation layer

    • Contains reusable functions (preprocessing, features, training, evaluation)
    • Acts as the single source of truth for project logic
  • notebooks/ = presentation + analysis layer

    • Demonstrates each pipeline stage with visuals, before/after examples, and commentary
    • Calls functions from src/ instead of redefining them inline
  • Users

    • Run notebooks in order and inspect outputs
    • No need to manually edit files in src/ for normal usage
  • Why this structure is used

    • Better maintainability (one fix in src/ applies everywhere)
    • Better reproducibility (same logic reused across notebooks)

Models Implemented

| Model | Approach | Features |
| --- | --- | --- |
| Logistic Regression | Classic ML | TF-IDF (unigrams + bigrams) |
| Multinomial Naive Bayes | Classic ML | TF-IDF |
| Linear SVM | Classic ML | TF-IDF |
| Random Forest | Ensemble ML | TF-IDF + GridSearchCV tuning |
| LSTM | Deep Learning | Trainable word embeddings |
| Pre-trained Transformer | Zero-shot baseline | Hugging Face distilbert-base-uncased-finetuned-sst-2-english |
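The strongest classic baseline boils down to a short scikit-learn pipeline. The sketch below uses toy data and default hyperparameters for illustration; see src/classic_models.py for the real training code:

```python
# Minimal TF-IDF + Logistic Regression sketch with unigram + bigram features,
# matching the feature setup in the table above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["loved great film", "terrible boring plot", "great acting", "boring waste"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["great film"]))
```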

Text preprocessing pipeline (applied to all models):

  1. Remove HTML tags (<br /> etc.)
  2. Lowercase
  3. Remove punctuation and digits
  4. Remove stopwords (NLTK English list)
  5. Lemmatise (WordNet Lemmatizer)
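The five steps above can be sketched as a single function. To keep the sketch dependency-free it uses a tiny illustrative stopword set and omits lemmatisation; the real pipeline in src/preprocessing.py uses NLTK's full English stopword list and the WordNet lemmatizer:

```python
import re

# Tiny illustrative stopword set; the real pipeline uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "i", "it", "is", "was", "and"}

def clean_review(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # 1. strip HTML tags like <br />
    text = text.lower()                          # 2. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)        # 3. drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]  # 4. stopwords
    # 5. lemmatisation: the real pipeline applies NLTK's WordNetLemmatizer here
    return " ".join(tokens)

print(clean_review("I LOVED it!<br />A 10/10 film."))  # → loved film
```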

Results

All models are evaluated on the same held-out test set (7,500 rows; a 15% stratified split with random_state=42).

| Model | Accuracy | F1-Score | Notes |
| --- | --- | --- | --- |
| DistilBERT (raw text) | ~89.3% | ~0.890 | Zero-shot inference, no fine-tuning |
| Logistic Regression (TF-IDF) | ~89.2% | ~0.893 | Best trained model |
| LSTM (Keras) | ~88.7% | ~0.888 | Sequence-aware deep learning |
| Linear SVM (TF-IDF) | ~88.7% | ~0.887 | Close to LogReg |
| Naive Bayes (TF-IDF) | ~86.9% | ~0.871 | Fast probabilistic baseline |
| Random Forest (TF-IDF) | ~85.7% | ~0.857 | GridSearchCV tuned |
| DistilBERT (preprocessed text) | ~77.4% | ~0.727 | Preprocessing mismatch (see notebook 05) |

Key finding: A simple TF-IDF + Logistic Regression (89.2%) matches a zero-shot DistilBERT transformer (89.3%) on this task — showing that word-choice signal dominates for IMDB sentiment. DistilBERT's low score on preprocessed text (77.4%) reflects input format incompatibility, not architectural weakness.
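The held-out split can be reproduced with scikit-learn. The sketch below uses dummy data; the project's src/data_loader.py handles the real split, and details such as a separate validation split may differ:

```python
# Stratified 15% hold-out with a fixed seed, mirroring the evaluation setup.
from sklearn.model_selection import train_test_split

reviews = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]   # perfectly balanced, like the IMDB set

train_val, test, y_train_val, y_test = train_test_split(
    reviews, labels, test_size=0.15, stratify=labels, random_state=42
)
print(len(test))  # → 15
```

With the full 50,000-review dataset, the same parameters yield the 7,500-row test set used in the table above.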


Authors

  • Lidor Shachar
  • Christin Wøien Skattum

Course: AI3000R-1 Artificial Intelligence for Business Applications - Spring 2026

About

Repository for educational purposes. Designed around a compulsory assignment from the IT and Information Systems undergraduate program at the University of South-Eastern Norway.
