Binary sentiment classification on 50,000 IMDB movie reviews.
Compares four classic ML models (TF-IDF features) against an LSTM deep learning model and a pre-trained Transformer baseline.
Course project: AI/ML Compulsory Activity
Dataset: IMDB 50K Movie Reviews
- Project Overview
- Repository Structure
- Setup
- Dataset
- Usage - Run Notebooks in Order
- Models Implemented
- Results
- Authors
We build a binary sentiment classifier that labels IMDB movie reviews as positive or negative.
The project progresses from interpretable classic ML baselines to a sequence-aware deep learning model, and finishes with a zero-training pre-trained Transformer comparison.
Key questions we answer:
- Can a simple TF-IDF + Logistic Regression baseline match a neural network?
- What does an LSTM capture that bag-of-words models cannot?
- How does a purpose-trained model compare to a pre-trained Transformer out of the box?
```
imdb-sentiment-analysis/
├── README.md
├── requirements.txt                  # All dependencies (see Setup)
├── .gitignore
│
├── data/                             # Not committed (see Dataset section below)
│   ├── IMDB Dataset.csv              # Raw 50K reviews (download from Kaggle)
│   └── imdb_preprocessed.csv         # Generated by notebook 02
│
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA: class balance, review lengths, word frequencies
│   ├── 02_preprocessing.ipynb        # Text cleaning pipeline walkthrough
│   ├── 03_classic_ml_models.ipynb    # TF-IDF + LogReg, Naive Bayes, SVM, Random Forest
│   ├── 04_deep_learning_model.ipynb  # Word embeddings + LSTM (Keras/TensorFlow)
│   └── 05_model_comparison.ipynb     # Side-by-side evaluation + pre-trained baseline
│
├── src/
│   ├── __init__.py
│   ├── config.py                     # Shared paths and constants
│   ├── data_loader.py                # Load CSV, apply preprocessing, train/val/test split
│   ├── preprocessing.py              # HTML removal, lowercasing, stopwords, lemmatisation
│   ├── features.py                   # TF-IDF vectoriser + Keras tokeniser/padding
│   ├── classic_models.py             # Train and evaluate LogReg, NB, SVM, Random Forest
│   ├── dl_model.py                   # Build, train, and evaluate LSTM model
│   └── evaluation.py                 # Metrics, confusion matrices, comparison tables
│
├── models/                           # Saved model files (git-ignored, generated by notebooks)
│   └── .gitkeep                      # (tfidf_vectorizer, logistic_regression, linear_svm,
│                                     #  naive_bayes, random_forest, lstm_final, tokenizer_lstm)
│
├── results/
│   ├── figures/                      # Generated plots (14 PNG files)
│   └── metrics/                      # Saved metric CSVs (6 files)
│
├── report/
│   └── report.pdf                    # Final written report
│
├── presentation/
│   └── slides.pdf                    # Presentation slides
│
├── reflections/
│   ├── ls_reflection.pdf
│   └── cws_reflection.pdf
│
├── app/                              # Streamlit presentation app
│   ├── app.py                        # Entry point: streamlit run app/app.py
│   ├── _shared.py                    # Shared loaders and formatting helpers
│   └── pages/
│       ├── 1_Preprocessing.py        # Interactive preprocessing demo
│       ├── 2_Classic_ML.py           # Feature engineering and classic model results
│       ├── 3_Deep_Learning.py        # LSTM architecture, training curves, results
│       ├── 4_Model_Comparison.py     # Full comparison, error analysis, DistilBERT mismatch
│       └── 5_Live_Demo.py            # Real-time prediction from all 5 models
│
├── run_pipeline.py                   # Runs all notebooks sequentially (with --only / --from flags)
├── download_nltk_data.py             # One-time NLTK corpora download
└── COLAB_SETUP.md                    # Google Colab setup guide (GPU acceleration)
```
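As a rough illustration of the `--only` / `--from` flags mentioned for `run_pipeline.py`, notebook selection might look like the sketch below. This is a hypothetical reconstruction (the `NOTEBOOKS` list and `select_notebooks` name are illustrative), not the script's actual code:

```python
# Hypothetical sketch of how run_pipeline.py could resolve its
# --only / --from flags into a list of notebooks to execute.
from typing import List, Optional

NOTEBOOKS: List[str] = [
    "01_data_exploration",
    "02_preprocessing",
    "03_classic_ml_models",
    "04_deep_learning_model",
    "05_model_comparison",
]

def select_notebooks(only: Optional[str] = None,
                     start: Optional[str] = None) -> List[str]:
    """Pick notebooks to run: a single one (--only) or a tail of the list (--from)."""
    if only is not None:
        return [nb for nb in NOTEBOOKS if nb.startswith(only)]
    if start is not None:
        for i, nb in enumerate(NOTEBOOKS):
            if nb.startswith(start):
                return NOTEBOOKS[i:]  # that notebook and everything after it
        return []
    return list(NOTEBOOKS)  # default: run everything in order
```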
This project requires Python 3.11 exactly.
Why 3.11 specifically?
`tensorflow-cpu` (used for the LSTM model) does not yet have a release for Python 3.12 or 3.13.
Python 3.11 has pre-built wheels for every package in this project, meaning nothing needs to compile from source - installation is fast and requires no C compiler.
Download: python-3.11.9-amd64.exe (Windows 64-bit)
During installation - important settings:
If you are running the installer for the first time (no other Python installed):
- Check "Add Python to PATH"
- Click Install Now
If you already have another Python version installed (e.g. 3.12 or 3.13):
- Verify that "Add Python to PATH" is unchecked
- Click Customize installation
- On the Advanced Options screen, uncheck "Add Python to environment variables"
- Leave the install location as the default (`C:\Users\<you>\AppData\Local\Programs\Python\Python311`)
- Click Install
This keeps your existing Python version as the system default and lets you target 3.11 explicitly using the `py` launcher (see Step 3 below).
Verify the installation by opening a new terminal and running:
```
py --list
```

You should see both versions listed, for example:

```
 -V:3.13 *        Python 3.13.x    <- your existing default (untouched)
 -V:3.11          Python 3.11.9    <- newly installed
```
```
git clone https://github.com/Lidizz/imdb-sentiment-analysis.git
cd imdb-sentiment-analysis
```

A virtual environment isolates this project's packages from the rest of your system.
If Python 3.11 is your only / default Python:

```
python -m venv .venv
```

If you have multiple Python versions installed (e.g. also 3.12 or 3.13), use the `py` launcher to target 3.11 explicitly; otherwise the wrong Python version will be used:

```
py -3.11 -m venv .venv
```

Verify the venv is using 3.11 before continuing:

```
# Activate first
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # Mac / Linux

# Then check
python --version
# Expected output: Python 3.11.9
```

If the output shows any other version, delete the `.venv` folder and repeat this step using `py -3.11 -m venv .venv`.
```
pip install -r requirements.txt
```

This will download and install all packages. Expected install time is 5–15 minutes depending on your connection.
Install size: roughly 1–1.5 GB total. The two largest packages are `tensorflow-cpu` (~500 MB) and the Hugging Face model weights (~250 MB).
Run the following Python script from the project's root:

```
python download_nltk_data.py
```

This downloads the stopword list and WordNet lemmatizer data used in the text preprocessing pipeline (~20 MB total, saved to your home directory).
This project uses the IMDB Dataset of 50K Movie Reviews.
Download instructions:
- Go to https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Download the dataset as a zip (27 MB)
- Extract the file `IMDB Dataset.csv`
- Place it in the `data/` folder
The dataset is not committed to this repository.
Dataset summary:
| Property | Value |
|---|---|
| Total reviews | 50,000 |
| Columns | review (text), sentiment (positive/negative) |
| Source | Kaggle - Lakshmi Srinivas |
| Original paper | Maas et al. (2011), ACL |
An interactive Streamlit app that walks through the full project: preprocessing demo, model results, comparison charts, and a live sentiment prediction demo.
```
streamlit run app/app.py
```

Pages: Overview → Preprocessing → Classic ML → Deep Learning → Model Comparison → Live Demo

Requires the notebooks to have been run first so models and result CSVs exist in `models/` and `results/`.
(Optional) Start Jupyter:

NOTE: if you are using an IDE such as PyCharm, or an editor such as Visual Studio Code with the relevant extensions, you do not need to launch the local Jupyter server manually; the notebooks can be opened and run directly in the editor.

```
jupyter notebook
```

Then run the notebooks in sequence:
| # | Notebook | What it does |
|---|---|---|
| 1 | 01_data_exploration.ipynb | Load data, check class balance, plot review length distributions, word frequency analysis |
| 2 | 02_preprocessing.ipynb | Step-by-step text cleaning: HTML removal → lowercasing → stopwords → lemmatisation |
| 3 | 03_classic_ml_models.ipynb | TF-IDF feature extraction, train and evaluate LogReg / Naive Bayes / SVM / Random Forest |
| 4 | 04_deep_learning_model.ipynb | Word embeddings, LSTM architecture, training curves, evaluation on test set |
| 5 | 05_model_comparison.ipynb | All models side-by-side, ROC curves, error analysis, pre-trained Transformer baseline |
Each notebook imports reusable logic from `src/`, so no code is duplicated between notebooks.

- NOTE: The EDA notebook (`01_data_exploration.ipynb`) is a lightweight standalone.
- `src/` = implementation layer
  - Contains reusable functions (preprocessing, features, training, evaluation)
  - Acts as the single source of truth for project logic
- `notebooks/` = presentation + analysis layer
  - Demonstrates each pipeline stage with visuals, before/after examples, and commentary
  - Calls functions from `src/` instead of redefining them inline
- Users
  - Run notebooks in order and inspect outputs
  - No need to manually edit files in `src/` for normal usage
- Why this structure is used
  - Better maintainability (one fix in `src/` applies everywhere)
  - Better reproducibility (same logic reused across notebooks)
| Model | Approach | Features |
|---|---|---|
| Logistic Regression | Classic ML | TF-IDF (unigrams + bigrams) |
| Multinomial Naive Bayes | Classic ML | TF-IDF |
| Linear SVM | Classic ML | TF-IDF |
| Random Forest | Ensemble ML | TF-IDF + GridSearchCV tuning |
| LSTM | Deep Learning | Trainable word embeddings |
| Pre-trained Transformer | Zero-shot baseline | Hugging Face distilbert-base-uncased-finetuned-sst-2-english |
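The table's first row (TF-IDF unigrams + bigrams feeding Logistic Regression) can be sketched in a few lines, assuming scikit-learn is available. This pipeline is illustrative only, not the project's actual code in `src/features.py` / `src/classic_models.py`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_tfidf_logreg() -> Pipeline:
    """TF-IDF features (unigrams + bigrams) feeding a logistic regression classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
```

The same `Pipeline` pattern generalises to the Naive Bayes and Linear SVM rows by swapping the final estimator.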
Text preprocessing pipeline (applied to all models):
- Remove HTML tags (`<br />` etc.)
- Lowercase
- Remove punctuation and digits
- Remove stopwords (NLTK English list)
- Lemmatise (WordNet Lemmatizer)
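The steps above can be sketched with the standard library alone. Note that the tiny `STOPWORDS` set here is illustrative (the project uses NLTK's full English list), and lemmatisation is omitted to keep the sketch dependency-free:

```python
import re

# Illustrative stopword subset; the real pipeline uses NLTK's full English
# list and adds WordNet lemmatisation as a final step.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "of", "and", "br"}

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)    # 1. remove HTML tags like <br />
    text = text.lower()                     # 2. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # 3. remove punctuation and digits
    words = [w for w in text.split() if w not in STOPWORDS]  # 4. drop stopwords
    return " ".join(words)
```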
All models evaluated on the same held-out test set (7,500 rows, 15% stratified split, random_state=42).
| Model | Accuracy | F1-Score | Notes |
|---|---|---|---|
| DistilBERT (raw text) | ~89.3% | ~0.890 | Zero-shot inference, no fine-tuning |
| Logistic Regression (TF-IDF) | ~89.2% | ~0.893 | Best trained model |
| LSTM (Keras) | ~88.7% | ~0.888 | Sequence-aware deep learning |
| Linear SVM (TF-IDF) | ~88.7% | ~0.887 | Close to LogReg |
| Naive Bayes (TF-IDF) | ~86.9% | ~0.871 | Fast probabilistic baseline |
| Random Forest (TF-IDF) | ~85.7% | ~0.857 | GridSearchCV tuned |
| DistilBERT (preprocessed text) | ~77.4% | ~0.727 | Preprocessing mismatch (see notebook 05) |
Key finding: A simple TF-IDF + Logistic Regression (89.2%) matches a zero-shot DistilBERT transformer (89.3%) on this task — showing that word-choice signal dominates for IMDB sentiment. DistilBERT's low score on preprocessed text (77.4%) reflects input format incompatibility, not architectural weakness.
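The Accuracy and F1-Score columns above follow the standard definitions and need no library to reproduce. A dependency-free sketch (`accuracy` and `f1_binary` are illustrative names; the project's actual metric code is described as living in `src/evaluation.py`):

```python
from typing import List

def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true: List[str], y_pred: List[str],
              positive: str = "positive") -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```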
- Lidor Shachar
- Christin Wøien Skattum
Course: AI3000R-1 Artificial Intelligence for Business Applications - Spring 2026