Binary sentiment classification on 50,000 IMDB movie reviews.
Compares four classic ML models (TF-IDF features) against an LSTM deep learning model and a pre-trained Transformer baseline.
Course project: AI/ML Compulsory Activity
Dataset: IMDB 50K Movie Reviews
- Project Overview
- Repository Structure
- Setup
- Dataset
- Usage - Run Notebooks in Order
- Models Implemented
- Results
- Authors
We build a binary sentiment classifier that labels IMDB movie reviews as positive or negative.
The project progresses from interpretable classic ML baselines to a sequence-aware deep learning model, and finishes with a zero-training pre-trained Transformer comparison.
Key questions we answer:
- Can a simple TF-IDF + Logistic Regression baseline match a neural network?
- What does an LSTM capture that bag-of-words models cannot?
- How does a purpose-trained model compare to a pre-trained Transformer out of the box?
```
imdb-sentiment-analysis/
├── README.md
├── requirements.txt                  # All dependencies (see Setup)
├── .gitignore
│
├── data/                             # Not committed (see Dataset section below)
│   ├── IMDB Dataset.csv              # Raw 50K reviews (download from Kaggle)
│   └── imdb_preprocessed.csv         # Generated by notebook 02
│
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA: class balance, review lengths, word frequencies
│   ├── 02_preprocessing.ipynb        # Text cleaning pipeline walkthrough
│   ├── 03_classic_ml_models.ipynb    # TF-IDF + LogReg, Naive Bayes, SVM, Random Forest
│   ├── 04_deep_learning_model.ipynb  # Word embeddings + LSTM (Keras/TensorFlow)
│   └── 05_model_comparison.ipynb     # Side-by-side evaluation + pre-trained baseline
│
├── src/
│   ├── __init__.py
│   ├── config.py                     # Shared paths and constants
│   ├── data_loader.py                # Load CSV, apply preprocessing, train/val/test split
│   ├── preprocessing.py              # HTML removal, lowercasing, stopwords, lemmatisation
│   ├── features.py                   # TF-IDF vectoriser + Keras tokeniser/padding
│   ├── classic_models.py             # Train and evaluate LogReg, NB, SVM, Random Forest
│   ├── dl_model.py                   # Build, train, and evaluate LSTM model
│   └── evaluation.py                 # Metrics, confusion matrices, comparison tables
│
├── models/                           # Saved model files (git-ignored, generated by notebooks)
│   └── .gitkeep                      # (tfidf_vectorizer, logistic_regression, linear_svm,
│                                     #  naive_bayes, random_forest, lstm_final, tokenizer_lstm)
│
├── results/
│   ├── figures/                      # Generated plots (14 PNG files)
│   └── metrics/                      # Saved metric CSVs (6 files)
│
├── report/
│   └── report.pdf                    # Final written report
│
├── presentation/
│   └── slides.pdf                    # Presentation slides
│
├── reflections/
│   ├── ls_reflection.pdf
│   └── cws_reflection.pdf
│
├── app/                              # Streamlit presentation app
│   ├── app.py                        # Entry point: streamlit run app/app.py
│   ├── _shared.py                    # Shared loaders and formatting helpers
│   └── pages/
│       ├── 1_Preprocessing.py        # Interactive preprocessing demo
│       ├── 2_Classic_ML.py           # Feature engineering and classic model results
│       ├── 3_Deep_Learning.py        # LSTM architecture, training curves, results
│       ├── 4_Model_Comparison.py     # Full comparison, error analysis, DistilBERT mismatch
│       └── 5_Live_Demo.py            # Real-time prediction from all 5 models
│
├── run_pipeline.py                   # Runs all notebooks sequentially (with --only / --from flags)
├── download_nltk_data.py             # One-time NLTK corpora download
└── COLAB_SETUP.md                    # Google Colab setup guide (GPU acceleration)
```
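As a rough illustration of the `--only` / `--from` flags mentioned for `run_pipeline.py`, notebook selection might look like the sketch below. This is a hypothetical reconstruction (the `NOTEBOOKS` list and `select_notebooks` name are illustrative), not the script's actual code:

```python
# Hypothetical sketch of how run_pipeline.py could resolve its
# --only / --from flags into a list of notebooks to execute.
from typing import List, Optional

NOTEBOOKS: List[str] = [
    "01_data_exploration",
    "02_preprocessing",
    "03_classic_ml_models",
    "04_deep_learning_model",
    "05_model_comparison",
]

def select_notebooks(only: Optional[str] = None,
                     start: Optional[str] = None) -> List[str]:
    """Pick notebooks to run: a single one (--only) or a tail of the list (--from)."""
    if only is not None:
        return [nb for nb in NOTEBOOKS if nb.startswith(only)]
    if start is not None:
        for i, nb in enumerate(NOTEBOOKS):
            if nb.startswith(start):
                return NOTEBOOKS[i:]  # that notebook and everything after it
        return []
    return list(NOTEBOOKS)  # default: run everything in order
```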
This project requires Python 3.11 exactly.
Why 3.11 specifically?
`tensorflow-cpu` (used for the LSTM model) does not yet have a release for Python 3.12 or 3.13.
Python 3.11 has pre-built wheels for every package in this project, meaning nothing needs to compile from source - installation is fast and requires no C compiler.
Download: python-3.11.9-amd64.exe (Windows 64-bit)
During installation - important settings:
If you are running the installer for the first time (no other Python installed):
- Check "Add Python to PATH"
- Click Install Now
If you already have another Python version installed (e.g. 3.12 or 3.13):
- Verify that "Add Python to PATH" is unchecked
- Click Customize installation
- On the Advanced Options screen, uncheck "Add Python to environment variables"
- Leave the install location as the default (`C:\Users\<you>\AppData\Local\Programs\Python\Python311`)
- Click Install
This keeps your existing Python version as the system default and lets you target 3.11 explicitly using the `py` launcher (see Step 3 below).
Verify the installation by opening a new terminal and running:
```
py --list
```

You should see both versions listed, for example:

```
 -V:3.13 *        Python 3.13.x    <- your existing default (untouched)
 -V:3.11          Python 3.11.9    <- newly installed
```
```
git clone https://github.com/Lidizz/imdb-sentiment-analysis.git
cd imdb-sentiment-analysis
```

A virtual environment isolates this project's packages from the rest of your system.
If Python 3.11 is your only / default Python:

```
python -m venv .venv
```

If you have multiple Python versions installed (e.g. also 3.12 or 3.13), use the `py` launcher to target 3.11 explicitly; otherwise the wrong Python version will be used:

```
py -3.11 -m venv .venv
```

Verify the venv is using 3.11 before continuing:

```
# Activate first
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # Mac / Linux

# Then check
python --version
# Expected output: Python 3.11.9
```

If the output shows any other version, delete the `.venv` folder and repeat this step using `py -3.11 -m venv .venv`.
```
pip install -r requirements.txt
```

This will download and install all packages. Expected install time is 5–15 minutes depending on your connection.
Install size: roughly 1–1.5 GB total. The two largest packages are `tensorflow-cpu` (~500 MB) and the Hugging Face model weights (~250 MB).
Run the following Python script from the project's root:

```
python download_nltk_data.py
```

This downloads the stopword list and WordNet lemmatizer data used in the text preprocessing pipeline (~20 MB total, saved to your home directory).
This project uses the IMDB Dataset of 50K Movie Reviews.
Download instructions:
- Go to https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Download the dataset as a zip (27 MB)
- Extract the file `IMDB Dataset.csv`
- Place it in the `data/` folder
The dataset is not committed to this repository.
Dataset summary:
| Property | Value |
|---|---|
| Total reviews | 50,000 |
| Columns | review (text), sentiment (positive/negative) |
| Source | Kaggle - Lakshmi Srinivas |
| Original paper | Maas et al. (2011), ACL |
An interactive Streamlit app that walks through the full project: preprocessing demo, model results, comparison charts, and a live sentiment prediction demo.
```
streamlit run app/app.py
```

Pages: Overview → Preprocessing → Classic ML → Deep Learning → Model Comparison → Live Demo

Requires the notebooks to have been run first so models and result CSVs exist in `models/` and `results/`.
(Optional) Start Jupyter:

NOTE: if you are using an IDE such as PyCharm, or an editor such as Visual Studio Code with the relevant extensions, you do not need to launch the local Jupyter server manually; the notebooks can be opened and run directly in the editor.

```
jupyter notebook
```

Then run the notebooks in sequence:
| # | Notebook | What it does |
|---|---|---|
| 1 | 01_data_exploration.ipynb | Load data, check class balance, plot review length distributions, word frequency analysis |
| 2 | 02_preprocessing.ipynb | Step-by-step text cleaning: HTML removal → lowercasing → stopwords → lemmatisation |
| 3 | 03_classic_ml_models.ipynb | TF-IDF feature extraction, train and evaluate LogReg / Naive Bayes / SVM / Random Forest |
| 4 | 04_deep_learning_model.ipynb | Word embeddings, LSTM architecture, training curves, evaluation on test set |
| 5 | 05_model_comparison.ipynb | All models side-by-side, ROC curves, error analysis, pre-trained Transformer baseline |
Each notebook imports reusable logic from `src/`, so no code is duplicated between notebooks.

- NOTE: The EDA notebook (`01_data_exploration.ipynb`) is a lightweight standalone.
- `src/` = implementation layer
  - Contains reusable functions (preprocessing, features, training, evaluation)
  - Acts as the single source of truth for project logic
- `notebooks/` = presentation + analysis layer
  - Demonstrates each pipeline stage with visuals, before/after examples, and commentary
  - Calls functions from `src/` instead of redefining them inline
- Users
  - Run notebooks in order and inspect outputs
  - No need to manually edit files in `src/` for normal usage
- Why this structure is used
  - Better maintainability (one fix in `src/` applies everywhere)
  - Better reproducibility (same logic reused across notebooks)
| Model | Approach | Features |
|---|---|---|
| Logistic Regression | Classic ML | TF-IDF (unigrams + bigrams) |
| Multinomial Naive Bayes | Classic ML | TF-IDF |
| Linear SVM | Classic ML | TF-IDF |
| Random Forest | Ensemble ML | TF-IDF + GridSearchCV tuning |
| LSTM | Deep Learning | Trainable word embeddings |
| Pre-trained Transformer | Zero-shot baseline | Hugging Face distilbert-base-uncased-finetuned-sst-2-english |
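The table's first row (TF-IDF unigrams + bigrams feeding Logistic Regression) can be sketched in a few lines, assuming scikit-learn is available. This pipeline is illustrative only, not the project's actual code in `src/features.py` / `src/classic_models.py`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_tfidf_logreg() -> Pipeline:
    """TF-IDF features (unigrams + bigrams) feeding a logistic regression classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
```

The same `Pipeline` pattern generalises to the Naive Bayes and Linear SVM rows by swapping the final estimator.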
Text preprocessing pipeline (applied to all models):
- Remove HTML tags (`<br />` etc.)
- Lowercase
- Remove punctuation and digits
- Remove stopwords (NLTK English list)
- Lemmatise (WordNet Lemmatizer)
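The steps above can be sketched with the standard library alone. Note that the tiny `STOPWORDS` set here is illustrative (the project uses NLTK's full English list), and lemmatisation is omitted to keep the sketch dependency-free:

```python
import re

# Illustrative stopword subset; the real pipeline uses NLTK's full English
# list and adds WordNet lemmatisation as a final step.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "of", "and", "br"}

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)    # 1. remove HTML tags like <br />
    text = text.lower()                     # 2. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)   # 3. remove punctuation and digits
    words = [w for w in text.split() if w not in STOPWORDS]  # 4. drop stopwords
    return " ".join(words)
```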
All models evaluated on the same held-out test set (7,500 rows, 15% stratified split, random_state=42).
| Model | Accuracy | F1-Score | Notes |
|---|---|---|---|
| DistilBERT (raw text) | ~89.3% | ~0.890 | Zero-shot inference, no fine-tuning |
| Logistic Regression (TF-IDF) | ~89.2% | ~0.893 | Best trained model |
| LSTM (Keras) | ~88.7% | ~0.888 | Sequence-aware deep learning |
| Linear SVM (TF-IDF) | ~88.7% | ~0.887 | Close to LogReg |
| Naive Bayes (TF-IDF) | ~86.9% | ~0.871 | Fast probabilistic baseline |
| Random Forest (TF-IDF) | ~85.7% | ~0.857 | GridSearchCV tuned |
| DistilBERT (preprocessed text) | ~77.4% | ~0.727 | Preprocessing mismatch (see notebook 05) |
Key finding: A simple TF-IDF + Logistic Regression (89.2%) matches a zero-shot DistilBERT transformer (89.3%) on this task — showing that word-choice signal dominates for IMDB sentiment. DistilBERT's low score on preprocessed text (77.4%) reflects input format incompatibility, not architectural weakness.
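The Accuracy and F1-Score columns above follow the standard definitions and need no library to reproduce. A dependency-free sketch (`accuracy` and `f1_binary` are illustrative names; the project's actual metric code is described as living in `src/evaluation.py`):

```python
from typing import List

def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true: List[str], y_pred: List[str],
              positive: str = "positive") -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```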
- Lidor Shachar
- Christin Wøien Skattum
Course: AI3000R-1 Artificial Intelligence for Business Applications - Spring 2026