A conversational AI chatbot that recommends books, classifies genres, finds similar reads, and answers questions about your favourite titles.
| Intent | Example |
|---|---|
| 📖 Recommendations | "Recommend a mystery novel" |
| 🎭 Vibe Search | "I want something dark and psychological" |
| 🔍 Similar Books | "Books similar to Pride and Prejudice" |
| 🤖 Genre Classification | "What genre is Dune?" |
| 📝 Summarize | "Summarize 1984" |
| ✍️ Author Lookup | "Who wrote The Road?" |
| ⭐ Opinions | "Is Harry Potter worth reading?" |
| ℹ️ Book Details | "Tell me about Dune" |
BookWorm is a pipeline of several components working together:
User messages are classified into one of 9 intents using regex patterns — RECOMMEND, SIMILAR, VIBE_SEARCH, CLASSIFY_BOOK, SUMMARIZE, AUTHOR, OPINION, DETAIL, or GREET.
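The patterns below give a rough idea of how such a regex router can look; they are illustrative stand-ins, not the exact expressions used in `app.py`:

```python
import re

# Illustrative patterns only -- the real expressions live in app.py.
INTENT_PATTERNS = [
    ("GREET",         re.compile(r"^\s*(hi|hello|hey)\b", re.I)),
    ("SIMILAR",       re.compile(r"\bsimilar to\b", re.I)),
    ("VIBE_SEARCH",   re.compile(r"\bsomething\b.+\b(dark|cozy|funny|uplifting)\b", re.I)),
    ("CLASSIFY_BOOK", re.compile(r"\bwhat genre\b", re.I)),
    ("SUMMARIZE",     re.compile(r"\bsummari[sz]e\b", re.I)),
    ("AUTHOR",        re.compile(r"\bwho wrote\b", re.I)),
    ("OPINION",       re.compile(r"\bworth reading\b", re.I)),
    ("RECOMMEND",     re.compile(r"\b(recommend|suggest)\b", re.I)),
]

def classify_intent(message: str) -> str:
    """Return the first matching intent; DETAIL is the fallback."""
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(message):
            return intent
    return "DETAIL"

print(classify_intent("Recommend a mystery novel"))  # RECOMMEND
```

Checking the narrower patterns first keeps a broad verb like "recommend" from shadowing the more specific intents.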
A multi-pass title extractor (sketched after this list) pulls book names from natural language queries using:
- Exact substring matching against all known titles
- First-words matching (first 2–3 words of title)
- Fuzzy matching via `difflib` after stripping intent phrases
- FAISS semantic fallback for ambiguous cases
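A minimal sketch of the first three passes, assuming a plain `titles` list from the dataset; the helper name and the intent-phrase list are hypothetical:

```python
import difflib
from typing import List, Optional

# Hypothetical intent phrases to strip before fuzzy matching.
INTENT_PHRASES = ("recommend", "books similar to", "summarize", "tell me about", "what genre is")

def extract_title(query: str, titles: List[str]) -> Optional[str]:
    q = query.lower()
    # Pass 1: exact substring match against every known title.
    for title in titles:
        if title.lower() in q:
            return title
    # Pass 2: match on the first 2-3 words of each title.
    for title in titles:
        words = title.lower().split()
        for n in (3, 2):
            if len(words) >= n and " ".join(words[:n]) in q:
                return title
    # Pass 3: fuzzy match via difflib after stripping intent phrases.
    stripped = q
    for phrase in INTENT_PHRASES:
        stripped = stripped.replace(phrase, "")
    close = difflib.get_close_matches(stripped.strip(), [t.lower() for t in titles], n=1, cutoff=0.6)
    if close:
        return next(t for t in titles if t.lower() == close[0])
    return None  # Pass 4 (not shown): fall back to FAISS semantic search.
```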
A `distilbert-base-uncased` model fine-tuned on book descriptions to predict genres. Key training details (the weighting scheme is sketched after this list):
- Inverse-sqrt class weighting to handle genre imbalance
- Custom `WeightedTrainer` built on HuggingFace's `Trainer`
- 80/10/10 stratified train/val/test split
- 4 epochs, learning rate `2e-5`, max sequence length 256
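A rough sketch of the idea; `inverse_sqrt_weights` is a hypothetical helper, and the notebook's `WeightedTrainer` may differ in detail:

```python
import numpy as np
import torch
from transformers import Trainer

def inverse_sqrt_weights(labels: np.ndarray) -> torch.Tensor:
    """Per-class weights ~ 1/sqrt(count): rarer genres get larger weights."""
    counts = np.bincount(labels)  # assumes integer-encoded genre labels
    return torch.tensor(1.0 / np.sqrt(counts), dtype=torch.float32)

class WeightedTrainer(Trainer):
    """Trainer subclass that applies class weights in the cross-entropy loss."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```

Constructing the trainer with `class_weights=inverse_sqrt_weights(train_labels)` (where `train_labels` is the integer-encoded label array) then penalises mistakes on rare genres more heavily.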
Every book is embedded using `all-MiniLM-L6-v2` (SentenceTransformers) and indexed in a FAISS flat inner-product index. This powers both vibe-based recommendation and similar-book search.
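In outline, the index is built and queried like this (a sketch; the two-item `descriptions` list stands in for the description column of `books.csv`):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed every book description; L2-normalising makes inner product equal cosine similarity.
descriptions = ["A dark psychological thriller...", "A cozy small-town mystery..."]
embeddings = model.encode(descriptions, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])  # flat inner-product index
index.add(embeddings)

# Vibe search: embed the free-text query the same way and take the top-k books.
query = model.encode(["something dark and psychological"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)
```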
For books not in the local dataset, the chatbot queries the Open Library API to fetch author, year, and subject information.
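A minimal version of such a lookup against Open Library's public search endpoint (`open_library_lookup` is a hypothetical helper name, not necessarily the one in `app.py`):

```python
import requests

def open_library_lookup(title: str):
    """Fetch author, first-publication year, and subjects for a title."""
    resp = requests.get(
        "https://openlibrary.org/search.json",
        params={"title": title, "limit": 1},
        timeout=10,
    )
    resp.raise_for_status()
    docs = resp.json().get("docs", [])
    if not docs:
        return None
    doc = docs[0]
    return {
        "author": ", ".join(doc.get("author_name", [])),
        "year": doc.get("first_publish_year"),
        "subjects": doc.get("subject", [])[:10],
    }

print(open_library_lookup("The Road"))
```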
Best Books Ever Dataset from Kaggle — containing titles, authors, genres, ratings, page counts, descriptions, and publication dates.
Preprocessing steps include duplicate removal, page count normalization, genre list parsing, and filtering out books with missing descriptions.
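Sketched in pandas, under the assumption that the raw export uses columns named `title`, `description`, `pages`, and `genres` (the actual column names and filename are defined in the notebook):

```python
import ast
import pandas as pd

df = pd.read_csv("best_books_ever.csv")  # hypothetical filename for the raw Kaggle export

df = df.drop_duplicates(subset="title")                    # remove duplicate titles
df = df[df["description"].notna()]                         # drop books with missing descriptions
df["pages"] = pd.to_numeric(df["pages"], errors="coerce")  # normalize page counts to numbers
df["genres"] = df["genres"].apply(ast.literal_eval)        # parse stringified genre lists
df.to_csv("books.csv", index=False)
```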
- Model: `distilbert-base-uncased` (HuggingFace Transformers)
- Embeddings: `all-MiniLM-L6-v2` (SentenceTransformers)
- Vector Search: FAISS (`faiss-cpu`)
- UI: Gradio Blocks
- External API: Open Library
- Other: PyTorch, scikit-learn, pandas, NumPy
BookWorm/
├── BookWorm.ipynb # Full training pipeline — data prep, model fine-tuning, index building
├── app.py # Gradio app (loads pretrained artifacts and serves the chatbot)
├── requirements.txt # Python dependencies
└── README.md
Note: Large artifacts are hosted on Hugging Face Spaces and are not included in this repository:
- `books.csv` — preprocessed dataset
- `embeddings.npy` — precomputed FAISS book embeddings
- `genre-classifier-final/` — fine-tuned DistilBERT weights and tokenizer
- Clone the repository

  ```bash
  git clone https://github.com/NaghamProgrammer/BookWorm.git
  cd BookWorm
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Download the large artifacts from Hugging Face

  Go to the Hugging Face Space files and manually download:

  - `books.csv`
  - `embeddings.npy`
  - `genre-classifier-final/` (the full directory)

  Place them all in the root of the project folder.

- Launch the app

  ```bash
  python app.py
  ```

To retrain everything from scratch instead, run the cells in `BookWorm.ipynb` from top to bottom. This will:
- Download the dataset from Kaggle via `kagglehub` (see the sketch below)
- Preprocess and save `books.csv`
- Fine-tune DistilBERT and save the model to `genre-classifier-final/`
- Build and save the FAISS index as `embeddings.npy`
- Launch the Gradio app
⚠️ Retraining requires a GPU and takes approximately 20–30 minutes.
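For reference, the `kagglehub` download step looks roughly like this; the dataset slug shown is illustrative, and the authoritative one is in the notebook:

```python
import kagglehub

# Downloads the dataset into a local cache and returns the path to its files.
# The slug below is illustrative; the notebook holds the authoritative one.
path = kagglehub.dataset_download("pooriamst/best-books-ever-dataset")
print("Dataset downloaded to:", path)
```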
Try it instantly — no setup required:
👉 BookWorm on Hugging Face Spaces
This project is released under the MIT License.
Built with ❤️ and a lot of books.