This repository contains a comprehensive collection of NLP techniques and implementations, ranging from basic text processing to advanced machine learning applications.
NLP/
├── README.md
├── download_nltk_data.py
├── tokenization.py
├── stemming.py
├── lemmatization.py
├── bag_of_words.py
├── TF_IDF.py
├── word2vec.py
├── Pos_tagging.py
├── NER.py
└── projects/
└── spam_classifier/
├── spam_classifier.py
└── data/
└── SMSSpamCollection.txt
Before running the scripts, make sure you have the following packages installed:
pip install nltk spacy scikit-learn pandas numpy gensim-
Download NLTK Data: Run the setup script to download required NLTK data:
python download_nltk_data.py
-
Install spaCy Model: Download the English language model for spaCy:
python -m spacy download en_core_web_sm
- Purpose: Demonstrates sentence and word tokenization using NLTK
- Features:
- Sentence tokenization
- Word tokenization
- Token counting
- Usage:
python tokenization.py
- Purpose: Implements Porter Stemmer for word stemming
- Features:
- Porter Stemmer implementation
- Word normalization
- Usage:
python stemming.py
- Purpose: Demonstrates lemmatization using WordNet
- Features:
- WordNet Lemmatizer
- Part-of-speech aware lemmatization
- Usage:
python lemmatization.py
- Purpose: Implements Bag of Words (BoW) feature extraction
- Features:
- Text preprocessing (cleaning, lowercase, stopword removal)
- Lemmatization
- CountVectorizer implementation
- Usage:
python bag_of_words.py
- Purpose: Implements TF-IDF (Term Frequency-Inverse Document Frequency) feature extraction
- Features:
- TF-IDF vectorization
- Text preprocessing pipeline
- Usage:
python TF_IDF.py
- Purpose: Implements Word2Vec for word embeddings
- Features:
- Word2Vec model training
- Text preprocessing for embeddings
- Vocabulary extraction
- Usage:
python word2vec.py
- Purpose: Demonstrates Part-of-Speech (POS) tagging using spaCy
- Features:
- POS tagging
- Detailed token analysis
- POS explanations
- Usage:
python Pos_tagging.py
- Purpose: Implements Named Entity Recognition (NER) using spaCy
- Features:
- Named entity detection
- Entity type classification
- Detailed token analysis
- Entity explanations
- Usage:
python NER.py
A complete spam detection system using machine learning techniques.
- Dataset: SMS Spam Collection Dataset
- Preprocessing: Text cleaning, lemmatization, stopword removal
- Feature Extraction: Bag of Words with CountVectorizer
- Model: Multinomial Naive Bayes
- Evaluation: Confusion matrix and accuracy metrics
cd projects/spam_classifier
python spam_classifier.pyThe spam classifier uses the SMS Spam Collection Dataset located in projects/spam_classifier/data/SMSSpamCollection.txt.
- NLTK: Natural Language Toolkit for text processing
- spaCy: Advanced NLP library for linguistic analysis
- scikit-learn: Machine learning library for feature extraction and modeling
- pandas: Data manipulation and analysis
- gensim: Word embeddings and topic modeling
- NumPy: Numerical computing
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', ...]
len of words 45
Elon ! PROPN NNP proper noun
flew ! VERB VBD verb, past tense
to ! ADP IN adposition
Mars ! PROPN NNP proper noun
yesterday ! NOUN NN noun, singular or mass
Entity Text | Label | Description
John Doe | PERSON| People, including fictional
New York City| GPE | Countries, cities, states
Google | ORG | Companies, agencies, institutions