This project addresses advanced sentiment analysis and named entity recognition (NER) for Arabic tweets related to the coronavirus pandemic. Employing modern natural language processing tools and machine learning models, this research analyzes public sentiments and identifies key entities in the Arabic social media landscape. The study leverages three traditional machine learning models and six state-of-the-art language models to provide comprehensive insights into public discourse surrounding COVID-19 in the Arab world.
- Introduction
- Objectives
- Dataset
- Methodology
- Model Development
- Results
- Conclusion
- Installation
- Usage
- Repository Structure
- References
This project develops a system that analyzes the sentiment expressed in Arabic tweets and identifies the named entities they mention. As social media use grows across the Arab world, understanding sentiment and extracting structured information from tweets has become increasingly important.
This project leverages state-of-the-art natural language processing techniques and machine learning models to perform sentiment analysis and NER on Arabic tweets. By accurately identifying sentiments and extracting named entities, the system enables users to gain insights into public opinion, track trends, and understand sentiment associated with specific entities mentioned in tweets.
The primary objectives of this project are:
- Sentiment Analysis: Utilize three machine learning models (Logistic Regression, Neural Networks, and Random Forest) to analyze sentiment expressed in Arabic tweets and classify them as positive or negative.
- Language Models Evaluation: Explore the effectiveness of six state-of-the-art language models (BERT, AraBERT, DistilBERT, LLAMA2, GPT-3.5, and GPT-4) for sentiment analysis in Arabic.
- Named Entity Recognition: Implement NER techniques to identify and extract key entities mentioned in Arabic tweets, including locations, organizations, and individuals relevant to coronavirus pandemic discussions.
- Insights Generation: Synthesize findings from sentiment analysis and NER to generate actionable insights that can inform decision-making processes, public health interventions, and communication strategies related to COVID-19.
Dataset Name: Arabic Corona Tweets Dataset
Description: The dataset contains 2,400 Arabic-language tweets related to the coronavirus pandemic.
Structure:
| Feature | Description | Data Type |
|---|---|---|
| Sentence | Contains 2400 Arabic tweet sentences | object |
| Class | Labels: 0 for negative sentiment, 2 for positive sentiment | int64 |
Source: Pre-annotated dataset labeled with sentiments for supervised learning.
- Pandas: Data manipulation and analysis
- NumPy: Numerical computations and array operations
- Matplotlib: Data visualization and chart creation
- Seaborn: Enhanced data visualization capabilities
- NLTK: Natural language processing tasks (tokenization, stopwords removal)
- WordCloud: Generating word clouds for word frequency visualization
- OpenAI: Accessing GPT-3.5 and GPT-4 models
- Transformers: Accessing pre-trained language models (BERT, AraBERT, DistilBERT, LLAMA2)
- Scikit-learn: Machine learning tasks (text vectorization, classification)
- Re: Pattern matching and text processing
- PrettyTable: Generating formatted tables
Arabic language presents unique challenges for NLP due to its morphological complexities and dialectal varieties. The following preprocessing steps were implemented:
- Data Cleaning: Removal of special characters, symbols, and other non-text noise from tweet text.
- Normalization: Standardization of Arabic character representation. Alef variants such as "أ", "آ", and "إ" were normalized to the bare alef "ا".
- Punctuation Removal: Removal of Arabic and English punctuation marks using predefined lists.
- Tokenization: Segmentation of cleaned tweet text into individual words using regular expressions.
- Stopwords Removal: Elimination of common Arabic words (e.g., "و", "في", "على") to reduce noise and focus on meaningful content.
These preprocessing steps were applied uniformly to both positive and negative tweet datasets to prepare the text data for analysis.
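The steps above can be sketched as a single function. This is a minimal illustration, not the notebook's actual code: the stopword set is a short hypothetical stand-in for NLTK's Arabic list, the punctuation list is illustrative, and stripping URLs and mentions is an assumed part of the cleaning step.

```python
import re
import string
from typing import List

# Hypothetical, abbreviated stopword set; the project uses NLTK's Arabic stopwords.
ARABIC_STOPWORDS = {"و", "في", "على", "من", "عن"}

# English punctuation plus a few common Arabic marks (illustrative, not exhaustive).
PUNCTUATION = string.punctuation + "،؛؟"


def normalize(text: str) -> str:
    """Map the alef variants أ, آ, إ to the bare alef ا."""
    return re.sub("[أآإ]", "ا", text)


def preprocess(tweet: str) -> List[str]:
    # Assumed cleaning step: drop URLs and @mentions before anything else.
    text = re.sub(r"https?://\S+|@\w+", " ", tweet)
    text = normalize(text)
    # Replace every punctuation mark with a space.
    text = text.translate(str.maketrans({ch: " " for ch in PUNCTUATION}))
    # Regex tokenization: runs of word characters (covers Arabic letters).
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in ARABIC_STOPWORDS]


print(preprocess("أعلنت الصحة عن إصابات في مصر! https://t.co/x"))
```

Note that `\w` does not match Arabic diacritics, so text containing tashkeel would need those marks removed before tokenization.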
The dataset was divided into training and testing segments:
- Training Set: 70% of the data for model training
- Testing Set: 30% of the data for model evaluation
This split ensures that models are evaluated on data they did not see during training, giving a fair estimate of their predictive accuracy.
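A 70/30 split can be sketched without any dependencies as follows. The stratification by class, the seed, and the balanced label list are assumptions made for illustration; the source does not state whether the split was stratified or what the class ratio is.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_ratio: float = 0.3, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), preserving the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 2,400 tweets, assumed evenly split between classes 0 and 2.
labels = [0] * 1200 + [2] * 1200
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 1680 720
```

With scikit-learn installed, `train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)` gives an equivalent split in one call.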
Four quantitative metrics were used to assess model performance, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives:
- Accuracy: Ratio of correct predictions to total predictions
  - Formula: (TP + TN) / (TP + TN + FP + FN)
- Precision: Proportion of positive predictions that are correct
  - Formula: TP / (TP + FP)
- Recall: Proportion of actual positive instances that are correctly identified
  - Formula: TP / (TP + FN)
- F1-Score: Harmonic mean of precision and recall
  - Formula: 2 × (Precision × Recall) / (Precision + Recall)
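As a worked example, the four formulas can be computed directly from confusion-matrix counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the project's results.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 80 TP, 90 TN, 10 FP, 20 FN.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts, accuracy is 170/200 = 0.85, precision is 80/90 ≈ 0.8889, recall is 80/100 = 0.80, and F1 is 160/190 ≈ 0.8421.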
Three traditional machine learning algorithms were implemented:
- Logistic Regression: A linear classification algorithm that estimates the probability of class membership using the logistic function. Chosen for its computational efficiency and interpretability.
- Random Forest: An ensemble learning method based on decision trees. Constructs multiple decision trees during training and combines their predictions to improve accuracy and reduce overfitting.
- Artificial Neural Networks (ANN): Computational models inspired by the human brain structure, composed of interconnected layers of neurons. Excels at modeling complex, non-linear relationships within data.
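Six language models were employed, spanning two categories: BERT-family encoders (BERT, AraBERT, DistilBERT) and generative large language models (LLAMA2, GPT-3.5, GPT-4):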
- BERT: Bidirectional Encoder Representations from Transformers, renowned for understanding contextual information bidirectionally
- AraBERT: BERT variant specifically tailored for Arabic language
- DistilBERT: Distilled version of BERT, balancing inference speed and memory efficiency
- LLAMA2: Openly available general-purpose large language model from Meta, trained predominantly on English text and applied here to Arabic
- GPT-3.5: Advanced generative model with strong contextual understanding
- GPT-4: Successor to GPT-3.5 in OpenAI's GPT series, with improved performance
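For the generative models, one common approach is to ask for a one-word label and map the reply back to a class. The sketch below covers only prompt construction and reply parsing; the prompt wording, the `build_prompt` and `parse_label` helpers, and the assumption that the models were prompted zero-shot are all hypothetical, and the model call itself is deliberately left out.

```python
from typing import Optional

LABELS = ("positive", "negative")


def build_prompt(tweet: str) -> str:
    """Build a zero-shot classification prompt (hypothetical wording)."""
    return (
        "Classify the sentiment of the following Arabic tweet as "
        "'positive' or 'negative'. Answer with one word.\n\n"
        f"Tweet: {tweet}\nSentiment:"
    )


def parse_label(reply: str) -> Optional[str]:
    """Map a free-text model reply to a label, or None if it is unusable."""
    cleaned = reply.strip().lower()
    for label in LABELS:
        if cleaned.startswith(label):
            return label
    return None


print(build_prompt("الحمد لله على الشفاء"))
print(parse_label(" Positive."))
```

Returning `None` for unparseable replies keeps them out of the metrics unless they are handled explicitly, for example by counting them as errors.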
NER implementation utilized the Abdusah/Arabert-ner model through the Hugging Face Transformers library. The process involved:
- Tokenization of input text into manageable chunks
- Processing each chunk through the NER pipeline
- Identification and extraction of named entities (locations, organizations, individuals)
- Handling of overlapping entities across text chunks
- Consolidation of results for comprehensive entity recognition
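The chunking and consolidation steps can be sketched independently of the model. Here `ner_fn` is any callable that returns entity dictionaries with `word`, `entity_group`, `start`, and `end` keys, which is the shape of aggregated Hugging Face token-classification output; `toy_ner`, the chunk size, and the overlap are hypothetical stand-ins so the example runs offline.

```python
from typing import Callable, Dict, List

Entity = Dict[str, object]


def extract_entities(
    text: str,
    ner_fn: Callable[[str], List[Entity]],
    size: int = 50,
    overlap: int = 10,
) -> List[Entity]:
    """Run ner_fn over overlapping word chunks and drop duplicate detections."""
    words = text.split()
    if not words:
        return []

    # Character offset of each word in the single-space-joined text.
    offsets, pos = [], 0
    for word in words:
        offsets.append(pos)
        pos += len(word) + 1

    seen, merged = set(), []
    step = size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        chunk = " ".join(words[start:start + size])
        base = offsets[start]
        for ent in ner_fn(chunk):
            # Shift chunk-local offsets to global ones so that the same entity
            # found in two overlapping chunks maps to the same key.
            key = (base + ent["start"], base + ent["end"], ent["entity_group"])
            if key not in seen:
                seen.add(key)
                merged.append(
                    {"word": ent["word"], "entity_group": key[2],
                     "start": key[0], "end": key[1]}
                )
    return merged


def toy_ner(chunk: str) -> List[Entity]:
    """Hypothetical gazetteer tagger standing in for the real NER pipeline."""
    gazetteer = {"مصر": "LOC", "السعودية": "LOC"}
    found, pos = [], 0
    for word in chunk.split(" "):
        if word in gazetteer:
            found.append({"word": word, "entity_group": gazetteer[word],
                          "start": pos, "end": pos + len(word)})
        pos += len(word) + 1
    return found


entities = extract_entities("a b مصر c d السعودية e", toy_ner, size=4, overlap=2)
print([e["word"] for e in entities])
```

This sketch deduplicates by exact span only; a multi-word entity cut at a chunk boundary would still need extra merging logic.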
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 88.75% | 92.21% | 84.8% | 88.37% |
| Random Forest | 86.5% | 93% | 79% | 85.5% |
| Neural Network | 89% | 93.82% | 83.74% | 88.5% |
| BERT | 80% | 80.9% | 80% | 79.85% |
| DistilBERT | 55.7% | 64% | 55.7% | 47.9% |
| AraBERT | 82.1% | 84% | 82% | 81.8% |
| LLAMA2 | 70.7% | 72.36% | 70.7% | 70% |
| GPT-3.5 | 87% | 88.2% | 87.1% | 87% |
| GPT-4 | 92.85% | 93.42% | 92.85% | 92.83% |
- Best Performing Model: GPT-4 achieved the highest accuracy at 92.85%
- Top Traditional ML Model: Neural Network demonstrated the highest accuracy among traditional models at 89%
- Strong Performers: Logistic Regression (88.75%) and GPT-3.5 (87%) showed robust performance
- Arabic-Specific Models: AraBERT (82.1%) outperformed generic BERT (80%) for Arabic text
- Lowest Performer: DistilBERT showed limited effectiveness for Arabic sentiment analysis at 55.7%
The NER analysis revealed:
- Entity Distribution: Most extracted entities were associated with negative sentiments due to the COVID-19 context
- Most Common Entity: "كورونا" (Corona) appeared most frequently in the dataset
- Entity Categories: Successfully identified locations, organizations, and individuals mentioned in tweets
- Sentiment Association: Clear distinction between entities associated with positive and negative sentiments
WordCloud visualization demonstrated the frequency and prominence of recognized entities, with font size corresponding to occurrence frequency in the dataset.
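The frequencies behind such a word cloud can be tallied per sentiment class with a `Counter`. The records below are hypothetical examples, and `entity_frequencies` is an illustrative helper rather than a function from the notebook.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def entity_frequencies(
    records: Iterable[Tuple[int, List[str]]]
) -> Dict[int, Counter]:
    """Count entity occurrences separately for each sentiment label."""
    by_sentiment: Dict[int, Counter] = {}
    for label, entities in records:
        by_sentiment.setdefault(label, Counter()).update(entities)
    return by_sentiment


# Hypothetical (label, entities) pairs; 0 = negative, 2 = positive as in the dataset.
records = [
    (0, ["كورونا", "مصر"]),
    (0, ["كورونا"]),
    (2, ["كورونا"]),
]
freqs = entity_frequencies(records)
print(freqs[0].most_common(1))
```

Each resulting `Counter` can be passed to `WordCloud.generate_from_frequencies`, which sizes words by their counts.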
This comprehensive study evaluated sentiment analysis and Named Entity Recognition on Arabic tweets related to the coronavirus pandemic. Key achievements include:
- Model Performance: GPT-4 achieved the highest accuracy overall (92.85%); among the traditional models, the Neural Network led at 89%.
- Arabic NLP Capabilities: Successfully demonstrated effective sentiment classification and entity extraction from Arabic social media text.
- Practical Applications: Results provide valuable insights for policymakers, healthcare professionals, and researchers addressing the COVID-19 pandemic in the Arab region.
- Future Directions: Further research should focus on fine-tuning algorithms, incorporating dialectal variations, and expanding analysis scope to include regional differences.
The study underscores the importance of leveraging advanced NLP techniques for analyzing Arabic social media data, contributing to informed decision-making and effective crisis management strategies.
- Python 3.8 or higher
- pip package manager
```bash
pip install pandas numpy matplotlib seaborn nltk wordcloud
pip install transformers torch
pip install scikit-learn prettytable
pip install openai
```
For NLTK, download required data:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```
- Clone the repository:
```bash
git clone https://github.com/AnnasMustafaDev/NLP_Arabic_Sentiment_Analysis_Experimentation.git
cd NLP_Arabic_Sentiment_Analysis_Experimentation
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Open the Jupyter notebook:
```bash
jupyter notebook arabic-sentiment-analysis.ipynb
```
- Execute cells sequentially to reproduce the analysis
```python
# Example: Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# X_train, X_test, and y_train are assumed to come from the 70/30 split
# of the preprocessed tweets described above.

# Vectorize text
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_vec, y_train)

# Predict
predictions = lr_model.predict(X_test_vec)
```
```
NLP_Arabic_Sentiment_Analysis_Experimentation/
├── arabic-sentiment-analysis.ipynb # Main analysis notebook
├── README.md # This file
├── requirements.txt # Python dependencies
├── data/ # Dataset directory
│ └── arabic_corona_tweets.csv # Arabic tweets dataset
├── models/ # Saved model files
├── results/ # Output results and visualizations
└── utils/ # Helper functions and utilities
```
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Antoun, W., et al. (2020). AraBERT: Transformer-based Model for Arabic Language Understanding.
- Sanh, V., et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
- OpenAI. (2023). GPT-4 Technical Report.
- Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or collaboration opportunities, please contact:
- GitHub: @AnnasMustafaDev
- Repository: NLP_Arabic_Sentiment_Analysis_Experimentation
This research was conducted as part of advanced natural language processing experimentation for Arabic text analysis. We acknowledge the contributions of the open-source community and the developers of the language models and libraries used in this project.