YsK-dev/AutoScraperNewsRss

📰 Turkish News RAG Assistant

A comprehensive Retrieval-Augmented Generation (RAG) system for Turkish news aggregation, analysis, and intelligent question-answering. This system collects news from 200+ RSS sources, processes them with AI, and provides an intelligent interface for news search and fact-checking.

🌟 Features

  • 📊 Multi-Source Data Collection: 200+ curated Turkish RSS feeds across 9 categories
  • 🤖 Intelligent Q&A: Ask questions about Turkish news in natural language
  • 📋 Daily Summaries: Auto-generated summaries of important daily news
  • 🔍 Smart Search: Vector-based semantic search across all collected news
  • ✅ Fact Checking: Integrated fact-checking tools with risk assessment
  • 🌐 Web Interface: Beautiful Streamlit-based user interface
  • ⚡ Real-time Processing: Continuous news collection and processing

📦 Project Structure

/news_rag/
├── main.py              # Main pipeline orchestrator
├── config.py            # Configuration and RSS feed definitions
├── rss_fetcher.py       # RSS feed collection
├── newsapi_fetcher.py   # NewsAPI integration
├── scraper.py           # Full content web scraping
├── cleaner.py           # Content cleaning and preprocessing
├── chunker.py           # Text chunking for embeddings
├── embedder.py          # Text embedding generation
├── db.py                # ChromaDB vector database
├── rag_engine.py        # RAG question-answering engine
├── validator.py         # Fact-checking and validation
├── ui.py                # Streamlit web interface
├── requirements.txt     # Python dependencies
└── README.md           # This file

🚀 Quick Start

1. Clone and Setup

git clone <your-repo>
cd AutoScraperNewsRss

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On macOS/Linux
# .venv\Scripts\activate   # On Windows

# Install dependencies
pip install -r requirements.txt

2. Environment Configuration

Create a .env file in the project root:

# OpenAI API Key (for GPT responses)
OPENAI_API_KEY=your_openai_api_key_here

# NewsAPI Key (optional, for additional news sources)
NEWSAPI_KEY=your_newsapi_key_here
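Loading these variables can be sketched with the standard library alone (the project may well use python-dotenv instead; `load_env` here is a hypothetical helper, not part of the codebase):

```python
import os

def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file into os.environ (stdlib-only)."""
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue  # skip blanks, comments, and malformed lines
                key, _, value = line.partition("=")
                # setdefault: values already exported in the shell win
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # a missing .env is fine; keys may come from the environment
```

After calling `load_env()`, code can read `os.environ.get("OPENAI_API_KEY")` as usual.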

3. Run the Data Collection Pipeline

# Test mode (limited data for testing)
python main.py --test-mode

# Full pipeline (all categories)
python main.py

# Specific categories only
python main.py --categories Teknoloji Ekonomi_Finans

# Without content scraping (faster)
python main.py --no-scraping

# Custom limits
python main.py --max-articles 10 --categories Bilim

4. Launch the Web Interface

streamlit run ui.py

Open your browser to http://localhost:8501

📊 Data Sources

RSS Feed Categories (200+ sources):

  • 🔬 Bilim (Science): 21 sources - Evrim Ağacı, Bilim Günlüğü, Popular Science TR, etc.
  • 💻 Teknoloji (Technology): 20 sources - Teknoblog, Webrazzi, Shiftdelete, etc.
  • 🎮 Eğlence (Entertainment): 18 sources - Webtekno, Geekyapar, ListeList, etc.
  • 🤔 Felsefe (Philosophy): 10 sources - Çekiçle Felsefe, Manifold, Terrabayt, etc.
  • ⚽ Spor (Sports): 18 sources - NTV Spor, A Spor, Fotomaç, etc.
  • 📰 Gündem (Current Affairs): 45 sources - BBC Türkçe, Hürriyet, CNN Türk, etc.
  • 💰 Ekonomi ve Finans (Economy): 15 sources - Investing.com TR, Dünya, etc.
  • 🏢 İş Dünyası (Business): 9 sources - İşin Detayı categories
  • 🎨 Yaşam ve Kültür (Lifestyle): 15 sources - Culture, health, lifestyle content

Additional Sources:

  • NewsAPI: Dynamic Turkish news with language=tr filter
  • Web Scraping: Full article content extraction when RSS is incomplete
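The fields a fetcher pulls from each feed item can be sketched as follows. The project reportedly uses feedparser; this stdlib-only version parses a literal RSS 2.0 snippet so the idea is self-contained (the sample feed and the dict keys are illustrative assumptions):

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Ornek Kaynak</title>
  <item>
    <title>Yapay zeka haberleri</title>
    <link>https://example.com/haber/1</link>
    <description>Kisa ozet...</description>
    <pubDate>Mon, 01 Jan 2024 09:00:00 +0300</pubDate>
  </item>
</channel></rss>"""

def parse_rss(xml_text: str, category: str) -> list[dict]:
    """Extract title/link/summary/date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
            "published": item.findtext("pubDate", default=""),
            "category": category,
        })
    return articles
```

When the `<description>` is truncated or empty, the scraping stage fetches the full article body from the `url`.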

🛠️ Technical Architecture

Core Components:

  1. Data Collection Layer

    • RSS feed parsing with feedparser
    • NewsAPI integration
    • Robust web scraping with fallback strategies
  2. Content Processing Pipeline

    • HTML cleaning and noise removal
    • Turkish-specific text preprocessing
    • Smart text chunking (300-500 tokens)
  3. AI/ML Layer

    • Multilingual sentence embeddings
    • ChromaDB vector storage
    • OpenAI GPT integration for responses
  4. RAG Engine

    • Semantic similarity search
    • Context-aware response generation
    • Source attribution and confidence scoring
  5. Fact-Checking Module

    • Risk pattern detection
    • Google Fact Check Explorer integration
    • Turkish fact-checking resources
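The chunking step in the pipeline above can be sketched with a sliding word window; word count is only a rough proxy for tokens, and the real chunker.py may use a tokenizer and different overlap:

```python
def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows of at most max_words.

    Overlap keeps sentences that straddle a boundary retrievable from
    either neighbouring chunk.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text] if words else []
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

With the defaults this mirrors the CHUNK_SIZE = 400 setting from config.py.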

💡 Usage Examples

Question Answering

from rag_engine import RAGEngine

rag = RAGEngine()

# Ask about current events
result = rag.answer_question("Bugün dolar ne kadar?")
print(result['answer'])

# Category-specific queries
result = rag.answer_question(
    "Yapay zeka alanında son gelişmeler neler?", 
    category="Teknoloji"
)

Daily Summary Generation

# Generate summary of last 24 hours
summary = rag.generate_daily_summary(hours=24)
print(summary)

Fact Checking

from validator import FactChecker

checker = FactChecker()
result = checker.fact_check_article(article_data)
print(result['overall_recommendation'])

🔧 Configuration

Key Settings in config.py:

# Embedding Configuration
EMBEDDING_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
CHUNK_SIZE = 400
MAX_TOKENS_PER_CHUNK = 500

# LLM Configuration  
DEFAULT_LLM_MODEL = "gpt-3.5-turbo"
MAX_TOKENS = 1000

# Database
CHROMA_DB_PATH = "./chroma_db"

📱 Web Interface Features

Main Tabs:

  1. 🔍 Soru Sor (Ask Questions)

    • Natural language question input
    • Category and source filtering
    • Confidence scoring and source attribution
  2. 📋 Günlük Özet (Daily Summary)

    • Auto-generated news summaries
    • Time range selection (6h, 12h, 24h, 48h)
    • Downloadable markdown reports
  3. 🔎 Haber Ara (News Search)

    • Semantic search across all articles
    • Relevance scoring
    • Direct links to original articles
  4. ✅ Doğrulama (Fact Checking)

    • Text and URL validation
    • Risk assessment indicators
    • Fact-checking resource recommendations
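Under the hood, the semantic search behind the Haber Ara tab ranks stored embeddings by cosine similarity to the query vector. A minimal sketch, assuming the index is a list of (article_id, embedding) pairs (in the real system ChromaDB handles this):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec: list[float], index: list[tuple], top_k: int = 3) -> list[tuple]:
    """Return the top_k (article_id, score) pairs, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The relevance scores shown in the UI would correspond to these similarity values.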

🚦 Pipeline Execution

Command Line Options:

# Basic execution
python main.py

# Advanced options
python main.py \
  --categories Teknoloji Bilim \
  --max-articles 50 \
  --no-scraping \
  --test-mode

Pipeline Stages:

  1. RSS Collection: Fetch from all configured feeds
  2. NewsAPI Integration: Additional dynamic content
  3. Content Scraping: Full article extraction
  4. Content Cleaning: Remove noise, normalize text
  5. Text Chunking: Split into optimal sizes
  6. Embedding Generation: Create vector representations
  7. Database Indexing: Store in ChromaDB
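The stages above run as a linear pass over the collected articles. A skeleton of that flow, with trivial stand-ins for the real modules (rss_fetcher, scraper, cleaner, chunker, embedder, db) and a per-stage count like the pipeline's execution reports:

```python
def run_pipeline(articles: list[dict]) -> dict:
    """Run stages in order, recording how many items survive each one.

    Each lambda is a placeholder for the corresponding module; the real
    embedding and ChromaDB indexing steps would follow the same pattern.
    """
    stages = [
        ("scrape", lambda arts: [dict(a, content=a.get("content", a["summary"])) for a in arts]),
        ("clean",  lambda arts: [dict(a, content=" ".join(a["content"].split())) for a in arts]),
        ("chunk",  lambda arts: [dict(a, chunks=[a["content"]]) for a in arts if a["content"]]),
    ]
    report = {"input": len(articles)}
    for name, stage in stages:
        articles = stage(articles)
        report[name] = len(articles)  # stats for the execution report
    report["items"] = articles
    return report
```

Articles with no recoverable content simply drop out at the chunking stage rather than aborting the run, matching the graceful-degradation notes below.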

🔍 Monitoring and Logging

  • Comprehensive logging to news_pipeline.log
  • Pipeline execution reports with statistics
  • Database statistics and health checks
  • Error tracking and recovery

⚠️ Important Notes

Rate Limiting:

  • Built-in delays between requests
  • Respectful scraping practices
  • Server-friendly request patterns

Error Handling:

  • Graceful degradation on source failures
  • Retry mechanisms for transient errors
  • Fallback content extraction methods
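The retry behaviour for transient errors can be sketched as exponential backoff around any fetch callable (this is an illustration of the pattern, not the project's actual code):

```python
import time

def fetch_with_retry(fetch, retries: int = 3, base_delay: float = 1.0):
    """Call fetch(); on failure, sleep base_delay * 2**attempt and retry.

    Raises the last error once all attempts are exhausted.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:  # in practice, catch narrower network errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_error
```

A failed RSS source then costs a few delayed retries instead of failing the whole run.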

Performance:

  • Batch processing for efficiency
  • Caching and deduplication
  • Incremental updates support
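Deduplication across runs can be sketched by keying articles on a hash of their URL; persisting the `seen` set between runs is what makes incremental updates cheap (hypothetical helper, shown only to illustrate the idea):

```python
import hashlib

def dedupe(articles: list[dict], seen: set = None) -> list[dict]:
    """Drop articles whose URL hash has already been indexed.

    Pass the same `seen` set across runs to skip previously stored items.
    """
    seen = set() if seen is None else seen
    fresh = []
    for article in articles:
        key = hashlib.sha256(article["url"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            fresh.append(article)
    return fresh
```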

🛡️ Fact-Checking Integration

Supported Resources:

  • Google Fact Check Explorer
  • Teyit.org (Turkish fact-checking)
  • DoğruHaber (News verification)
  • AFP Fact Check (International)

Risk Assessment:

  • Linguistic pattern analysis
  • Source credibility evaluation
  • Misinformation indicator detection
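The linguistic pattern analysis can be sketched as regex indicators plus a coarse risk level from the match count. The patterns below are illustrative examples only; validator.py defines its own list:

```python
import re

# Illustrative Turkish risk indicators; the real validator uses its own set.
RISK_PATTERNS = [
    (re.compile(r"(?i)\bşok\b"), "sensational wording"),
    (re.compile(r"!{2,}"), "excessive punctuation"),
    (re.compile(r"(?i)kimse\s+söylemiyor"), "conspiracy framing"),
]

def assess_risk(text: str) -> dict:
    """Return matched indicators and a coarse level based on their count."""
    hits = [label for pattern, label in RISK_PATTERNS if pattern.search(text)]
    level = "high" if len(hits) >= 2 else "medium" if hits else "low"
    return {"indicators": hits, "risk_level": level}
```

Flagged texts would then be routed to the fact-checking resources listed above rather than declared false outright.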

📋 Requirements

Minimum System Requirements:

  • Python 3.8+
  • 4GB RAM (8GB recommended)
  • 2GB free disk space
  • Internet connection

API Keys (Optional but Recommended):

  • OpenAI API: For advanced language model responses
  • NewsAPI: For additional news source coverage

🐛 Troubleshooting

Common Issues:

  1. ChromaDB Installation:

    pip install --upgrade chromadb
  2. SSL Certificate Errors:

    pip install --upgrade certifi
  3. Memory Issues: Reduce --max-articles or use --no-scraping

  4. Rate Limiting: Increase delays in scraper configuration

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

For issues, questions, or contributions:

  • Create an issue in the repository
  • Check the troubleshooting section
  • Review the logs in news_pipeline.log

🚀 Future Enhancements

  • Multi-language support expansion
  • Real-time news alerts
  • Social media integration
  • Advanced analytics dashboard
  • Mobile application
  • API endpoints for integration
  • Machine learning bias detection
  • Automated fact-checking workflows

Built with ❤️ for the Turkish news ecosystem
