A comprehensive Retrieval-Augmented Generation (RAG) system for Turkish news aggregation, analysis, and intelligent question-answering. This system collects news from 200+ RSS sources, processes them with AI, and provides an intelligent interface for news search and fact-checking.
- 📊 Multi-Source Data Collection: 200+ curated Turkish RSS feeds across 9 categories
- 🤖 Intelligent Q&A: Ask questions about Turkish news in natural language
- 📋 Daily Summaries: Auto-generated summaries of important daily news
- 🔍 Smart Search: Vector-based semantic search across all collected news
- ✅ Fact Checking: Integrated fact-checking tools with risk assessment
- 🌐 Web Interface: Beautiful Streamlit-based user interface
- ⚡ Real-time Processing: Continuous news collection and processing
/news_rag/
├── main.py # Main pipeline orchestrator
├── config.py # Configuration and RSS feed definitions
├── rss_fetcher.py # RSS feed collection
├── newsapi_fetcher.py # NewsAPI integration
├── scraper.py # Full content web scraping
├── cleaner.py # Content cleaning and preprocessing
├── chunker.py # Text chunking for embeddings
├── embedder.py # Text embedding generation
├── db.py # ChromaDB vector database
├── rag_engine.py # RAG question-answering engine
├── validator.py # Fact-checking and validation
├── ui.py # Streamlit web interface
├── requirements.txt # Python dependencies
└── README.md # This file
git clone <your-repo>
cd AutoScraperNewsRss
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On macOS/Linux
# .venv\Scripts\activate # On Windows
# Install dependencies
pip install -r requirements.txt

Create a .env file in the project root:
# OpenAI API Key (for GPT responses)
OPENAI_API_KEY=your_openai_api_key_here
# NewsAPI Key (optional, for additional news sources)
NEWSAPI_KEY=your_newsapi_key_here

# Test mode (limited data for testing)
python main.py --test-mode
# Full pipeline (all categories)
python main.py
# Specific categories only
python main.py --categories Teknoloji Ekonomi_Finans
# Without content scraping (faster)
python main.py --no-scraping
# Custom limits
python main.py --max-articles 10 --categories Bilim

streamlit run ui.py

Open your browser to http://localhost:8501
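The flags used in the commands above could be parsed along the following lines. This is only a minimal sketch: the flag names come from the usage examples, but the defaults, help texts, and the `build_parser` helper are assumptions, and the actual `main.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Turkish news RAG pipeline")
    parser.add_argument("--test-mode", action="store_true",
                        help="Run on a small sample of data")
    parser.add_argument("--categories", nargs="+", default=None,
                        help="Category names to process (assumed default: all)")
    parser.add_argument("--no-scraping", action="store_true",
                        help="Skip full-content scraping")
    parser.add_argument("--max-articles", type=int, default=None,
                        help="Upper bound on collected articles (assumed semantics)")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--max-articles", "10", "--categories", "Bilim"])
print(args.categories, args.max_articles, args.no_scraping)
```

Using `nargs="+"` lets several category names follow a single `--categories` flag, which matches the `--categories Teknoloji Ekonomi_Finans` form shown above.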
- 🔬 Bilim (Science): 21 sources - Evrim Ağacı, Bilim Günlüğü, Popular Science TR, etc.
- 💻 Teknoloji (Technology): 20 sources - Teknoblog, Webrazzi, Shiftdelete, etc.
- 🎮 Eğlence (Entertainment): 18 sources - Webtekno, Geekyapar, ListeList, etc.
- 🤔 Felsefe (Philosophy): 10 sources - Çekiçle Felsefe, Manifold, Terrabayt, etc.
- ⚽ Spor (Sports): 18 sources - NTV Spor, A Spor, Fotomaç, etc.
- 📰 Gündem (Current Affairs): 45 sources - BBC Türkçe, Hürriyet, CNN Türk, etc.
- 💰 Ekonomi ve Finans (Economy): 15 sources - Investing.com TR, Dünya, etc.
- 🏢 İş Dünyası (Business): 9 sources - İşin Detayı categories
- 🎨 Yaşam ve Kültür (Lifestyle): 15 sources - Culture, health, lifestyle content
- NewsAPI: Dynamic Turkish news with the language=tr filter
- Web Scraping: Full article content extraction when RSS is incomplete
- Data Collection Layer
  - RSS feed parsing with feedparser
  - NewsAPI integration
  - Robust web scraping with fallback strategies
- Content Processing Pipeline
  - HTML cleaning and noise removal
  - Turkish-specific text preprocessing
  - Smart text chunking (300-500 tokens)
- AI/ML Layer
  - Multilingual sentence embeddings
  - ChromaDB vector storage
  - OpenAI GPT integration for responses
- RAG Engine
  - Semantic similarity search
  - Context-aware response generation
  - Source attribution and confidence scoring
- Fact-Checking Module
  - Risk pattern detection
  - Google Fact Check Explorer integration
  - Turkish fact-checking resources
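To make the semantic similarity search step concrete, here is a small standard-library sketch of ranking chunks by cosine similarity. The three-dimensional vectors and chunk IDs are invented toy data, and `cosine_similarity` and `top_k` are hypothetical helpers; in the project itself this lookup is handled by ChromaDB over real sentence embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunks: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunk IDs most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" standing in for real model output.
toy_chunks = {
    "economy-1": [0.9, 0.1, 0.0],
    "sports-1": [0.0, 0.2, 0.9],
    "tech-1": [0.1, 0.9, 0.1],
}
results = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
print([cid for cid, _ in results])
```

The retrieved chunks would then be passed to the language model as context, which is what makes source attribution possible: each answer can cite the chunk IDs it was built from.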
from rag_engine import RAGEngine
rag = RAGEngine()
# Ask about current events
result = rag.answer_question("Bugün dolar ne kadar?")
print(result['answer'])
# Category-specific queries
result = rag.answer_question(
"Yapay zeka alanında son gelişmeler neler?",
category="Teknoloji"
)

# Generate summary of last 24 hours
summary = rag.generate_daily_summary(hours=24)
print(summary)

from validator import FactChecker
checker = FactChecker()
result = checker.fact_check_article(article_data)
print(result['overall_recommendation'])

# Embedding Configuration
EMBEDDING_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
CHUNK_SIZE = 400
MAX_TOKENS_PER_CHUNK = 500
# LLM Configuration
DEFAULT_LLM_MODEL = "gpt-3.5-turbo"
MAX_TOKENS = 1000
# Database
CHROMA_DB_PATH = "./chroma_db"
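As an illustration of how the two chunking settings above might interact, the sketch below splits text into chunks of roughly CHUNK_SIZE tokens and merges a short remainder into the previous chunk when the result stays within MAX_TOKENS_PER_CHUNK. Whitespace-separated words stand in for real tokens and `chunk_text` is a hypothetical helper, so the project's own chunker.py may behave differently.

```python
from typing import List

CHUNK_SIZE = 400            # target tokens per chunk (value from the configuration)
MAX_TOKENS_PER_CHUNK = 500  # hard upper bound (value from the configuration)


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE,
               max_tokens: int = MAX_TOKENS_PER_CHUNK) -> List[str]:
    """Split text into chunks of about chunk_size whitespace tokens.

    A short trailing remainder is merged into the previous chunk when the
    merged chunk still fits within max_tokens.
    """
    tokens = text.split()
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(chunks) > 1 and len(chunks[-2]) + len(chunks[-1]) <= max_tokens:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]


# An 850-word sample: 400 + 400 + 50, with the last 50 merged into the second chunk.
sample = " ".join(f"w{i}" for i in range(850))
pieces = chunk_text(sample)
print([len(piece.split()) for piece in pieces])  # prints [400, 450]
```

Merging the remainder avoids embedding very short fragments, which tend to carry too little context to retrieve well.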
- 🔍 Soru Sor (Ask Questions)
  - Natural language question input
  - Category and source filtering
  - Confidence scoring and source attribution
- 📋 Günlük Özet (Daily Summary)
  - Auto-generated news summaries
  - Time range selection (6h, 12h, 24h, 48h)
  - Downloadable markdown reports
- 🔎 Haber Ara (News Search)
  - Semantic search across all articles
  - Relevance scoring
  - Direct links to original articles
- ✅ Doğrulama (Fact Checking)
  - Text and URL validation
  - Risk assessment indicators
  - Fact-checking resource recommendations
# Basic execution
python main.py
# Advanced options
python main.py \
--categories Teknoloji Bilim \
--max-articles 50 \
--no-scraping \
--test-mode

- RSS Collection: Fetch from all configured feeds
- NewsAPI Integration: Additional dynamic content
- Content Scraping: Full article extraction
- Content Cleaning: Remove noise, normalize text
- Text Chunking: Split into optimal sizes
- Embedding Generation: Create vector representations
- Database Indexing: Store in ChromaDB
- Comprehensive logging to news_pipeline.log
- Pipeline execution reports with statistics
- Database statistics and health checks
- Error tracking and recovery
- Built-in delays between requests
- Respectful scraping practices
- Server-friendly request patterns
- Graceful degradation on source failures
- Retry mechanisms for transient errors
- Fallback content extraction methods
- Batch processing for efficiency
- Caching and deduplication
- Incremental updates support
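The retry and graceful-degradation behavior listed above can be sketched as a small wrapper with exponential backoff. This is an illustrative pattern, not the project's code: `with_retries`, its parameters, and the fake fetcher are all hypothetical, and the sleep function is injectable only so the demo runs instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: the caller can skip this source
            sleep(base_delay * (2 ** attempt))  # waits 1x, 2x, 4x, ... base_delay
    raise ValueError("attempts must be at least 1")


# Demo with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<rss/>"


delays: List[float] = []
result = with_retries(flaky_fetch, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # prints <rss/> [0.5, 1.0]
```

A caller that wraps each feed in its own try/except around `with_retries` gets graceful degradation for free: one failing source is logged and skipped while the rest of the run continues.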
- Google Fact Check Explorer
- Teyit.org (Turkish fact-checking)
- DoğruHaber (News verification)
- AFP Fact Check (International)
- Linguistic pattern analysis
- Source credibility evaluation
- Misinformation indicator detection
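A linguistic risk check of this kind can be approximated with a few regular expressions. The patterns, the level thresholds, and the `assess_risk` helper below are invented for illustration and are not taken from validator.py; a real checker would need a much richer and better validated set of indicators.

```python
import re
from typing import Dict, List

# Hypothetical indicators, chosen for illustration only.
RISK_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "repeated_punctuation": re.compile(r"[!?]{2,}"),
    # Sensational Turkish words: "şok" (shock), "skandal" (scandal), "flaş" (breaking).
    "sensational_word": re.compile(r"\b(şok|skandal|flaş)\b", re.IGNORECASE),
    # "iddiaya göre" / "iddialara göre": "according to the claim(s)".
    "unnamed_source": re.compile(r"\biddia(ya|lara) göre\b", re.IGNORECASE),
}


def assess_risk(text: str) -> Dict[str, object]:
    """Return the matched indicator names and a coarse risk level."""
    flags: List[str] = [name for name, pattern in RISK_PATTERNS.items()
                        if pattern.search(text)]
    if not flags:
        level = "low"
    elif len(flags) == 1:
        level = "medium"
    else:
        level = "high"
    return {"flags": flags, "level": level}


risky = assess_risk("Flaş iddia!! Herkes bunu konuşuyor")
calm = assess_risk("Merkez Bankası faiz kararını açıkladı.")
print(risky["level"], risky["flags"])
print(calm["level"], calm["flags"])
```

Pattern matches are only a signal that a text deserves a closer look; the verdict itself should come from the fact-checking resources listed above.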
- Python 3.8+
- 4GB RAM (8GB recommended)
- 2GB free disk space
- Internet connection
- OpenAI API: For advanced language model responses
- NewsAPI: For additional news source coverage
- ChromaDB Installation: pip install --upgrade chromadb
- SSL Certificate Errors: pip install --upgrade certifi
- Memory Issues: Reduce max_articles or use --no-scraping
- Rate Limiting: Increase delays in scraper configuration
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or contributions:
- Create an issue in the repository
- Check the troubleshooting section
- Review the logs in news_pipeline.log
- Multi-language support expansion
- Real-time news alerts
- Social media integration
- Advanced analytics dashboard
- Mobile application
- API endpoints for integration
- Machine learning bias detection
- Automated fact-checking workflows
Built with ❤️ for the Turkish news ecosystem