YsK-dev/AutoScraperNewsRss

📰 Turkish News RAG Assistant

A comprehensive Retrieval-Augmented Generation (RAG) system for Turkish news aggregation, analysis, and intelligent question-answering. This system collects news from 200+ RSS sources, processes them with AI, and provides an intelligent interface for news search and fact-checking.

🌟 Features

  • 📊 Multi-Source Data Collection: 200+ curated Turkish RSS feeds across 9 categories
  • 🤖 Intelligent Q&A: Ask questions about Turkish news in natural language
  • 📋 Daily Summaries: Auto-generated summaries of important daily news
  • 🔍 Smart Search: Vector-based semantic search across all collected news
  • ✅ Fact Checking: Integrated fact-checking tools with risk assessment
  • 🌐 Web Interface: Beautiful Streamlit-based user interface
  • ⚡ Real-time Processing: Continuous news collection and processing

📦 Project Structure

/news_rag/
├── main.py              # Main pipeline orchestrator
├── config.py            # Configuration and RSS feed definitions
├── rss_fetcher.py       # RSS feed collection
├── newsapi_fetcher.py   # NewsAPI integration
├── scraper.py           # Full content web scraping
├── cleaner.py           # Content cleaning and preprocessing
├── chunker.py           # Text chunking for embeddings
├── embedder.py          # Text embedding generation
├── db.py                # ChromaDB vector database
├── rag_engine.py        # RAG question-answering engine
├── validator.py         # Fact-checking and validation
├── ui.py                # Streamlit web interface
├── requirements.txt     # Python dependencies
└── README.md           # This file

🚀 Quick Start

1. Clone and Setup

git clone <your-repo>
cd AutoScraperNewsRss

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On macOS/Linux
# .venv\Scripts\activate   # On Windows

# Install dependencies
pip install -r requirements.txt

2. Environment Configuration

Create a .env file in the project root:

# OpenAI API Key (for GPT responses)
OPENAI_API_KEY=your_openai_api_key_here

# NewsAPI Key (optional, for additional news sources)
NEWSAPI_KEY=your_newsapi_key_here
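Loading these variables can be sketched with the standard library alone (the project may well use python-dotenv instead; `load_env` here is a hypothetical helper, not part of the codebase):

```python
import os

def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file into os.environ (stdlib-only)."""
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue  # skip blanks, comments, and malformed lines
                key, _, value = line.partition("=")
                # setdefault: values already exported in the shell win
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # a missing .env is fine; keys may come from the environment
```

After calling `load_env()`, code can read `os.environ.get("OPENAI_API_KEY")` as usual.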

3. Run the Data Collection Pipeline

# Test mode (limited data for testing)
python main.py --test-mode

# Full pipeline (all categories)
python main.py

# Specific categories only
python main.py --categories Teknoloji Ekonomi_Finans

# Without content scraping (faster)
python main.py --no-scraping

# Custom limits
python main.py --max-articles 10 --categories Bilim

4. Launch the Web Interface

streamlit run ui.py

Open your browser to http://localhost:8501

📊 Data Sources

RSS Feed Categories (200+ sources):

  • 🔬 Bilim (Science): 21 sources - Evrim Ağacı, Bilim Günlüğü, Popular Science TR, etc.
  • 💻 Teknoloji (Technology): 20 sources - Teknoblog, Webrazzi, Shiftdelete, etc.
  • 🎮 Eğlence (Entertainment): 18 sources - Webtekno, Geekyapar, ListeList, etc.
  • 🤔 Felsefe (Philosophy): 10 sources - Çekiçle Felsefe, Manifold, Terrabayt, etc.
  • ⚽ Spor (Sports): 18 sources - NTV Spor, A Spor, Fotomaç, etc.
  • 📰 Gündem (Current Affairs): 45 sources - BBC Türkçe, Hürriyet, CNN Türk, etc.
  • 💰 Ekonomi ve Finans (Economy): 15 sources - Investing.com TR, Dünya, etc.
  • 🏢 İş Dünyası (Business): 9 sources - İşin Detayı categories
  • 🎨 Yaşam ve Kültür (Lifestyle): 15 sources - Culture, health, lifestyle content

Additional Sources:

  • NewsAPI: Dynamic Turkish news with language=tr filter
  • Web Scraping: Full article content extraction when RSS is incomplete
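The fields a fetcher pulls from each feed item can be sketched as follows. The project reportedly uses feedparser; this stdlib-only version parses a literal RSS 2.0 snippet so the idea is self-contained (the sample feed and the dict keys are illustrative assumptions):

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Ornek Kaynak</title>
  <item>
    <title>Yapay zeka haberleri</title>
    <link>https://example.com/haber/1</link>
    <description>Kisa ozet...</description>
    <pubDate>Mon, 01 Jan 2024 09:00:00 +0300</pubDate>
  </item>
</channel></rss>"""

def parse_rss(xml_text: str, category: str) -> list[dict]:
    """Extract title/link/summary/date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
            "published": item.findtext("pubDate", default=""),
            "category": category,
        })
    return articles
```

When the `<description>` is truncated or empty, the scraping stage fetches the full article body from the `url`.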

🛠️ Technical Architecture

Core Components:

  1. Data Collection Layer

    • RSS feed parsing with feedparser
    • NewsAPI integration
    • Robust web scraping with fallback strategies
  2. Content Processing Pipeline

    • HTML cleaning and noise removal
    • Turkish-specific text preprocessing
    • Smart text chunking (300-500 tokens)
  3. AI/ML Layer

    • Multilingual sentence embeddings
    • ChromaDB vector storage
    • OpenAI GPT integration for responses
  4. RAG Engine

    • Semantic similarity search
    • Context-aware response generation
    • Source attribution and confidence scoring
  5. Fact-Checking Module

    • Risk pattern detection
    • Google Fact Check Explorer integration
    • Turkish fact-checking resources
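The chunking step in the pipeline above can be sketched with a sliding word window; word count is only a rough proxy for tokens, and the real chunker.py may use a tokenizer and different overlap:

```python
def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows of at most max_words.

    Overlap keeps sentences that straddle a boundary retrievable from
    either neighbouring chunk.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text] if words else []
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

With the defaults this mirrors the CHUNK_SIZE = 400 setting from config.py.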

💡 Usage Examples

Question Answering

from rag_engine import RAGEngine

rag = RAGEngine()

# Ask about current events
result = rag.answer_question("Bugün dolar ne kadar?")
print(result['answer'])

# Category-specific queries
result = rag.answer_question(
    "Yapay zeka alanında son gelişmeler neler?", 
    category="Teknoloji"
)

Daily Summary Generation

# Generate summary of last 24 hours
summary = rag.generate_daily_summary(hours=24)
print(summary)

Fact Checking

from validator import FactChecker

checker = FactChecker()
result = checker.fact_check_article(article_data)
print(result['overall_recommendation'])

🔧 Configuration

Key Settings in config.py:

# Embedding Configuration
EMBEDDING_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
CHUNK_SIZE = 400
MAX_TOKENS_PER_CHUNK = 500

# LLM Configuration  
DEFAULT_LLM_MODEL = "gpt-3.5-turbo"
MAX_TOKENS = 1000

# Database
CHROMA_DB_PATH = "./chroma_db"

📱 Web Interface Features

Main Tabs:

  1. 🔍 Soru Sor (Ask Questions)

    • Natural language question input
    • Category and source filtering
    • Confidence scoring and source attribution
  2. 📋 Günlük Özet (Daily Summary)

    • Auto-generated news summaries
    • Time range selection (6h, 12h, 24h, 48h)
    • Downloadable markdown reports
  3. 🔎 Haber Ara (News Search)

    • Semantic search across all articles
    • Relevance scoring
    • Direct links to original articles
  4. ✅ Doğrulama (Fact Checking)

    • Text and URL validation
    • Risk assessment indicators
    • Fact-checking resource recommendations
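Under the hood, the semantic search behind the Haber Ara tab ranks stored embeddings by cosine similarity to the query vector. A minimal sketch, assuming the index is a list of (article_id, embedding) pairs (in the real system ChromaDB handles this):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec: list[float], index: list[tuple], top_k: int = 3) -> list[tuple]:
    """Return the top_k (article_id, score) pairs, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The relevance scores shown in the UI would correspond to these similarity values.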

🚦 Pipeline Execution

Command Line Options:

# Basic execution
python main.py

# Advanced options
python main.py \
  --categories Teknoloji Bilim \
  --max-articles 50 \
  --no-scraping \
  --test-mode

Pipeline Stages:

  1. RSS Collection: Fetch from all configured feeds
  2. NewsAPI Integration: Additional dynamic content
  3. Content Scraping: Full article extraction
  4. Content Cleaning: Remove noise, normalize text
  5. Text Chunking: Split into optimal sizes
  6. Embedding Generation: Create vector representations
  7. Database Indexing: Store in ChromaDB
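The stages above run as a linear pass over the collected articles. A skeleton of that flow, with trivial stand-ins for the real modules (rss_fetcher, scraper, cleaner, chunker, embedder, db) and a per-stage count like the pipeline's execution reports:

```python
def run_pipeline(articles: list[dict]) -> dict:
    """Run stages in order, recording how many items survive each one.

    Each lambda is a placeholder for the corresponding module; the real
    embedding and ChromaDB indexing steps would follow the same pattern.
    """
    stages = [
        ("scrape", lambda arts: [dict(a, content=a.get("content", a["summary"])) for a in arts]),
        ("clean",  lambda arts: [dict(a, content=" ".join(a["content"].split())) for a in arts]),
        ("chunk",  lambda arts: [dict(a, chunks=[a["content"]]) for a in arts if a["content"]]),
    ]
    report = {"input": len(articles)}
    for name, stage in stages:
        articles = stage(articles)
        report[name] = len(articles)  # stats for the execution report
    report["items"] = articles
    return report
```

Articles with no recoverable content simply drop out at the chunking stage rather than aborting the run, matching the graceful-degradation notes below.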

🔍 Monitoring and Logging

  • Comprehensive logging to news_pipeline.log
  • Pipeline execution reports with statistics
  • Database statistics and health checks
  • Error tracking and recovery

⚠️ Important Notes

Rate Limiting:

  • Built-in delays between requests
  • Respectful scraping practices
  • Server-friendly request patterns

Error Handling:

  • Graceful degradation on source failures
  • Retry mechanisms for transient errors
  • Fallback content extraction methods
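The retry behaviour for transient errors can be sketched as exponential backoff around any fetch callable (this is an illustration of the pattern, not the project's actual code):

```python
import time

def fetch_with_retry(fetch, retries: int = 3, base_delay: float = 1.0):
    """Call fetch(); on failure, sleep base_delay * 2**attempt and retry.

    Raises the last error once all attempts are exhausted.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:  # in practice, catch narrower network errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_error
```

A failed RSS source then costs a few delayed retries instead of failing the whole run.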

Performance:

  • Batch processing for efficiency
  • Caching and deduplication
  • Incremental updates support
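Deduplication across runs can be sketched by keying articles on a hash of their URL; persisting the `seen` set between runs is what makes incremental updates cheap (hypothetical helper, shown only to illustrate the idea):

```python
import hashlib

def dedupe(articles: list[dict], seen: set = None) -> list[dict]:
    """Drop articles whose URL hash has already been indexed.

    Pass the same `seen` set across runs to skip previously stored items.
    """
    seen = set() if seen is None else seen
    fresh = []
    for article in articles:
        key = hashlib.sha256(article["url"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            fresh.append(article)
    return fresh
```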

🛡️ Fact-Checking Integration

Supported Resources:

  • Google Fact Check Explorer
  • Teyit.org (Turkish fact-checking)
  • DoğruHaber (News verification)
  • AFP Fact Check (International)

Risk Assessment:

  • Linguistic pattern analysis
  • Source credibility evaluation
  • Misinformation indicator detection
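The linguistic pattern analysis can be sketched as regex indicators plus a coarse risk level from the match count. The patterns below are illustrative examples only; validator.py defines its own list:

```python
import re

# Illustrative Turkish risk indicators; the real validator uses its own set.
RISK_PATTERNS = [
    (re.compile(r"(?i)\bşok\b"), "sensational wording"),
    (re.compile(r"!{2,}"), "excessive punctuation"),
    (re.compile(r"(?i)kimse\s+söylemiyor"), "conspiracy framing"),
]

def assess_risk(text: str) -> dict:
    """Return matched indicators and a coarse level based on their count."""
    hits = [label for pattern, label in RISK_PATTERNS if pattern.search(text)]
    level = "high" if len(hits) >= 2 else "medium" if hits else "low"
    return {"indicators": hits, "risk_level": level}
```

Flagged texts would then be routed to the fact-checking resources listed above rather than declared false outright.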

📋 Requirements

Minimum System Requirements:

  • Python 3.8+
  • 4GB RAM (8GB recommended)
  • 2GB free disk space
  • Internet connection

API Keys (Optional but Recommended):

  • OpenAI API: For advanced language model responses
  • NewsAPI: For additional news source coverage

🐛 Troubleshooting

Common Issues:

  1. ChromaDB Installation:

    pip install --upgrade chromadb
  2. SSL Certificate Errors:

    pip install --upgrade certifi
  3. Memory Issues: Reduce --max-articles or use --no-scraping

  4. Rate Limiting: Increase delays in scraper configuration

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

For issues, questions, or contributions:

  • Create an issue in the repository
  • Check the troubleshooting section
  • Review the logs in news_pipeline.log

🚀 Future Enhancements

  • Multi-language support expansion
  • Real-time news alerts
  • Social media integration
  • Advanced analytics dashboard
  • Mobile application
  • API endpoints for integration
  • Machine learning bias detection
  • Automated fact-checking workflows

Built with ❤️ for the Turkish news ecosystem
