Threads Scraper

The original repo shipped without the core scraping code, and I didn't want to wait for it, so I built my own version here.

Currently usable features

  • Scrape Threads posts by username (online or offline).
  • Offline mode with deterministic sample data for testing.
  • Online mode that tries a fast HTTP parse and falls back to Playwright scrolling.
  • Optional login flow using a persistent Playwright profile.
  • Export to JSON and CSV, plus a cleaned CSV in data/processed.
  • Configurable settings via config/settings.yaml.
  • Basic proxy support (if enabled in settings).

Note: Reply scraping is still in progress and not reliable yet.

Future plan

  • Scrape replies in parallel with smarter resource management and better session handling.

How to use

  1. Clone the repo:
    • git clone <REPO_URL>
  2. Open a terminal and go into the project folder:
    • cd Threads-Scraper
  3. Install Python dependencies:
    • python -m pip install -r requirements.txt
  4. If you want live scraping (not offline), install Playwright browsers:
    • python -m playwright install chromium
  5. Open config/settings.yaml and review the settings.
  6. (Optional but recommended) Log in once to lift public visibility limits:
    • Run: python src/main.py --login --profile-dir data/playwright-profile
    • A browser opens. Log into Threads, then return to the terminal and press Enter.
  7. Run a scrape:
    • Example: python src/main.py --usernames example_user --limit 100
    • Example (two users + logged-in profile): python src/main.py --usernames example_user another_user --limit 50 --profile-dir data/playwright-profile
    • Example (offline test data): python src/main.py --offline --usernames example_user --limit 20
  8. Find your results in output/ and data/processed.

Login notes:

  • Scraping without login is supported, but Threads often limits how many posts you can see in public mode.
  • Logging in via a persistent Playwright profile usually increases the number of posts collected.

Settings guide (config/settings.yaml)

General:

  • base_url: Threads base URL.
  • timeout: request/page timeout in seconds.
  • use_offline: true to use local sample data instead of live scraping.
  • use_proxies: true to use proxies from data/raw/proxies.json.
  • limit: max posts per user (best-effort; public mode may return fewer).
  • dump_raw_items: true to save raw payloads to data/raw for debugging.
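As a rough sketch, the general block of config/settings.yaml might look like the following. The keys are the ones listed above; the values are illustrative assumptions, not the project's shipped defaults:

    # General (example values only)
    base_url: "https://www.threads.net"   # assumed Threads base URL
    timeout: 30                           # seconds
    use_offline: false                    # true = use local sample data
    use_proxies: false                    # true = read proxies from data/raw/proxies.json
    limit: 100                            # max posts per user (best-effort)
    dump_raw_items: false                 # true = save raw payloads to data/raw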

Replies:

  • scrape_replies: enable reply scraping.
  • replies_limit: max replies per thread (best-effort).
  • replies_workers: parallel workers for replies (limited when using a persistent profile).
  • skip_zero_replies: skip threads with reply_count = 0.
  • replies_use_persistent_profile: force replies to use a single persistent profile (disables parallelism).
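A hedged example of the replies block, again with assumed values:

    # Replies (example values only)
    scrape_replies: true
    replies_limit: 50                      # best-effort cap per thread
    replies_workers: 4                     # parallelism is limited with a persistent profile
    skip_zero_replies: true
    replies_use_persistent_profile: false  # true forces one profile and disables parallelism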

Online mode tuning:

  • max_scrolls / replies_max_scrolls: how many scroll cycles to attempt.
  • scroll_pause_ms / replies_scroll_pause_ms: delay after each scroll to let requests finish.
  • page_settle_ms / replies_page_settle_ms: initial wait after first load.
  • stagnant_scrolls / replies_stagnant_scrolls: stop after N scrolls with no new items.
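For orientation, the tuning keys might be set like this; the numbers are placeholders to show the shape of the block, not recommended values:

    # Online mode tuning (example values only)
    max_scrolls: 20
    replies_max_scrolls: 10
    scroll_pause_ms: 1500
    replies_scroll_pause_ms: 1500
    page_settle_ms: 3000
    replies_page_settle_ms: 3000
    stagnant_scrolls: 3
    replies_stagnant_scrolls: 3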

Login/session:

  • playwright_headless: false to see the browser (needed for login).
  • playwright_user_data_dir: path to the saved browser profile for persistent login.
  • cookie: optional raw Cookie header (can also be set via THREADS_COOKIE in .env).
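A sketch of the login/session block, assuming the profile directory used in the login step above; the cookie value is deliberately left empty:

    # Login/session (example values only)
    playwright_headless: false                          # show the browser so you can log in
    playwright_user_data_dir: data/playwright-profile   # persistent profile written by --login
    cookie: ""                                          # optional raw Cookie header; or set THREADS_COOKIE in .env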

Defaults:

  • usernames: fallback list used when no --usernames are provided.
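For example, a defaults block using the usernames from the examples above (purely illustrative):

    # Defaults (example values only)
    usernames:
      - example_user
      - another_user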

Contributing

Contributions are welcome.