The original repo shipped without the core scraping code, so I built my own working version here.
Features:
- Scrape Threads posts by username (online or offline).
- Offline mode with deterministic sample data for testing.
- Online mode that tries a fast HTTP parse and falls back to Playwright scrolling.
- Optional login flow using a persistent Playwright profile.
- Export to JSON and CSV, plus a cleaned CSV in data/processed.
- Configurable settings via config/settings.yaml.
- Basic proxy support (if enabled in settings).
Note: Reply scraping is still in progress and not reliable yet.
Planned:
- Scrape replies in parallel with smarter resource management and better session handling.
Setup:
- Clone the repo:
- git clone <REPO_URL>
- Open a terminal and go into the project folder:
- cd Threads-Scraper
- Install Python dependencies:
- python -m pip install -r requirements.txt
- If you want live scraping (not offline), install Playwright browsers:
- python -m playwright install chromium
- Open config/settings.yaml and review the settings.
- (Optional but recommended) Log in once to lift public visibility limits:
- Run: python src/main.py --login --profile-dir data/playwright-profile
- A browser opens. Log into Threads, then return to the terminal and press Enter.
- Run a scrape:
- Example: python src/main.py --usernames example_user --limit 100
- Example (two users + logged-in profile): python src/main.py --usernames example_user another_user --limit 50 --profile-dir data/playwright-profile
- Example (offline test data): python src/main.py --offline --usernames example_user --limit 20
- Find your results in output/ and data/processed.
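Once a run finishes, the JSON export can be post-processed with a few lines of Python. The file name `sample_export.json` and the field names (`username`, `text`) below are stand-ins for illustration, not the scraper's guaranteed schema:

```python
import json
from pathlib import Path

def summarize_posts(path):
    """Load a JSON export and return (total posts, posts-per-user tally)."""
    posts = json.loads(Path(path).read_text(encoding="utf-8"))
    per_user = {}
    for post in posts:
        user = post.get("username", "unknown")  # field name assumed
        per_user[user] = per_user.get(user, 0) + 1
    return len(posts), per_user

# Stand-in export so the example is runnable without a real scrape:
sample = [{"username": "example_user", "text": "hello"},
          {"username": "example_user", "text": "again"}]
Path("sample_export.json").write_text(json.dumps(sample), encoding="utf-8")

total, per_user = summarize_posts("sample_export.json")
print(total, per_user)  # → 2 {'example_user': 2}
```

Point `summarize_posts` at a real file from output/ once you have one; only the keys you index into may need adjusting.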
Login notes:
- Scraping without login is supported, but Threads often limits how many posts you can see in public mode.
- Logging in via a persistent Playwright profile usually increases the number of posts collected.
Settings (config/settings.yaml):

General:
- base_url: Threads base URL.
- timeout: request/page timeout in seconds.
- use_offline: true to use local sample data instead of live scraping.
- use_proxies: true to use proxies from data/raw/proxies.json.
- limit: max posts per user (best-effort; public mode may return fewer).
- dump_raw_items: true to save raw payloads to data/raw for debugging.
Replies:
- scrape_replies: enable reply scraping.
- replies_limit: max replies per thread (best-effort).
- replies_workers: parallel workers for replies (limited when using a persistent profile).
- skip_zero_replies: skip threads with reply_count = 0.
- replies_use_persistent_profile: force replies to use a single persistent profile (disables parallelism).
Online mode tuning:
- max_scrolls / replies_max_scrolls: how many scroll cycles to attempt.
- scroll_pause_ms / replies_scroll_pause_ms: delay after each scroll to let requests finish.
- page_settle_ms / replies_page_settle_ms: initial wait after first load.
- stagnant_scrolls / replies_stagnant_scrolls: stop after N scrolls with no new items.
Login/session:
- playwright_headless: false to see the browser (needed for login).
- playwright_user_data_dir: path to the saved browser profile for persistent login.
- cookie: optional raw Cookie header (can also be set via THREADS_COOKIE in .env).
Defaults:
- usernames: fallback list used when no --usernames are provided.
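Putting the options above together, a config/settings.yaml might look like the sketch below. The values are placeholders, and the flat layout is an assumption; match the structure of the settings file actually shipped in the repo:

```yaml
# config/settings.yaml — illustrative values only
base_url: "https://www.threads.net"
timeout: 30               # seconds
use_offline: false
use_proxies: false
limit: 100
dump_raw_items: false

scrape_replies: false
replies_limit: 50
replies_workers: 2
skip_zero_replies: true
replies_use_persistent_profile: false

max_scrolls: 20
scroll_pause_ms: 800
page_settle_ms: 2000
stagnant_scrolls: 3

playwright_headless: true
playwright_user_data_dir: "data/playwright-profile"
# cookie: ""              # or set THREADS_COOKIE in .env

usernames:
  - example_user
```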
Contributions are welcome.