Tools for tracking and studying LLM web search behavior by:
- running automated web-search experiments (ChatGPT / Claude),
- extracting queries + visited URLs from session logs (HAR/SSE),
- scraping search engine result pages (SERPs),
- matching/evaluating accessed URLs vs. SERP rankings,
- generating stats + visualizations.
- LLM Search Automation (LLM_search_automation/)
  - 2.1. Pipeline Overview
  - 2.2. Requirements
  - 2.3. Data Preparation (data/)
  - 2.4. GPT Automation (GPT/)
  - 2.5. Claude Automation (Claude/)
  - 2.6. Outputs & Conventions
  - 2.7. Troubleshooting & Tips
- Stats Generation
  - 4.1. Organize Results Directory
  - 4.2. Aggregate to JSON (a.py)
  - 4.3. Run Notebooks
LLM_search_automation/ # ChatGPT/Claude web-search automation (HAR + SSE capture)
src/ # SERP scraping, matching, evaluation, datasets, results
artifacts/ # Scripts for parsing data from previous runs, plus links to past run data
stats.ipynb # Analysis notebook (uses aggregated JSON)
accessed_urls_stats.ipynb # Analysis notebook (uses aggregated JSON)
- Prompt: what you send to the LLM.
- Query: what the LLM (or you) submits to a search engine.
You are strongly encouraged to use a VPN/proxy when scraping SERPs to reduce the chance of IP blocking.
Run LLM web-search experiments and capture:
- HAR network logs
- SSE (streamed deltas) where available
- answers (or reconstructed answers for Claude)
Supported:
- ChatGPT (chatgpt.com)
- Claude (claude.ai)
- Prepare ORCAS-I per-label CSVs
- ChatGPT: query with web search; capture HAR + SSE + answers
- Claude: query (web on by default); capture HAR + SSE; reconstruct answers from SSE
Notes:
- Claude deletion: GUI Select all → Delete all chats (no script)
- Claude answers: reconstructed post-run from SSE because UI save isn’t reliable
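What a captured HAR actually holds can be inspected directly. The minimal sketch below (Python, standard HAR 1.2 fields) lists every request in one per-prompt HAR and flags the injected SSE stream; the file name is hypothetical and will vary by run.

# Sketch: list the requests captured in one per-prompt HAR and flag the SSE stream.
# "network-logs-prompt-1.har" is a hypothetical file name; field paths follow the HAR 1.2 spec.
import json

with open("network-logs-prompt-1.har", encoding="utf-8") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    url = entry["request"]["url"]
    mime = entry["response"]["content"].get("mimeType", "")
    size = entry["response"].get("bodySize", -1)
    tag = "SSE" if "text/event-stream" in mime else "   "
    print(f"{tag} {size:>10} {url}")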
- Node.js ≥ 18
- Python ≥ 3.9
- Chrome/Chromium
Install deps:
npm install puppeteer-extra puppeteer-extra-plugin-stealth puppeteer-har csv-parse dotenv
pip install pandas

- Input: ORCAS-I-gold.tsv (must include label_manual)
- Script: create_csvs.py → shuffles + splits into data/by_label_csvs/
cd LLM_search_automation/data
python create_csvs.py
cd ..
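Conceptually, the split step amounts to the following. This is a minimal sketch, not the actual create_csvs.py: only the label_manual column and the output naming pattern are taken from this README; everything else (separator, shuffle seed) is an assumption.

# Sketch of the per-label split: shuffle ORCAS-I-gold.tsv and write one CSV per label_manual value.
# Output names mirror the files referenced later (e.g. ORCAS-I-gold_label_Factual.csv).
import pandas as pd

df = pd.read_csv("ORCAS-I-gold.tsv", sep="\t")
df = df.sample(frac=1, random_state=42)  # shuffle rows
for label, group in df.groupby("label_manual"):
    group.to_csv(f"by_label_csvs/ORCAS-I-gold_label_{label}.csv", index=False)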
Organize dataset CSVs for a run:

mkdir -p dataset
cp data/by_label_csvs/*.csv dataset/

Create LLM_search_automation/GPT/.env:
OPENAI_EMAIL=your_email@example.com
OPENAI_PASSWORD=your_password

Run:
cd LLM_search_automation/GPT
node sign_in.js

Session persists in ./puppeteer-profile (complete 2FA if prompted).
const modelName = "gpt-5";
const datasetRunNum = "1";
const csvJobs = [
{ file: `./dataset/ORCAS-I-gold_label_Abstain.csv`, outDir: `abstain_${modelName}_${datasetRunNum}` },
{ file: `./dataset/ORCAS-I-gold_label_Factual.csv`, outDir: `factual_${modelName}_${datasetRunNum}` }
];

node index.js

What it does:
- enables Web search in UI (“+ → More → Web search”)
- sends each CSV query as a prompt
- saves per-prompt HAR (SSE injected) + response
- writes prompts.jsonl, prompts.txt, source_meta.txt
node delete_chats.js

cd LLM_search_automation/Claude
node sign_in.js

Session persists in ./puppeteer-profile-claude.
const modelName = "opus-4.1";
const datasetRunNum = "1";
const csvJobs = [
{ file: `./dataset/ORCAS-I-gold_label_Factual.csv`, outDir: `factual_${modelName}_${datasetRunNum}` }
];

node index.js

Captures per-prompt HAR (SSE injected) + visible answer (may be [no answer captured] pre-reconstruction).
python reconstruct_answers.py

Parses Claude HAR SSE deltas, concatenates the final text, and replaces [no answer captured] in response files.
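The reconstruction idea, in a minimal sketch (this is not the actual reconstruct_answers.py; the delta field names are assumptions and need adjusting to Claude's real SSE event schema):

# Sketch: pull streamed text fragments out of a HAR whose SSE body was injected, and join them.
import json

def reconstruct_answer(har_path):
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    parts = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        if "text/event-stream" not in content.get("mimeType", ""):
            continue
        for line in content.get("text", "").splitlines():
            if not line.startswith("data:"):
                continue
            try:
                payload = json.loads(line[len("data:"):].strip())
            except json.JSONDecodeError:
                continue  # keep-alive or non-JSON event
            # "delta"/"text" are assumed field names; adjust to the actual event schema
            parts.append(payload.get("delta", {}).get("text", "") or "")
    return "".join(parts)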
For each (category, model, run):
<category>_<model>_<run>/
<category>_hars_<model>_<run>/ # per-prompt HAR (SSE injected)
<category>_responses_<model>_<run>/ # response-*.txt
prompts.jsonl
prompts.txt
source_meta.txt
- Selectors change: update selectors/text lookups if DOM shifts.
- Timeouts/flakiness: increase timeouts, add sleeps, or run smaller CSVs.
- Concurrency: prefer serial prompts with one visible browser.
- Claude deletion: GUI Select all → Delete all chats.
src/ contains:
- SERP scrapers (Bing/Google/Brave)
- tooling to process HAR files
- matching/evaluation code (accessed URLs vs SERP ranks)
- datasets/ and results/ folders
If you don’t want to run automation, download pre-generated HAR files from the Google Drive link referenced in:
src/dataset/README.md
After downloading:
- extract HAR files → src/datasets/
- extract result folders → src/results/, organized by category (abstain/, factual/, etc.)
main.py:
- parses HAR files from LLM web-search sessions
- extracts search queries
- scrapes SERP results for those queries
- evaluates accessed URLs against SERP results
- writes CSV outputs + evaluation reports
python3 main.py --har-files dellsupport.har beetjuice.har -s google bing -m 500 -i 50 -o results

- --har-files (required): list of .har files to process
- -s, --search-engines: default ['bing', 'google'] — choose from bing, google, brave, ddg
- -m, --max-se-index: max SERP rank index to scrape up to (default: 250)
- -i, --index-interval: batch size for scraping (recommend ≤ 50) (default: 50)
- -o, --output-dir: output directory (default: outputs)
- -l, --logs-print: enable detailed logging (default: False)
Example:
outputs/harname_20260121_120000/
harname_idx_engine_query.csv
urls_to_eval_20260121_120000.txt
evaluation_results_20260121_120000.txt
query_meta.json
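Under the hood, the evaluation boils down to checking whether each accessed URL shows up in a scraped SERP and at what rank. A rough sketch follows; the file names reuse the example layout above, and the url/rank column names are assumptions about the CSV header, not the repo's actual code.

# Sketch of the matching step: accessed URLs vs. one scraped SERP CSV.
# Column names "url" and "rank" are assumptions; adjust to the actual CSV header.
import pandas as pd

serp = pd.read_csv("harname_1_google_query.csv")
rank_by_url = dict(zip(serp["url"], serp["rank"]))

with open("urls_to_eval_20260121_120000.txt", encoding="utf-8") as f:
    accessed = [line.strip() for line in f if line.strip()]

for url in accessed:
    rank = rank_by_url.get(url)
    print(f"{url} -> {'rank ' + str(rank) if rank is not None else 'not in SERP'}")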
Available scrapers:
- bing_scraper.py — WebScrapingAPI (WSA_API_KEY) or Oxylabs (OXY_USERNAME/OXY_PASSWORD)
- google_scraper.py — serper.dev (API_KEY)
- brave_scraper.py — WebScrapingAPI or Brave API (BRAVE_API_KEY)
Create a .env file in the repo root:
# Google (serper.dev) — REQUIRED for --search-engines google
API_KEY=your_serper_api_key_here
# Bing & Brave (WebScrapingAPI) — OPTIONAL, used if OXY credentials not set
WSA_API_KEY=your_webscrapingapi_key_here
# Bing (Oxylabs Proxy) — OPTIONAL alternative to WebScrapingAPI
OXY_USERNAME=your_oxylabs_username
OXY_PASSWORD=your_oxylabs_password
# Brave (Brave Official API) — OPTIONAL, used as fallback
BRAVE_API_KEY=your_brave_api_key_here

Providers:
- serper.dev: https://serper.dev/
- WebScrapingAPI: https://www.webscrapingapi.com/
- Oxylabs: https://oxylabs.io/
- Brave API: https://api.search.brave.com/
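As a sanity check for the API_KEY setup, the request google_scraper.py makes is presumably along these lines (endpoint and fields follow serper.dev's public API; the script's actual parameters and result handling may differ):

# Sketch of a serper.dev Google SERP request; not the repo's actual scraper code.
import os
import requests

resp = requests.post(
    "https://google.serper.dev/search",
    headers={"X-API-KEY": os.environ["API_KEY"], "Content-Type": "application/json"},
    json={"q": "example query", "num": 50},
)
resp.raise_for_status()
for hit in resp.json().get("organic", []):
    print(hit.get("position"), hit.get("link"), hit.get("title"))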
Stats generation has three steps:

- organize main.py outputs into src/results/<category>/...
- run a.py to create aggregated_data.json
- analyze in the notebooks
Place main.py outputs under category folders:
src/results/
abstain/
network-logs-prompt-1_20260121_120000/
abstain_1_bing_query.csv
abstain_1_google_query.csv
abstain_1_brave_query.csv
urls_to_eval_20260121_120000.txt
evaluation_results_20260121_120000.txt
query_meta.json
factual/
network-logs-prompt-*_*/
...
- *.csv — SERP results from scrapers
- urls_to_eval_*.txt — deduped URLs accessed during the session
- evaluation_results_*.txt — evaluation report
- query_meta.json (or *_meta.json) — cited URLs + search strings
Run:
cd src/results
python a.py

Creates:
src/results/aggregated_data.json
{
"abstain": {
"network-logs-prompt-1_20260121_120000": {
"urls_from_prompt": ["https://example.com/page1"],
"urls_cited": ["https://example.com/page1"],
"search_string": ["query one", "query two"],
"bing_urls": [
{ "url": "https://example.com/page1", "page_title": "Page Title", "rank": 1, "search_string_num": 1 }
],
"google_urls": [],
"brave_urls": []
}
}
}

In stats.ipynb and accessed_urls_stats.ipynb:
import json
with open('aggregated_data.json', 'r', encoding='utf-8') as f:
data = json.load(f)
category = "abstain"
prompt_id = "network-logs-prompt-1_20260121_120000"
entry = data[category][prompt_id]
urls_cited = entry["urls_cited"]
search_queries = entry["search_string"]
bing_results = entry["bing_urls"]
google_results = entry["google_urls"]
brave_results = entry["brave_urls"]

Example analyses:
- accessed URLs vs SERP coverage (by rank)
- citation patterns by category
- search engine performance comparisons
- accessed vs cited URL overlap
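For instance, the first of these (accessed-URL coverage by SERP rank) can be computed straight from aggregated_data.json using only the fields shown above:

# Sketch: per category, how many accessed URLs appear in the Bing SERPs and at what mean rank.
import json

with open("aggregated_data.json", encoding="utf-8") as f:
    data = json.load(f)

for category, prompts in data.items():
    ranks, missed = [], 0
    for entry in prompts.values():
        serp_rank = {item["url"]: item["rank"] for item in entry["bing_urls"]}
        for url in entry["urls_from_prompt"]:
            if url in serp_rank:
                ranks.append(serp_rank[url])
            else:
                missed += 1
    total = len(ranks) + missed
    if not total:
        continue
    mean_rank = sum(ranks) / len(ranks) if ranks else float("nan")
    print(f"{category}: {len(ranks)}/{total} accessed URLs found in Bing SERPs, mean rank {mean_rank:.1f}")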
The /artifacts directory contains helper scripts for parsing data generated by previous runs of the system, along with Google Drive links for downloading the corresponding datasets from those runs.
Detailed information about the available data, including download links, file formats, data breakdowns, and usage instructions, is provided in the README.md inside the artifacts directory.
If you use this research or the associated data/tools in your work, please cite the following paper:
https://doi.org/10.1145/3774904.3792278
BibTeX:
@inproceedings{sayyidali2026llm,
author = {Sayyid-Ali, Abdur-Rahman Ibrahim and Khan, Daanish Uddin and Bhatti, Naveed Anwar},
title = {Are LLM Web Search Engines Sustainable? A Web-Measurement Study of Real-Time Fetching},
booktitle = {Proceedings of the ACM Web Conference 2026 (WWW '26)},
year = {2026},
address = {Dubai, United Arab Emirates},
publisher = {ACM},
doi = {10.1145/3774904.3792278},
url = {https://doi.org/10.1145/3774904.3792278}
}