A modern, modular SEO analysis toolkit for Python. Run focused page-level audits or full site crawls, capture technical/content issues with clear severities, and export structured data for reporting. Built with extensibility in mind and designed for practical, actionable insights.
- On-page, Technical, and Content analyzers with unified scoring
- Full Site Audit (Ahrefs-style) with concurrency, filtering, and exports
- LLM/AI directives checklist (llms.txt / ai.txt)
- Optional Lighthouse/CrUX metrics via PageSpeed Insights API
- Duplicate detection across titles, descriptions, and visible text
- Link graph, redirect chains/loops, status distribution, and internal link suggestions
- REST API (Flask) and rich CLI with mobile-first and JS rendering options
- Overview
- Features
- Project Structure
- Quick Start
- CLI Usage
- API Usage
- Configuration
- Output & Exports
- Optional Dependencies
- Roadmap
- Contributing & License
The analyzer is split into focused subpackages: on_page, technical, content, scoring, and site_audit. Each module exposes a small, well-defined surface and can be extended independently. The CLI supports both single-page analysis and site-wide crawling with concurrency and filters. Results are JSON-first with optional CSV exports for pages, issues, and link edges.
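Because output is JSON-first, reports are easy to post-process. For illustration, a tiny sketch that loads the most recent single-page report (filename pattern described under Quick Start) and prints the scoring block; it assumes the `<timestamp>` suffix makes filenames sort chronologically:

```python
import glob
import json

# Newest report by filename; assumes the timestamp suffix sorts lexicographically.
latest = max(glob.glob("reports/seo_report_*.json"))
with open(latest) as fh:
    report = json.load(fh)

# Top-level layout as documented under "Output & Exports" below.
print(report["seo_attributes"]["ScoringModule"])
```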
- On-Page Analysis
  - Title/meta description presence and lengths, duplication hints
  - Heading structure (H1–H6), multiple H1 detection
  - Image alt text, responsive image patterns, basic layout red flags
  - Link audit (internal/external, broken links, rel attributes, unsafe cross-origin links)
  - Content stats (word count, paragraphs, lorem ipsum placeholders)
  - URL structure (length, depth, case), deprecated tags, inline CSS
  - Social tags (Open Graph, Twitter Cards), favicon
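For a flavor of the title/meta length checks, here is a minimal standalone sketch using `requests` and BeautifulSoup with the thresholds from the example config below (title 20–70 chars, description 70–160 chars); it is illustrative, not the project's actual implementation:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com", timeout=12).text
soup = BeautifulSoup(html, "html.parser")

title = (soup.title.string or "").strip() if soup.title else ""
meta = soup.find("meta", attrs={"name": "description"})
desc = (meta.get("content") or "").strip() if meta else ""

issues = []
if not 20 <= len(title) <= 70:
    issues.append(f"title length {len(title)} outside 20-70")
if not 70 <= len(desc) <= 160:
    issues.append(f"meta description length {len(desc)} outside 70-160")
print(issues or "title/description lengths OK")
```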
- Technical SEO
  - Crawlability/indexability: doctype, charset, viewport, AMP, language, hreflang, canonical, robots meta, structured data (JSON-LD/Microdata)
  - Network & headers: HTTP version, HSTS, server signature, cache headers, CDN hints
  - Performance: DOM size, gzip, TTFB, optional PSI (Lighthouse/CrUX)
  - Security: HTTPS usage, mixed content, plaintext emails, meta refresh
  - Site-level: redirect chain tracing, custom 404, directory browsing, SPF, ads.txt
  - Assets: caching headers for CSS/JS/images; minification heuristics for CSS/JS
  - LLMs: `llms.txt` / `ai.txt` detection and checklist with recommendations (probe sketched after this list)
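The `llms.txt` / `ai.txt` check boils down to probing two well-known paths at the site root; a rough sketch of that detection step (the toolkit's checklist adds recommendations on top):

```python
import requests
from urllib.parse import urljoin

def detect_ai_directives(base_url: str) -> dict:
    """Return which of the AI-directive files exist at the site root."""
    found = {}
    for name in ("llms.txt", "ai.txt"):
        resp = requests.get(urljoin(base_url, "/" + name), timeout=12)
        found[name] = resp.status_code == 200 and bool(resp.text.strip())
    return found

print(detect_ai_directives("https://www.example.com"))
```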
- Content Analysis
  - Keyword extraction and target keyword usage
  - Readability (Flesch Reading Ease)
  - Text-to-HTML ratio
  - Spellcheck (optional dependency)
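Two of these metrics are easy to show concretely: a back-of-the-envelope sketch of Flesch Reading Ease (206.835 − 1.015 · words/sentences − 84.6 · syllables/words, here with a naive vowel-group syllable count) and the text-to-HTML ratio. The toolkit's own implementations live in `readability.py` and `ratio.py`; this is only an illustration:

```python
import re

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    # Naive syllable estimate: count groups of consecutive vowels per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

def text_to_html_ratio(html: str) -> float:
    visible = re.sub(r"<[^>]+>", " ", html)  # crude tag strip, good enough here
    return len(visible.strip()) / max(1, len(html))

page = "<html><body><p>Readable text is short. It uses small words.</p></body></html>"
print(round(flesch_reading_ease("Readable text is short. It uses small words."), 1))
print(round(text_to_html_ratio(page), 2))
```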
- Scoring
  - Category scores (On-Page, Technical, Content) and an overall score
  - Configurable weights and category emphasis
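The category emphasis works as a weighted average; a sketch using the `category_weights` from the example config under Configuration (the `weights` map there suggests per-check weighting inside each category as well):

```python
def overall_score(category_scores: dict, category_weights: dict) -> float:
    """Weighted average of category scores, normalized by total weight."""
    total = sum(category_weights.values())
    return sum(category_scores[c] * w for c, w in category_weights.items()) / total

scores = {"OnPage": 82.0, "Technical": 74.0, "Content": 90.0}
weights = {"OnPage": 0.40, "Technical": 0.35, "Content": 0.25}
print(round(overall_score(scores, weights), 1))  # 81.2
```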
- Full Site Audit
  - Crawler with robots.txt respect, include/exclude filters, subdomain toggle, depth/page caps, rate limiting, and optional JS rendering for discovery
  - Concurrent per-page analysis (see the sketch after this list)
  - Issues with severity (error/warning/notice) across HTTP, redirects, sitemap, canonical, indexing, content/meta, links, international, performance, and security
  - Status distribution, redirect loops, duplicate titles/meta/visible text, internal link graph (in/out degree), and heuristic internal link suggestions
  - Optional exports: `pages.csv`, `issues.csv`, `edges.csv`
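How rate limiting and worker concurrency interact is worth making concrete: fetches are throttled to a site-wide requests-per-second budget while a thread pool analyzes pages in parallel. A simplified sketch, not the project's crawler:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Enforce a minimum interval between requests across all threads."""
    def __init__(self, rps: float):
        self.interval = 1.0 / rps
        self.lock = threading.Lock()
        self.next_at = 0.0

    def wait(self):
        with self.lock:
            now = time.monotonic()
            delay = max(0.0, self.next_at - now)
            self.next_at = max(now, self.next_at) + self.interval
        time.sleep(delay)

limiter = RateLimiter(rps=1.5)   # matches --rate-limit 1.5

def fetch_and_analyze(url: str) -> str:
    limiter.wait()               # throttle fetches site-wide
    return f"analyzed {url}"     # placeholder for the per-page analysis

urls = [f"https://www.example.com/page{i}" for i in range(5)]
with ThreadPoolExecutor(max_workers=6) as pool:  # matches --workers 6
    for result in pool.map(fetch_and_analyze, urls):
        print(result)
```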
```
seo-analyzer/
├── app.py                     # CLI & API entrypoint
├── requirements.txt
├── modules/
│   ├── __init__.py
│   ├── base_module.py         # Base with session, retries, headers
│   ├── on_page/
│   │   ├── __init__.py
│   │   ├── analyzer.py        # On-page orchestrator
│   │   ├── text_utils.py
│   │   ├── title_meta.py
│   │   ├── headings_links_images.py
│   │   └── social_misc.py
│   ├── technical/
│   │   ├── __init__.py
│   │   ├── analyzer.py        # Technical orchestrator
│   │   ├── network.py
│   │   ├── html_core.py
│   │   ├── metrics.py
│   │   ├── site_checks.py
│   │   ├── assets.py
│   │   ├── llms_txt.py        # LLMs/AI directives checklist
│   │   └── performance_api.py # PageSpeed Insights (optional)
│   ├── content/
│   │   ├── __init__.py
│   │   ├── analyzer.py
│   │   ├── text_utils.py
│   │   ├── keywords.py
│   │   ├── readability.py
│   │   ├── ratio.py
│   │   └── spellcheck.py
│   ├── scoring/
│   │   ├── __init__.py
│   │   ├── analyzer.py
│   │   ├── weights.py
│   │   ├── util.py
│   │   ├── on_page.py
│   │   ├── technical.py
│   │   └── content.py
│   └── site_audit/
│       ├── __init__.py
│       ├── crawler.py         # Discovery crawler
│       ├── render.py          # Optional Playwright renderer
│       ├── audit.py           # Crawl + analyze + aggregate
│       ├── issues.py          # Issue model & derivation
│       ├── duplication.py     # Duplicate grouping helpers
│       ├── sitemap.py         # Sitemap parsing & bucketing
│       ├── export.py          # CSV exporters
│       └── compare.py         # Diff between audit reports
└── README.md
```
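`base_module.py` is described as providing a shared session with retries and headers; a plausible sketch of that setup, built from the `Global` config values in the example under Configuration (retry total, backoff, status forcelist, UA). The project's exact implementation may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    retry = Retry(
        total=2,                                      # http_retries_total
        backoff_factor=0.2,                           # http_backoff_factor
        status_forcelist=[429, 500, 502, 503, 504],   # http_status_forcelist
        allowed_methods=["HEAD", "GET", "OPTIONS"],   # http_allowed_retry_methods
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": "Mozilla/5.0 ...",              # placeholder UA, per the config
        "Accept-Language": "en-US,en;q=0.8",
    })
    return session

resp = build_session().get("https://www.example.com", timeout=12)
```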
- Python env
  - Python 3.8+
  - Optional: `python -m venv venv && source venv/bin/activate`
- Install
  - `pip install -r requirements.txt`
- Optional dependencies:
  - Playwright (JS rendering): `pip install playwright && playwright install`
  - PSI (Lighthouse/CrUX): needs a Google API key (config below)
- Single-Page Audit (CLI)
  - `python app.py https://www.example.com`
  - Saves the report to `reports/seo_report_<domain>_<timestamp>.json`
- Full Site Audit (CLI)
  - Example (mobile UA, filters, concurrency, exports):

    ```bash
    python app.py https://www.example.com \
      --full-audit --max-pages 200 --max-depth 3 \
      --respect-robots --rate-limit 1.5 --workers 6 --mobile \
      --export-csv reports/example_audit \
      --include-path /blog --exclude-path re:^/admin --render-js
    ```

  - Output:
    - JSON report at `reports/site_audit_<domain>_<timestamp>.json`
    - If `--export-csv` is provided: `pages.csv`, `issues.csv`, `edges.csv`
- Single page:
  - `python app.py <URL> [--keywords ...] [--config path.json] [--output json|txt]`
- Full site audit:
  - `--full-audit`: enable crawl + multi-page analysis
  - `--max-pages`, `--max-depth`, `--rate-limit`
  - `--include-subdomains`, `--same-domain-only`, `--respect-robots`/`--ignore-robots`
  - `--include-path`, `--exclude-path` (prefix or `re:<pattern>`; repeatable)
  - `--workers` (concurrent analysis), `--mobile` (mobile UA), `--render-js` (Playwright)
  - `--auth-user`, `--auth-pass` for basic auth
  - `--export-csv <dir>` for CSV exports
  - `--compare-report <file>` to diff two site audit JSONs
Run without a URL to start the API:
- `python app.py`
- POST/GET `http://127.0.0.1:5000/analyze?url=https://www.example.com`
- Optional `keywords` parameter (CSV or JSON array)
- The response mirrors the single-page JSON structure.
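For example, hitting the endpoint with `requests` (the query parameters and the `seo_attributes` keys are documented here; everything else is a plain HTTP call):

```python
import requests

resp = requests.get(
    "http://127.0.0.1:5000/analyze",
    params={"url": "https://www.example.com", "keywords": "seo,audit"},
    timeout=60,
)
report = resp.json()
print(report["seo_attributes"]["ScoringModule"])  # category and overall scores
```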
Config may be supplied via `--config path.json` or edited in `app.py`'s `DEFAULT_CONFIG`.
Example snippet:
```json
{
  "OnPageAnalyzer": {
    "title_min_length": 20,
    "title_max_length": 70,
    "desc_min_length": 70,
    "desc_max_length": 160
  },
  "TechnicalSEOAnalyzer": {
    "enable_pagespeed_insights": true,
    "psi_api_key": "YOUR_GOOGLE_API_KEY",
    "psi_strategy": "mobile",
    "max_inline_js_to_check_minification": 3,
    "max_js_to_check_minification": 10
  },
  "ContentAnalyzer": {
    "top_n_keywords_count": 10,
    "spellcheck_language": "en"
  },
  "ScoringModule": {
    "weights": {},
    "category_weights": { "OnPage": 0.40, "Technical": 0.35, "Content": 0.25 }
  },
  "FullSiteAudit": {
    "max_pages": 150,
    "max_depth": 3,
    "respect_robots": true,
    "same_domain_only": true,
    "include_subdomains": false,
    "rate_limit_rps": 1.5,
    "workers": 6,
    "include_paths": ["/blog"],
    "exclude_paths": ["re:^/admin"],
    "render_js": true
  },
  "Global": {
    "request_timeout": 12,
    "user_agent": "Mozilla/5.0 ...",
    "accept_language": "en-US,en;q=0.8",
    "http_retries_total": 2,
    "http_backoff_factor": 0.2,
    "http_status_forcelist": [429, 500, 502, 503, 504],
    "http_allowed_retry_methods": ["HEAD", "GET", "OPTIONS"]
  }
}
```
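A `--config` file only needs the keys being overridden; a sketch of how such a file might be layered over `DEFAULT_CONFIG` with a recursive merge (the merge helper is illustrative, and the actual merging in `app.py` may differ):

```python
import json

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base`, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

DEFAULT_CONFIG = {"OnPageAnalyzer": {"title_min_length": 20, "title_max_length": 70}}
with open("path.json") as fh:  # the file passed via --config
    config = deep_merge(DEFAULT_CONFIG, json.load(fh))
```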
- Single-page JSON (top-level):
  - `seo_attributes.OnPageAnalyzer` (title/meta, headings, links, images, content stats, URL checks)
  - `seo_attributes.TechnicalSEOAnalyzer` (headers, protocol, indexability, structured data, assets, PSI if enabled, llms.txt, redirects, robots/sitemap, SPF, ads.txt)
  - `seo_attributes.ContentAnalyzer` (keywords, readability, ratio, spelling)
  - `seo_attributes.ScoringModule` (category and overall scores)
- Site audit JSON:
  - `site_audit.summary`: status distribution, redirect loops, health score, duplicate groups, link graph metrics, sitemap summary, aggregate scores
  - `site_audit.pages`: list of per-URL page results (same structure as the single-page attributes)
  - `site_audit.issues`: flattened issues with `url`, `code`, `title`, `severity`, `category`, `details`
  - `site_audit.config_used`: the crawl and worker config used; optional `exports` with CSV paths
- CSVs (if `--export-csv`):
  - `pages.csv`: URL, scores, HTTP status, TTFB, canonical/sitemap flags, schema, word count, H1 count, links, title, meta description
  - `issues.csv`: URL, code, title, severity, category, details
  - `edges.csv`: source, target, rel (internal link graph)
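The CSVs compose well with ad-hoc scripts; for instance, recomputing link-graph in/out degree from `edges.csv` to spot under-linked pages (the export directory matches the CLI example above):

```python
import csv
from collections import Counter

in_deg, out_deg = Counter(), Counter()
with open("reports/example_audit/edges.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        out_deg[row["source"]] += 1
        in_deg[row["target"]] += 1

# Pages that are rarely linked to; note that true orphans (zero in-links)
# never appear in edges.csv at all, so cross-check against pages.csv.
for url, degree in sorted(in_deg.items(), key=lambda kv: kv[1])[:10]:
    print(url, "in:", degree, "out:", out_deg[url])
```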
- `pyspellchecker`: content spell checks
- `dnspython`: SPF lookups
- `Pillow`: optional image-related utilities
- `flask`: API mode
- `playwright`: optional JS rendering for discovery (`--render-js`)
- PageSpeed Insights: requires a Google API key (`enable_pagespeed_insights`)
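These degrade gracefully when absent; the usual guard pattern looks like this (a generic sketch, not the project's exact code):

```python
try:
    from spellchecker import SpellChecker  # provided by pyspellchecker
except ImportError:
    SpellChecker = None

def spellcheck_words(words, language="en"):
    """Return misspelled words, or None when pyspellchecker is not installed."""
    if SpellChecker is None:
        return None  # the check is skipped rather than failing the whole run
    return SpellChecker(language=language).unknown(words)

print(spellcheck_words(["analyz", "audit"]))
```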
- Rendered HTML analysis (Playwright) for per-page analyzers and JS error capture
- Deeper structured data validation (rule-based, 190+ checks)
- Expanded issue catalog and weighting
- PSI/CrUX integration into health scoring/outlier detection
- GSC/GA integrations and IndexNow submissions
- Segmented crawling and richer URL detail panels
- Contributions welcome! Please open issues/PRs for features and fixes.
- MIT License. See `LICENSE`.