A powerful, standalone Python tool for extracting all URLs and file extensions from a target website without relying on external tools like gau or urlfinder.
Author: ArkhAngelLifeJiggy
- **🎯 Comprehensive URL Discovery**: Extracts URLs from multiple sources (see the sketch after this list):
  - Wayback Machine (historical URLs)
  - Common Crawl (archived URLs)
  - Live website crawling
  - JavaScript file analysis
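A minimal sketch of the archive lookup, assuming the public Wayback Machine CDX API and the `requests` library; the tool's actual query parameters and error handling may differ, and `wayback_urls` is a hypothetical helper name:

```python
import requests

def wayback_urls(domain, limit=1000):
    """Query the Wayback Machine CDX API for historical URLs of a domain."""
    api = "http://web.archive.org/cdx/search/cdx"
    params = {
        "url": f"{domain}/*",   # every path under the domain
        "output": "json",       # JSON rows instead of plain text
        "fl": "original",       # return only the original URL column
        "collapse": "urlkey",   # collapse duplicates of the same normalized URL
        "limit": limit,
    }
    resp = requests.get(api, params=params, timeout=30)
    resp.raise_for_status()
    rows = resp.json()
    return [row[0] for row in rows[1:]]  # rows[0] is the header row

# Example:
# print(len(wayback_urls("example.com")))
```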
- **🔍 Advanced URL Categorization**:
  - Known URLs: Publicly available URLs from archives
  - Hidden URLs: URLs found in JavaScript files and comments
  - Internal URLs: URLs within the same domain
  - External URLs: URLs pointing to other domains
- **📊 Categorized File Extension Analysis** (see the sketch after this list):
  - JavaScript: `.js`, `.jsx`, `.ts`, `.coffee`, `.vue`
  - HTML: `.html`, `.htm`, `.xhtml`, `.xml`
  - CSS: `.css`, `.scss`, `.sass`, `.less`
  - Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.svg`, `.ico`
  - Documents: `.pdf`, `.doc`, `.docx`, `.txt`, `.md`, `.rtf`
  - Archives: `.zip`, `.rar`, `.tar`, `.gz`, `.7z`, `.bz2`
  - Media: `.mp4`, `.mp3`, `.avi`, `.mov`, `.wmv`, `.flv`
  - Other: All remaining extensions
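A sketch of how URLs could be bucketed by extension; the category map mirrors the list above, while `categorize_extension` is a hypothetical helper name, not necessarily the tool's own function:

```python
from urllib.parse import urlparse

# Category map mirroring the list above.
EXTENSION_CATEGORIES = {
    "javascript": {"js", "jsx", "ts", "coffee", "vue"},
    "html": {"html", "htm", "xhtml", "xml"},
    "css": {"css", "scss", "sass", "less"},
    "images": {"png", "jpg", "jpeg", "gif", "webp", "svg", "ico"},
    "documents": {"pdf", "doc", "docx", "txt", "md", "rtf"},
    "archives": {"zip", "rar", "tar", "gz", "7z", "bz2"},
    "media": {"mp4", "mp3", "avi", "mov", "wmv", "flv"},
}

def categorize_extension(url):
    """Return (category, extension) for a URL, or None if the path has no extension."""
    path = urlparse(url).path
    if "." not in path.rsplit("/", 1)[-1]:
        return None
    ext = path.rsplit(".", 1)[-1].lower()
    for category, extensions in EXTENSION_CATEGORIES.items():
        if ext in extensions:
            return category, ext
    return "other", ext

# categorize_extension("https://example.com/static/app.min.js") -> ("javascript", "js")
```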
- **🛡️ Advanced WAF Bypass & Anti-Detection** (see the sketch after this list):
  - User agent rotation (5+ realistic browsers)
  - IP spoofing with X-Forwarded-For headers
  - Smart delaying mechanisms with randomization
  - Custom proxy support (HTTP/HTTPS)
  - Cache control and connection header randomization
  - Request retry mechanism with exponential backoff
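A minimal sketch of the header-rotation idea, assuming the `requests` library; the user-agent strings, header choices, and `stealth_get` helper are illustrative, not the tool's exact internals:

```python
import random
import requests

# Illustrative pool; the tool ships its own list of realistic browser signatures.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def random_ip():
    """Build a random IPv4 address for the spoofed client headers."""
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def stealth_get(url, proxy=None, timeout=30):
    """GET a URL with a rotated user agent and randomized client headers."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "X-Forwarded-For": random_ip(),
        "X-Real-IP": random_ip(),
        "Cache-Control": random.choice(["no-cache", "max-age=0"]),
        "Connection": random.choice(["keep-alive", "close"]),
    }
    proxies = {"http": proxy, "https": proxy} if proxy else None
    return requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
```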
- **✅ Advanced Validation & Filtering** (see the sketch after this list):
  - False positive detection and removal (`data:`, `javascript:`, `mailto:`)
  - MD5-based duplicate prevention with hash collision detection
  - URL normalization and validation with regex patterns
  - Domain-based filtering with subdomain support
  - Extension-based filtering (include/exclude specific file types)
  - Pattern matching with regex include/exclude rules
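A sketch of the false-positive filtering and hash-based deduplication described above; `is_valid_url` and `deduplicate` are hypothetical helper names used for illustration:

```python
import hashlib
from urllib.parse import urlparse

# Prefixes treated as false positives, per the list above.
FALSE_POSITIVE_PREFIXES = ("data:", "javascript:", "mailto:", "tel:", "#")

def is_valid_url(url):
    """Reject false positives and URLs without a scheme and host."""
    if url.lower().startswith(FALSE_POSITIVE_PREFIXES):
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def deduplicate(urls):
    """MD5-hash-based deduplication that preserves the original order."""
    seen, unique = set(), []
    for url in urls:
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique
```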
- **📝 Comprehensive Logging System** (see the sketch after this list):
  - Automatic timestamped log file generation
  - INFO, WARNING, and ERROR level logging with context
  - Detailed operation tracking with performance metrics
  - Custom log file path support
  - Structured logging for debugging and analysis
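A sketch of a logging setup that would produce the timestamped format shown in the sample log further down; the logger name and defaults are assumptions, not the tool's exact configuration:

```python
import logging
from datetime import datetime

def setup_logging(log_file=None, verbose=False):
    """Configure console and file logging in the 'timestamp - LEVEL - message' format."""
    if log_file is None:
        # Auto-generate a timestamped log file name.
        log_file = f"url_extractor_{datetime.now():%Y%m%d_%H%M%S}.log"
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
    )
    return logging.getLogger("url_extractor")

# Example:
# log = setup_logging(verbose=True)
# log.info("Target: https://example.com")
```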
- **🎨 Beautiful Interactive Interface**:
  - Colorful terminal output with 8-color support
  - Real-time progress indicators and status updates
  - Professional ASCII banner with author credits
  - Categorized results display with statistics
  - Quiet mode and stats-only output options
- **📄 Multiple Output Formats** (CSV sketch after this list):
  - JSON: Structured data with categorized results and metadata
  - CSV: Spreadsheet-compatible format with URL, source, type, and extension columns
  - XML: Machine-readable format with a proper schema
  - Statistics only: Summary output for automation
  - Custom formatting: Flexible output options
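A sketch of the CSV layout described above; the column names follow the CSV description, while `write_csv` and the record dictionaries are illustrative assumptions:

```python
import csv

def write_csv(path, records):
    """Write extracted URLs to a CSV file with URL, source, type, and extension columns."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["url", "source", "type", "extension"])
        writer.writeheader()
        writer.writerows(records)

# Example:
# write_csv("results.csv", [
#     {"url": "https://example.com/app.js", "source": "wayback",
#      "type": "internal", "extension": "js"},
# ])
```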
- **⚙️ Advanced Configuration Options** (20+ parameters):
  - Crawling: `max-pages`, `max-depth`, `concurrency`, `timeout`
  - Security: `waf-bypass`, `user-agent`, `proxy`, `delay`, `retries`
  - Filtering: `exclude-extensions`, `include-only`, `exclude-pattern`
  - Output: `csv`, `xml`, `quiet`, `stats-only`, `save-html`
  - Debugging: `verbose`, `log-file`, `no-color`
- **⚡ Pure Python Implementation**: No external dependencies on Go tools
- **🔧 Configurable Parameters**: Fine-grained control over every aspect
- **📦 PyPI Package**: Ready for `pip install url-extractor-pro`
Install Python dependencies:

```bash
pip install -r requirements.txt
```

Basic usage:

```bash
python url_extractor.py https://example.com
```

Advanced usage:

```bash
python url_extractor.py https://example.com \
  --output results.json \
  --verbose \
  --waf-bypass \
  --delay 2.0 \
  --max-pages 200 \
  --max-depth 4 \
  --threads 20
```

Command-line options:

| Option | Description | Default |
|---|---|---|
| `target_url` | Target URL to extract URLs from (required) | - |
| `-o, --output` | Output file for results (JSON/CSV/XML) | - |
| `-v, --verbose` | Enable verbose output | False |
| `-p, --max-pages` | Maximum pages to crawl | 100 |
| `-d, --max-depth` | Maximum crawl depth | 3 |
| `-t, --threads` | Number of threads for concurrent processing | 10 |
| `--delay` | Delay between requests in seconds | 1.0 |
| `--waf-bypass` | Enable WAF bypass techniques | False |
| `--user-agent` | Custom user agent string | Random |
| `--proxy` | HTTP proxy (http://proxy:port) | - |
| `--timeout` | Request timeout in seconds | 30 |
| `--max-js-files` | Maximum JS files to analyze | 20 |
| `--concurrency` | Number of concurrent requests | 5 |
| `--retries` | Number of retries for failed requests | 3 |
| `--exclude-extensions` | Comma-separated extensions to exclude | - |
| `--include-only` | Include only URLs containing this string | - |
| `--exclude-pattern` | Exclude URLs matching a regex pattern | - |
| `--save-html` | Save HTML responses for analysis | False |
| `--quiet` | Suppress all output except results | False |
| `--stats-only` | Show only statistics, no URLs | False |
| `--csv` | Output in CSV format | False |
| `--xml` | Output in XML format | False |
| `--log-file` | Custom log file path | Auto-generated |
| `--no-color` | Disable colored output | False |
The tool provides a comprehensive summary including:
- Total number of unique URLs found
- URLs categorized by type (known, hidden, internal, external)
- File extensions organized by category
- Top extensions with counts
- Optional JSON export with all URLs and metadata
Example JSON output:

```json
{
"target_url": "https://example.com",
"domain": "example.com",
"timestamp": "2025-09-17T13:00:00.000000",
"total_urls": 150,
"categorized_urls": {
"known": ["https://example.com/page1", ...],
"hidden": ["https://example.com/api/secret", ...],
"internal": ["https://example.com/about", ...],
"external": ["https://cdn.example.com/script.js", ...]
},
"categorized_extensions": {
"javascript": {"js": 25, "jsx": 5},
"html": {"html": 10, "xml": 2},
"css": {"css": 15, "scss": 3},
"images": {"png": 45, "jpg": 30},
"documents": {"pdf": 5, "txt": 8},
"archives": {"zip": 2, "gz": 1},
"media": {"mp4": 3, "mp3": 7},
"other": {"json": 12, "xml": 8}
},
"all_urls": ["https://example.com/", ...],
"statistics": {
"total_known": 120,
"total_hidden": 15,
"total_internal": 135,
"total_external": 15,
"total_extensions": 89
}
}
```

The extraction pipeline proceeds as follows:

- Wayback Machine: Queries the Internet Archive for historical URLs
- Common Crawl: Fetches URLs from the Common Crawl index
- Website Crawling: Performs breadth-first crawling of the target site
- JavaScript Analysis: Parses JS files to extract embedded URLs (see the sketch after this list)
- URL Normalization: Converts relative URLs to absolute URLs
- Validation & Filtering: Removes false positives and duplicates
- Categorization: Classifies URLs by type and source
- Extension Analysis: Groups file extensions by category
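A sketch of the JavaScript-analysis and normalization steps; the regex is a simplified illustration of pulling quoted endpoints out of JS source, not the tool's exact pattern, and `extract_js_urls` is a hypothetical helper name:

```python
import re
from urllib.parse import urljoin

# Match absolute URLs and root-relative paths inside quoted JS strings (simplified).
URL_IN_JS = re.compile(r"""["'](https?://[^"'\s]+|/[A-Za-z0-9_\-./?=&%]+)["']""")

def extract_js_urls(js_source, base_url):
    """Extract URLs embedded in JavaScript and normalize relative ones against the base URL."""
    urls = set()
    for match in URL_IN_JS.findall(js_source):
        # urljoin leaves absolute URLs untouched and resolves relative paths.
        urls.add(urljoin(base_url, match))
    return sorted(urls)

# extract_js_urls('fetch("/api/v1/users"); var cdn = "https://cdn.example.com/a.js";',
#                 "https://example.com")
# -> ["https://cdn.example.com/a.js", "https://example.com/api/v1/users"]
```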
```bash
# Quick scan with WAF bypass and verbose output
python url_extractor.py https://bugcrowd.com --waf-bypass --delay 1.5 -v
```

```bash
# Deep crawl with JSON output and a custom log file
python url_extractor.py https://example.com \
  -v -p 500 -d 5 \
  -o results.json \
  --log-file custom_scan.log
```

```bash
# Lightweight scan without colored output
python url_extractor.py https://example.com \
  -p 50 -d 2 -t 5 \
  --no-color
```

```bash
# Aggressive security scan
python url_extractor.py https://target.com \
  --waf-bypass \
  --delay 3.0 \
  --max-pages 1000 \
  --max-depth 5 \
  --threads 30 \
  --verbose \
  --output security_scan.json
```

```bash
# Filter specific file types and use multiple output formats
python url_extractor.py https://site.com \
  --exclude-extensions pdf,doc,zip,rar \
  --include-only api \
  --csv --xml --json \
  --proxy http://127.0.0.1:8080 \
  --user-agent "Custom Security Scanner v1.0"

# Quiet mode for automation with custom logging
python url_extractor.py https://target.com \
  --quiet \
  --stats-only \
  --log-file /var/log/url_scan.log \
  --exclude-pattern "\.(jpg|png|gif)$" \
  --concurrency 10 \
  --timeout 60
```

Installation from PyPI:

```bash
pip install url-extractor-pro
url-extractor-pro https://example.com --waf-bypass
```

Dependencies:

- `requests>=2.25.0`: HTTP client for web requests
- `beautifulsoup4>=4.9.0`: HTML parsing
- `lxml>=4.6.0`: XML/HTML parser (optional, for better performance)
The tool automatically creates detailed, timestamped log files:

```text
2025-09-17 14:07:08,349 - INFO - Starting URL Extractor v5.0 by ArkhAngelLifeJiggy
2025-09-17 14:07:08,350 - INFO - Target: https://bugcrowd.com
2025-09-17 14:08:32,058 - INFO - Wayback Machine: 201482 valid URLs found
2025-09-17 14:08:32,974 - INFO - Common Crawl: 0 valid URLs found
2025-09-17 14:08:35,831 - INFO - Live Crawling: 140 valid URLs found
```
- User Agent Rotation: Cycles through 5+ realistic browser signatures
- IP Spoofing: Randomized X-Forwarded-For and X-Real-IP headers
- Smart Delaying: Configurable delays with randomization to avoid detection (see the sketch after this list)
- Header Randomization: Varies cache-control, connection, and other headers
- Request Pattern Variation: Mimics human browsing behavior
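A sketch of the randomized delay and retry-with-backoff behavior; the jitter range, backoff factor, and `polite_get` helper are illustrative assumptions:

```python
import random
import time
import requests

def polite_get(url, base_delay=1.0, retries=3, timeout=30):
    """GET with a jittered inter-request delay and exponential backoff on failures."""
    for attempt in range(retries + 1):
        # Randomize the delay so request timing does not form a detectable pattern.
        time.sleep(base_delay * random.uniform(0.5, 1.5))
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... between retries.
            time.sleep(2 ** attempt)
```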
- False Positive Detection: Automatically filters out:
  - `data:` URLs
  - `javascript:` URLs
  - `mailto:` links
  - `tel:` links
  - Fragment-only URLs (`#`)
  - Chrome/Safari internal URLs
- Duplicate Prevention: MD5 hash-based deduplication ensures no repeated URLs
- URL Validation: Ensures all URLs have valid schemes and netloc (see the sketch after this list)
- Domain Filtering: Respects same-domain boundaries for internal/external classification
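A sketch of scheme/netloc validation and internal-vs-external classification with `urllib.parse`; `classify_url` is a hypothetical helper, and the subdomain toggle is an assumption based on the domain-filtering description:

```python
from urllib.parse import urlparse

def classify_url(url, target_domain, include_subdomains=False):
    """Return 'internal', 'external', or None for URLs that fail validation."""
    parsed = urlparse(url)
    # URL validation: require an http(s) scheme and a network location.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    host = parsed.netloc.split(":")[0].lower()
    target = target_domain.lower()
    # Domain filtering: same host is internal; subdomains only if enabled.
    if host == target or (include_subdomains and host.endswith("." + target)):
        return "internal"
    return "external"

# classify_url("https://example.com/about", "example.com")        -> "internal"
# classify_url("https://cdn.example.com/script.js", "example.com") -> "external"
```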
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit issues and pull requests.
This tool is for educational and research purposes only. Always respect website terms of service and robots.txt files when crawling websites. Use responsibly and ethically.