A Bash script that downloads the HTML content of a specified URL and parses it to extract URLs, `src` attributes, and `href` attributes. The results are organized and saved into a dedicated directory named after the target site.
- HTML Downloader: Fetches the raw HTML.
- Attribute Extraction: Parses the HTML to identify and extract (see the sketch below):
  - Full URLs (`http` and `https` patterns).
  - Source attributes (`src="..."`).
  - Hyperlink references (`href="..."`).
  - DNS names found in the page.
  - IP addresses (IPv4 and IPv6).
- Output Directory: A directory named after the target domain is created to store the downloaded HTML and the extracted lists.
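The script's internals are not reproduced in this README; below is a minimal, simplified sketch of what a fetch-and-extract pipeline like this can look like in Bash. The regular expressions and the directory-naming rule here are assumptions, not the project's actual code.

```bash
#!/usr/bin/env bash
# Simplified sketch of a fetch-and-extract pipeline (NOT the actual
# htmlparser implementation; regexes and the naming rule are assumptions).
set -eu

url="$1"
# Derive a directory name from the host, e.g. https://www.google.com -> google.
domain=$(printf '%s\n' "$url" | sed -E 's#^https?://(www\.)?([^/:]+).*#\2#' | cut -d. -f1)

mkdir -p "$domain"
html="$domain/$domain.html"
curl -sL "$url" -o "$html"   # download the raw HTML

# Full http/https URLs (loose pattern).
grep -Eo 'https?://[^"<> ]+' "$html" | sort -u > "$domain/found_urls.txt"

# src="..." and href="..." attribute values.
grep -Eo 'src="[^"]*"'  "$html" | sort -u > "$domain/found_srcs.txt"
grep -Eo 'href="[^"]*"' "$html" | sort -u > "$domain/found_hrefs.txt"

# IPv4 addresses (loose pattern; also matches some invalid octets).
grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$html" | sort -u > "$domain/found_ipv4.txt"
```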
```bash
git clone https://github.com/andrerodrig/htmlparser.git
cd htmlparser
```

Install with `./install.sh`:
```bash
chmod +x install.sh
./install.sh
```
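The contents of `install.sh` are not shown in this README; a minimal installer along these lines would simply place the script on the `PATH` (the script file name and destination directory below are assumptions):

```bash
#!/usr/bin/env bash
# Hypothetical installer sketch: copy the script to a directory on PATH
# and make it executable. /usr/local/bin is an assumed destination.
set -e
sudo install -m 0755 htmlparser /usr/local/bin/htmlparser
echo "htmlparser installed to /usr/local/bin"
```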
To uninstall, run `./uninstall.sh`:

```bash
chmod +x uninstall.sh
./uninstall.sh
```

Usage:

```bash
htmlparser <URL>
```
Example:
```bash
htmlparser https://www.google.com
```

This will create a directory named `google` in the current working directory, containing:
- `google.html`: The downloaded HTML content.
- `found_urls.txt`: A list of all extracted URLs.
- `found_srcs.txt`: A list of all extracted `src` attributes.
- `found_hrefs.txt`: A list of all extracted `href` attributes.
- `found_dnses.txt`: A list of all DNS names found.
- `found_ipv4.txt`: A list of all IPv4 addresses found.
- `found_ipv6.txt`: A list of all IPv6 addresses found.
- `found_servers.txt`: Other server names found in the page.
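The result files are plain text, so standard shell tools can give a quick summary of a run; for example:

```bash
# Count how many entries each extraction pass produced.
wc -l google/found_*.txt

# Preview the first few extracted URLs.
head google/found_urls.txt
```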