wgetjs

Ever crawl a site and get a bunch of garbage back? Content that relies on JavaScript, pages of navigational headers, sidebars and footers... and your content just... not... there...


Crawl a site. Get the stuff you want. Every time.

wgetjs, part of the DAY50 suite of open-source tools for AI workflows, is built for JavaScript sites. It's backwards compatible with wget.

Unlike traditional wget, wgetjs uses a headless Chrome browser to render JavaScript, ensuring you capture the fully-rendered HTML that SPAs and dynamic sites generate.
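
For example, plain wget captures an SPA's page before any JavaScript has run, so you often get only the empty shell. (spa.example.com and the output shown are hypothetical, and the single-URL call assumes wgetjs mirrors wget's default behavior.)

# Plain wget: the HTML arrives before JavaScript has run
wget -qO- http://spa.example.com/
# => <html><body><div id="root"></div></body></html>

# wgetjs: headless Chrome renders the page first, so the
# JavaScript-generated content is captured
wgetjs http://spa.example.com/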

Does it work for everything?

You can do all kinds of stupid things with JS. This only covers most of them.

Usage

# Basic recursive download
wgetjs -rc -np http://somesite.com/

# With parallel workers
wgetjs -rc -np -P 4 http://somesite.com/

# Limit recursion depth
wgetjs -rc -l 3 http://somesite.com/

# Only download HTML and PDF files
wgetjs -rc -A html,pdf http://somesite.com/

# Exclude images and certain directories
wgetjs -rc -R jpg,png,gif -X /admin,/private http://somesite.com/

# Only follow links within specific domains (-D takes effect with -H)
wgetjs -rc -H -D example.com,cdn.example.com http://example.com/

# Accept only URLs matching a pattern
wgetjs -rc --accept-regex='/docs/.*' http://somesite.com/

# Convert downloaded HTML to markdown
wgetjs -rc --post='text/html:markitdown {} > {}.md' http://example.com/

# Convert images to webp and HTML to markdown
wgetjs -rc --post='image/*:convert {} {}.webp\ntext/html:markitdown {} > {}.md' http://example.com/

# Read post-processing rules from a file
wgetjs -rc --post=@post-rules.txt http://example.com/

Options

Download Control

Option                 Description
-rc                    Recursive download (follow links)
-np                    No parent — don't ascend to parent directories
-P N                   Number of parallel workers (default: 1)
-l N, --level=N        Max recursion depth (default: 5; use 0 or inf for infinite)
-w SECONDS             Wait SECONDS between retrievals
--waitretry=SECONDS    Wait 1..SECONDS between retries (backoff)
--random-wait          Wait 0.5×WAIT to 1.5×WAIT seconds (randomized)
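
For example, a polite crawl combining these flags:

# Two workers, randomized 1-3 second waits, retry backoff up to 10 seconds
wgetjs -rc -np -P 2 -w 2 --random-wait --waitretry=10 http://somesite.com/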

URL Filtering

Option                          Description
-A LIST, --accept=LIST          Accept extensions/patterns (e.g., html,pdf, *.mp3)
-R LIST, --reject=LIST          Reject extensions/patterns (e.g., jpg,png, *.[zb]p2)
--accept-regex=REGEX            Accept URLs matching regex
--reject-regex=REGEX            Reject URLs matching regex
--regex-type=TYPE               Regex type: posix (default) or pcre
-D LIST, --domains=LIST         Accepted domains (requires -H)
--exclude-domains=LIST          Rejected domains
-H, --span-hosts                Follow links to other hosts
-L, --relative                  Follow relative links only
-I LIST, --include-dirs=LIST    Include directories (wildcards supported)
-X LIST, --exclude-dirs=LIST    Exclude directories (wildcards supported)
--ignore-case                   Ignore case when matching patterns
--post=RULES                    Run post-processing commands scoped by MIME type (see below)
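
For instance, switching the regex type to pcre enables Perl shorthands like \d (the URL pattern here is illustrative):

# Only follow dated blog URLs such as /blog/2024/...
wgetjs -rc --regex-type=pcre --accept-regex='/blog/\d{4}/' http://somesite.com/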

Other

Option                    Description
-E, --adjust-extension    Append .html to filenames that don't already have an HTML extension
--method=CMD              Custom command to fetch HTML (alternative to Chrome)
-h, --help                Show help
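
-E matters most when dynamic routes have no file extension. For example (the route shown is hypothetical):

# A page served at /products/42 is saved as products/42.html
wgetjs -rc -E http://somesite.com/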

Post-Processing (--post)

Run commands on downloaded files, scoped by MIME type. Format is mimetype:command where {} is replaced with the file path.

# Single rule
--post='text/html:markitdown {} > {}.md'

# Multiple rules (separated by \n)
--post='image/*:convert {} {}.webp\ntext/html:markitdown {} > {}.md'

# Rules from a file (one per line, # comments supported)
--post=@rules.txt

MIME patterns support * globbing — image/* matches image/png, image/jpeg, etc. Use * or */* to match everything.
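
For example, a catch-all rule (sha256sum is just an illustrative command):

# Record a checksum for every downloaded file, regardless of type
--post='*/*:sha256sum {} >> checksums.txt'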

Example rules file:

# Convert HTML to markdown
text/html: markitdown {} > {}.md

# Compress all images to webp
image/*: convert {} {}.webp

Custom Fetch Method

By default, wgetjs uses headless Chrome via the Chrome DevTools Protocol. If you prefer a different renderer, use --method with a command that outputs HTML to stdout:

# URL is appended by default
wgetjs --method="lightpanda --dump markdown" http://somesite.com/

# Or use {} for explicit URL placement (like --post)
wgetjs --method="curl -s {} | tidy -q -" http://somesite.com/
wgetjs --method="playwright fetch {} --html" http://somesite.com/
