The HTML scraper passes `allow_fragments=False` to `urllib.parse.urljoin` (via `wpull.scraper.util.urljoin_safe` and `wpull.url.urljoin`). This causes `urllib.parse` to not treat `#` as a special character, instead including it in the path or query. As a result, for `<a href="#baz">`, it treats `#baz` as a relative path, which produces the wrong URL:
```python
>>> wpull.scraper.util.urljoin_safe('https://example.org/foo/bar.html', '#baz', allow_fragments = False)
'https://example.org/foo/#baz'
>>> wpull.scraper.util.urljoin_safe('https://example.org/foo/bar.html?quuz', '#baz', allow_fragments = False)
'https://example.org/foo/#baz'
```
This is relatively minor. The only potential impact is that it might add a URL to the queue that wouldn't otherwise be discovered (effectively a very weak form of the parent directories part of #78).
The intent behind setting that flag may have been to handle arbitrary schemes that might not have fragments. But fragments have been part of the generic URI syntax (RFC 3986) for over 20 years now, and `#` was already listed as an unsafe character for the same reason in RFC 1738 from 1994. It seems safe to always accept fragments.
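The behavior can be reproduced with `urllib.parse.urljoin` alone, without wpull: dropping `allow_fragments=False` (the parameter's default is `True`) yields the expected result. A minimal sketch:

```python
from urllib.parse import urljoin

base = 'https://example.org/foo/bar.html'

# With allow_fragments=False, '#baz' is parsed as a relative path,
# so it replaces the last path segment of the base URL.
print(urljoin(base, '#baz', allow_fragments=False))
# → https://example.org/foo/#baz

# With the default (allow_fragments=True), '#baz' is recognized as a
# fragment and attached to the base URL unchanged.
print(urljoin(base, '#baz'))
# → https://example.org/foo/bar.html#baz
```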