Links with only a fragment are handled incorrectly #501

@JustAnotherArchivist

Description

The HTML scraper passes allow_fragments = False to urllib.parse.urljoin (via wpull.scraper.util.urljoin_safe and wpull.url.urljoin). This causes urllib.parse not to treat # as a special character, instead including it in the path or query. As a result, for an <a href="#baz">, it treats #baz as a relative path and produces the wrong URL:

```python
>>> wpull.scraper.util.urljoin_safe('https://example.org/foo/bar.html', '#baz', allow_fragments = False)
'https://example.org/foo/#baz'
>>> wpull.scraper.util.urljoin_safe('https://example.org/foo/bar.html?quuz', '#baz', allow_fragments = False)
'https://example.org/foo/#baz'
```

This is relatively minor. The only potential impact is that it might add a URL to the queue that wouldn't otherwise be discovered (effectively a very weak form of the parent directories part of #78).

The intent behind setting that flag may have been to handle arbitrary schemes that might not have fragments. But fragments have been in the general URL spec (RFC 3986) for over 20 years now, and # was already mentioned as an unsafe character for the same reason in RFC 1738 from 1994. It seems safe to always accept fragments.
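The behavior can be reproduced with the standard library alone, since urljoin_safe ultimately delegates to urllib.parse.urljoin. A minimal sketch (using only urllib.parse, not wpull) showing that the default allow_fragments=True already resolves a fragment-only reference per RFC 3986:

```python
from urllib.parse import urljoin

base = 'https://example.org/foo/bar.html'

# Default (allow_fragments=True): '#baz' is recognized as a fragment
# and attached to the base URL, as RFC 3986 reference resolution requires.
print(urljoin(base, '#baz'))
# → 'https://example.org/foo/bar.html#baz'

# With allow_fragments=False: '#baz' is parsed as a relative *path*
# segment, so it replaces the last path segment of the base instead.
print(urljoin(base, '#baz', allow_fragments=False))
# → 'https://example.org/foo/#baz'
```

This suggests that simply dropping the allow_fragments=False argument would yield the expected URL for fragment-only links.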
