Links with only a fragment are handled incorrectly #501

@JustAnotherArchivist

Description

The HTML scraper passes allow_fragments = False to urllib.parse.urljoin (via wpull.scraper.util.urljoin_safe and wpull.url.urljoin). This causes urllib.parse not to treat # as a special character, instead including it in the path or query. As a result, for an <a href="#baz">, it treats #baz as a relative path and produces the wrong URL:

```python
>>> wpull.scraper.util.urljoin_safe('https://example.org/foo/bar.html', '#baz', allow_fragments = False)
'https://example.org/foo/#baz'
>>> wpull.scraper.util.urljoin_safe('https://example.org/foo/bar.html?quuz', '#baz', allow_fragments = False)
'https://example.org/foo/#baz'
```

This is relatively minor. The only potential impact is that it might add a URL to the queue that wouldn't otherwise be discovered (effectively a very weak form of the parent directories part of #78).

The intent behind setting that flag may have been to handle arbitrary schemes that might not have fragments. But fragments have been in the general URL spec (RFC 3986) for over 20 years now, and # was already mentioned as an unsafe character for the same reason in RFC 1738 from 1994. It seems safe to always accept fragments.
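The behavior can be reproduced with the standard library alone, since urljoin_safe ultimately delegates to urllib.parse.urljoin. A minimal sketch (using only urllib.parse, not wpull) showing that the default allow_fragments=True already resolves a fragment-only reference per RFC 3986:

```python
from urllib.parse import urljoin

base = 'https://example.org/foo/bar.html'

# Default (allow_fragments=True): '#baz' is recognized as a fragment
# and attached to the base URL, as RFC 3986 reference resolution requires.
print(urljoin(base, '#baz'))
# → 'https://example.org/foo/bar.html#baz'

# With allow_fragments=False: '#baz' is parsed as a relative *path*
# segment, so it replaces the last path segment of the base instead.
print(urljoin(base, '#baz', allow_fragments=False))
# → 'https://example.org/foo/#baz'
```

This suggests that simply dropping the allow_fragments=False argument would yield the expected URL for fragment-only links.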
