feat(utils): add `discoverValidSitemaps` utility by foxt451 · Pull Request #3339 · apify/crawlee

foxt451 · 2026-01-15T13:04:52Z

Related to apify/actor-scraper#220. I'm developing a generic sitemap scraper, and it's going to share a big utility function (the main chunk of logic) with wcc: `discoverValidSitemaps`. I asked @barjin if I could factor it out, and he said this util could fit into crawlee. It's mainly copied from wcc, but to keep the dependencies unchanged, it uses got-scraping instead of impit to check whether a URL exists (I think it doesn't matter for sitemaps), and `urlExists` is inlined (until we add an HTTP client to these utils in v4, as @barjin told me). It's also turned into an async generator. Let me know if you see a better place for this util.
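
For readers unfamiliar with the shape described above, here is a minimal sketch of an async generator that yields sitemap URLs as they are confirmed. It is not the code merged in this PR: the name `discoverSitemapsSketch`, the injected `urlExists` callback, and the fixed list of candidate paths are assumptions made for illustration, and the real utility also consults `robots.txt` and the input URLs themselves.

```typescript
// Hypothetical sketch, not the implementation from this PR.
// The existence check is injected so the HTTP client can be swapped out.
type UrlExists = (url: string) => Promise<boolean>;

async function* discoverSitemapsSketch(
    urls: string[],
    urlExists: UrlExists,
): AsyncGenerator<string> {
    // Collect unique origins so each site is probed only once.
    const origins = new Set<string>();
    for (const url of urls) {
        origins.add(new URL(url).origin);
    }

    for (const origin of origins) {
        // Assumed well-known locations; the real utility checks more sources.
        for (const path of ['/sitemap.xml', '/sitemap.txt']) {
            const candidate = `${origin}${path}`;
            // Yield as soon as a candidate is confirmed, instead of
            // waiting for every probe to finish.
            if (await urlExists(candidate)) {
                yield candidate;
            }
        }
    }
}
```

A consumer can iterate with `for await (const sitemapUrl of discoverSitemapsSketch(urls, urlExists))` and start processing the first sitemap before the remaining sites have been checked, which is the main practical benefit of the generator form.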

barjin

lgtm @foxt451, thank you!

I'll let @janbuchar have a second look as the original author of this, but if it's a direct port from WCC, I think we can merge safely.

I have just these nits:

barjin · 2026-01-16T10:38:10Z

+    it('extracts sitemap from robots.txt', async () => {
+        nock('http://sitemap-discovery.com')
+            .get('/robots.txt')
+            .reply(200, 'Sitemap: http://sitemap-discovery.com/sitemap.xml')


Can we change this so the robots.txt-referenced sitemap is not the well-known /sitemap.xml? This example passes even if robots.txt is missing (see test below).
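
To illustrate the point of this comment: if the `robots.txt` fixture points at a path other than the default `/sitemap.xml`, the expected URL can only come from parsing `robots.txt`. The sketch below assumes a hypothetical helper `extractSitemapUrls` and a hypothetical path `/custom-sitemap.xml`; neither is taken from the PR.

```typescript
// Hypothetical helper, for illustration only: collects the URLs from
// `Sitemap:` directives in a robots.txt body (directive name is case-insensitive).
function extractSitemapUrls(robotsTxt: string): string[] {
    const urls: string[] = [];
    for (const line of robotsTxt.split(/\r?\n/)) {
        const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
        if (match) {
            urls.push(match[1]);
        }
    }
    return urls;
}

// A fixture that does not coincide with the well-known /sitemap.xml location.
const robotsFixture = [
    'User-agent: *',
    'Sitemap: http://sitemap-discovery.com/custom-sitemap.xml',
].join('\n');

const referencedSitemaps = extractSitemapUrls(robotsFixture);
```

With such a fixture, a test asserting on `/custom-sitemap.xml` would fail if the `robots.txt` lookup were skipped, whereas the original fixture would still pass through the default `/sitemap.xml` probe.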

barjin · 2026-01-16T10:43:54Z

+
+/**
+ * Given a list of URLs, discover related sitemap files for these domains by checking the `robots.txt` file,
+ * the default `sitemap.xml` file and the URLs themselves.


nit: this doesn't mention sitemap.txt

Related to https://github.com/apify/apify-sdk-js/issues/486. I'm [developing a generic sitemap scraper](apify/actor-scraper#205), and it's going to share a big utility function (the main chunk of logic) with wcc: `discoverValidSitemaps`. I asked @barjin if I could factor it out, and he said this util could fit into crawlee. It's mainly copied from wcc, but to keep the dependencies unchanged, it uses got-scraping instead of impit to check whether a URL exists (I think it doesn't matter for sitemaps), and `urlExists` is inlined (until we add an HTTP client to these utils in v4, as @barjin told me). It's also turned into an async generator. Let me know if you see a better place for this util.

Rebases #3339 and #3370 on top of `v4` and adds `HttpClient` support for `discoverValidSitemaps`. Related to the discussion under apify/actor-scraper#214.

Co-authored-by: Sviatozar Petrenko <svpetrenko123@gmail.com>

foxt451 added 5 commits January 15, 2026 15:01

feat(utils): add discoverValidSitemaps utility

d5cec84

fix: remove circular deps

8c3de80

fix: remove unused imports

8c20634

fix: use reduce instead of groupBy for Nodev18

aa2161c

chore: formatting

6ef28e3
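
The commit "fix: use reduce instead of groupBy for Nodev18" above presumably replaces a built-in grouping call, since `Object.groupBy` is not available in Node 18. A minimal sketch of grouping with `reduce` follows; the function name `groupByOrigin` and the choice of grouping key are assumptions for illustration, not the code from that commit.

```typescript
// Hypothetical sketch: group URLs by origin using Array.prototype.reduce,
// which works on runtimes that lack Object.groupBy.
function groupByOrigin(urls: string[]): Record<string, string[]> {
    return urls.reduce<Record<string, string[]>>((groups, url) => {
        const origin = new URL(url).origin;
        // Create the bucket on first use, then append.
        if (!groups[origin]) {
            groups[origin] = [];
        }
        groups[origin].push(url);
        return groups;
    }, {});
}
```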

foxt451 marked this pull request as ready for review January 16, 2026 10:31

foxt451 requested a review from B4nan January 16, 2026 10:31

barjin reviewed Jan 16, 2026

chore: pr remarks

7388f72

foxt451 requested review from janbuchar and removed request for B4nan January 16, 2026 18:34

nicklamonov requested a review from nikitachapovskii-dev January 19, 2026 08:21

nicklamonov added the t-tooling label (Issues with this label are in the ownership of the tooling team.) Jan 19, 2026

nicklamonov assigned foxt451 Jan 19, 2026

foxt451 merged commit 29f52ed into master Jan 19, 2026
9 checks passed

foxt451 deleted the feat/discover-valid-sitemaps branch January 19, 2026 13:46

barjin mentioned this pull request Feb 6, 2026

chore: add discoverValidSitemaps utility #3392

Merged

foxt451 merged 6 commits into master from feat/discover-valid-sitemaps

foxt451 commented Jan 15, 2026 •

edited

Loading

Uh oh!

barjin left a comment

Uh oh!

barjin Jan 16, 2026

Uh oh!

barjin Jan 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

foxt451 commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

barjin left a comment

Choose a reason for hiding this comment

Uh oh!

barjin Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

barjin Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

foxt451 commented Jan 15, 2026 •

edited

Loading