
feat: add discover_valid_sitemaps utility #1777

Merged
Pijukatel merged 6 commits into apify:master from Mantisus:discover-valid-sitemaps on Mar 6, 2026
Mantisus (Collaborator) commented Mar 4, 2026

Description

  • Add discover_valid_sitemaps utility to search for sitemaps of websites for the provided URLs.

Issues

Testing

  • Add new unit tests

Copilot AI (Contributor) left a comment

Pull request overview

This PR ports/introduces a Python discover_valid_sitemaps helper to discover sitemap URLs for a set of input URLs (robots.txt sitemaps, direct sitemap URLs, and common sitemap paths), aligning with issue #1740.

Changes:

  • Add discover_valid_sitemaps() (plus internal helpers/constants) to orchestrate sitemap discovery per-hostname and deduplicate results.
  • Extend common sitemap probing to include /sitemap_index.xml and add is_status_code_successful() for status evaluation.
  • Add unit tests covering robots.txt discovery, common-path probing, input URL detection, deduplication, and multi-domain behavior.
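The discovery flow summarized above can be illustrated with a minimal, network-free sketch. It is not the code merged in this PR: the names `COMMON_SITEMAP_PATHS`, `looks_like_sitemap`, and `candidate_sitemap_urls` are hypothetical, the signature of `is_status_code_successful` is assumed, and only `/sitemap_index.xml` is confirmed by the PR as a probed path. The robots.txt lookup is left out because it requires an HTTP client.

```python
from urllib.parse import urlsplit

# Hypothetical list of common sitemap locations. Only /sitemap_index.xml is
# named in this PR; the other entries are assumptions for illustration.
COMMON_SITEMAP_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')


def is_status_code_successful(status_code: int) -> bool:
    """Treat 2xx and 3xx responses as successful (assumed signature)."""
    return 200 <= status_code < 400


def looks_like_sitemap(url: str) -> bool:
    """Heuristic: the URL path ends with one of the common sitemap names."""
    return urlsplit(url).path.lower().endswith(COMMON_SITEMAP_PATHS)


def candidate_sitemap_urls(urls: list[str]) -> list[str]:
    """Return deduplicated sitemap candidates, probing each hostname once."""
    seen: set[str] = set()
    probed_hosts: set[str] = set()
    candidates: list[str] = []

    def add(url: str) -> None:
        # Keep first-seen order while dropping duplicates.
        if url not in seen:
            seen.add(url)
            candidates.append(url)

    for url in urls:
        parts = urlsplit(url)
        if not parts.scheme or not parts.hostname:
            continue  # skip relative or malformed URLs
        if looks_like_sitemap(url):
            add(url)  # the input URL itself points at a sitemap
        if parts.hostname in probed_hosts:
            continue  # common paths were already generated for this host
        probed_hosts.add(parts.hostname)
        origin = f'{parts.scheme}://{parts.netloc}'
        for path in COMMON_SITEMAP_PATHS:
            add(origin + path)
    return candidates
```

In a real implementation each candidate would still need to be requested and kept only if the response status passes a check such as `is_status_code_successful`.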

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/crawlee/_utils/sitemap.py | Implements sitemap discovery orchestration, common-path probing, and async generator merging. |
| src/crawlee/_utils/web.py | Adds a helper to classify 2xx/3xx responses as “successful”. |
| tests/unit/_utils/test_sitemap.py | Adds unit tests for the new sitemap discovery utility with mocked HTTP behavior. |
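The "async generator merging" mentioned for `sitemap.py` is not shown in this conversation. One common way to do it, sketched here under the assumption that per-domain discovery runs as independent async generators, is to drain each generator in its own task and funnel the items through an `asyncio.Queue`. The name `merge_async_generators` is hypothetical and may not match the merged code.

```python
import asyncio
from collections.abc import AsyncIterator


async def merge_async_generators(*generators: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield items from several async generators as soon as each is ready."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking that one source is exhausted

    async def drain(generator: AsyncIterator[str]) -> None:
        try:
            async for item in generator:
                queue.put_nowait(item)
        finally:
            # Always signal completion so the consumer loop can terminate.
            # Error propagation from sources is omitted in this sketch.
            queue.put_nowait(done)

    tasks = [asyncio.create_task(drain(g)) for g in generators]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is done:
                remaining -= 1
            else:
                yield item
    finally:
        # Stop any still-running sources if the consumer exits early.
        for task in tasks:
            task.cancel()
```

Items arrive in completion order rather than input order, so a caller that needs unique results still has to deduplicate them, for example with a `set` of already yielded URLs.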

Mantisus and others added 4 commits March 4, 2026 23:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
vdusek (Collaborator) left a comment

A few comments

Mantisus requested a review from vdusek on March 6, 2026, 12:30
vdusek (Collaborator) left a comment

LGTM

Pijukatel merged commit 872447b into apify:master on Mar 6, 2026
30 checks passed
Development

Successfully merging this pull request may close these issues.

Add discover valid sitemaps utility (port from JS)

4 participants