Adopt asyncio.TaskGroup for structured concurrency #1722

@vdusek

Summary

Once we drop Python 3.10 support, we can adopt asyncio.TaskGroup (Python 3.11+) for structured concurrency. This issue tracks places in the codebase where TaskGroup would eliminate manual task creation, cancellation, and cleanup boilerplate.

High Priority

1. Autoscaled pool orchestration

  • File: src/crawlee/_autoscaling/autoscaled_pool.py (~lines 121-149)
  • Pattern: Complex manual task creation, cascading cancellation, and ~20 lines of cleanup boilerplate in run().
  • Win: TaskGroup handles cascading cancellation and cleanup automatically, eliminating most of the try/except/finally block.

2. Wait utility

  • File: src/crawlee/_utils/wait.py (lines 67-86)
  • Pattern: wait_for_all_tasks_for_finish reimplements what TaskGroup provides — waiting for tasks with manual cancellation and exception handling in finally.
  • Win: The manual cancellation and exception handling in finally become unnecessary, since TaskGroup cancels and awaits outstanding tasks itself.

3. Playwright infinite scroll helper

  • File: src/crawlee/crawlers/_playwright/_utils.py (lines 68-78)
  • Pattern: create_task + try/finally with manual cancel and suppress(CancelledError).
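  • Win: Textbook TaskGroup context manager pattern. The group cancels the child if the body raises and awaits it on exit, so the suppress(CancelledError) boilerplate goes away. On a normal exit TaskGroup waits for its children rather than cancelling them, so a never-ending background task still needs one explicit cancel() before the block ends.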

4. Recurring task stop

  • File: src/crawlee/_utils/recurring_task.py (line 73)
  • Pattern: Manual cancel() + gather(..., return_exceptions=True) in stop().
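  • Win: TaskGroup context manager exit awaits the cancelled task, replacing the gather(..., return_exceptions=True) call. This requires tying the group's lifetime to the recurring task's start/stop lifecycle.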

Medium Priority

5. Event manager listener tasks

  • File: src/crawlee/events/_event_manager.py (lines 182-202, 257-266)
  • Pattern: Individual listener task creation/cleanup in listener_wrapper, and gathering all listener tasks in close with exception logging.
  • Win: Structured concurrency simplifies both spots.

6. Error snapshotter

  • File: src/crawlee/statistics/_error_snapshotter.py (lines 50-59)
  • Pattern: Conditional task building with a mutable list, then gather.
  • Win: Cleaner conditional task creation inside a TaskGroup context.

7. Sitemap request loader lifecycle

  • File: src/crawlee/request_loaders/_sitemap_request_loader.py (lines 159, 348-359)
  • Pattern: Background task creation + manual abort with cancel and suppress(CancelledError).
  • Win: Task lifecycle tied to a TaskGroup context.

8. Request queue batch processing

  • File: src/crawlee/storages/_request_queue.py (lines 201-207)
  • Pattern: Background batch task with a done-callback for self-removal from a list.
  • Win: TaskGroup manages the lifecycle more cleanly, eliminating the callback.

Low Priority

9. Robots.txt handling (3 crawlers)

  • Files:
    • src/crawlee/crawlers/_basic/_basic_crawler.py (lines 826-829)
    • src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py (lines 234-237)
    • src/crawlee/crawlers/_playwright/_playwright_crawler.py (lines 440-443)
  • Pattern: [create_task(...) for request in skipped] + gather(*), repeated identically 3 times.
  • Win: Minor simplification with TaskGroup; also a chance to deduplicate.

10. Browser pool page creation

  • File: src/crawlee/browsers/_browser_pool.py (lines 284-285)
  • Pattern: Simple gather for creating pages concurrently.
  • Win: Minor — slightly more idiomatic.

Caveat

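TaskGroup propagates exceptions from child tasks as an ExceptionGroup, and the first failing child also causes the remaining children to be cancelled. Several of these patterns intentionally suppress or log exceptions while letting the other tasks finish (e.g., return_exceptions=True in gather, event manager's exception swallowing). Those spots will need except* syntax, per-task exception handling, or other careful treatment during migration.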

Metadata

Labels

t-tooling: Issues with this label are in the ownership of the tooling team.
