Open
Labels: t-tooling (issues with this label are in the ownership of the tooling team)
Summary
Once we drop Python 3.10 support, we can adopt asyncio.TaskGroup (Python 3.11+) for structured concurrency. This issue tracks places in the codebase where TaskGroup would eliminate manual task creation, cancellation, and cleanup boilerplate.
High Priority
1. Autoscaled pool orchestration
- File: `src/crawlee/_autoscaling/autoscaled_pool.py` (~lines 121-149)
- Pattern: Complex manual task creation, cascading cancellation, and ~20 lines of cleanup boilerplate in `run()`.
- Win: `TaskGroup` handles cascading cancellation and cleanup automatically, eliminating most of the `try/except/finally` block.
2. Wait utility
- File: `src/crawlee/_utils/wait.py` (lines 67-86)
- Pattern: `wait_for_all_tasks_for_finish` reimplements what `TaskGroup` provides: waiting for tasks with manual cancellation and exception handling in `finally`.
- Win: Much of this function's logic becomes unnecessary.
3. Playwright infinite scroll helper
- File: `src/crawlee/crawlers/_playwright/_utils.py` (lines 68-78)
- Pattern: `create_task` + `try/finally` with manual cancel and `suppress(CancelledError)`.
- Win: Textbook `TaskGroup` context manager pattern, with automatic cleanup on exit.
4. Recurring task stop
- File: `src/crawlee/_utils/recurring_task.py` (line 73)
- Pattern: Manual `cancel()` + `gather(..., return_exceptions=True)` in `stop()`.
- Win: `TaskGroup` context manager exit handles this automatically.
Medium Priority
5. Event manager listener tasks
- File: `src/crawlee/events/_event_manager.py` (lines 182-202, 257-266)
- Pattern: Individual listener task creation/cleanup in `listener_wrapper`, and gathering all listener tasks in `close` with exception logging.
- Win: Structured concurrency simplifies both spots.
6. Error snapshotter
- File: `src/crawlee/statistics/_error_snapshotter.py` (lines 50-59)
- Pattern: Conditional task building with a mutable list, then `gather`.
- Win: Cleaner conditional task creation inside a `TaskGroup` context.
7. Sitemap request loader lifecycle
- File: `src/crawlee/request_loaders/_sitemap_request_loader.py` (lines 159, 348-359)
- Pattern: Background task creation + manual abort with cancel and `suppress(CancelledError)`.
- Win: Task lifecycle tied to a `TaskGroup` context.
8. Request queue batch processing
- File: `src/crawlee/storages/_request_queue.py` (lines 201-207)
- Pattern: Background batch task with a done-callback for self-removal from a list.
- Win: `TaskGroup` manages the lifecycle more cleanly, eliminating the callback.
Low Priority
9. Robots.txt handling (3 crawlers)
- Files:
  - `src/crawlee/crawlers/_basic/_basic_crawler.py` (lines 826-829)
  - `src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py` (lines 234-237)
  - `src/crawlee/crawlers/_playwright/_playwright_crawler.py` (lines 440-443)
- Pattern: `[create_task(...) for request in skipped]` + `gather(*)`, repeated identically 3 times.
- Win: Minor simplification with `TaskGroup`; also a chance to deduplicate.
10. Browser pool page creation
- File: `src/crawlee/browsers/_browser_pool.py` (lines 284-285)
- Pattern: Simple `gather` for creating pages concurrently.
- Win: Minor: slightly more idiomatic.
Caveat
`TaskGroup` propagates exceptions from child tasks as an `ExceptionGroup`, and a failing child also causes the remaining tasks in the group to be cancelled, which `gather` does not do. Several of these patterns intentionally suppress or log exceptions (e.g., `return_exceptions=True` in `gather`, the event manager's exception swallowing). Those spots will need `except*` syntax or per-task exception handling during migration.