## Summary
Create comprehensive benchmarks comparing all cursor types with focus on:
- Result set fetching performance
- Memory usage with different chunk options
- Comparison with AWS Wrangler
## Background
Related issues have highlighted the need for clearer performance guidance:
- #61 (larger than memory queries?): questions about handling larger-than-memory queries
- #254 (SQLAlchemy + Pandas very slow when compared to AWS Wrangler): performance comparison with AWS Wrangler
The existing benchmarks (benchmarks/20180915/, benchmarks/20220201/) are outdated and don't cover the full range of cursor options now available.
## Benchmark Scope
### Cursors to Test
- Cursor / DictCursor
- PandasCursor (with/without chunksize)
- ArrowCursor (arraysize, unload options)
- PolarsCursor (with/without chunksize)
- S3FSCursor
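One way to keep the benchmark scripts tidy is to enumerate these cursor/option combinations as plain data before any AWS call is made. A minimal sketch; the dotted module paths and the `chunksize`/`arraysize` values are assumptions based on PyAthena's package layout and should be verified against the installed version:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BenchmarkCase:
    """One cursor configuration to benchmark."""
    label: str               # human-readable name used in result tables
    cursor_class: str        # dotted path to the cursor class (assumed, verify)
    cursor_kwargs: dict = field(default_factory=dict)

# Cursor types from the issue; option values are illustrative assumptions.
CASES = [
    BenchmarkCase("Cursor", "pyathena.cursor.Cursor"),
    BenchmarkCase("DictCursor", "pyathena.cursor.DictCursor"),
    BenchmarkCase("PandasCursor", "pyathena.pandas.cursor.PandasCursor"),
    BenchmarkCase("PandasCursor(chunked)", "pyathena.pandas.cursor.PandasCursor",
                  {"chunksize": 100_000}),
    BenchmarkCase("ArrowCursor", "pyathena.arrow.cursor.ArrowCursor",
                  {"arraysize": 10_000}),
    BenchmarkCase("ArrowCursor(unload)", "pyathena.arrow.cursor.ArrowCursor",
                  {"unload": True}),
    BenchmarkCase("PolarsCursor", "pyathena.polars.cursor.PolarsCursor"),
    BenchmarkCase("PolarsCursor(chunked)", "pyathena.polars.cursor.PolarsCursor",
                  {"chunksize": 100_000}),
    # S3FSCursor module path is a guess; confirm before running.
    BenchmarkCase("S3FSCursor", "pyathena.s3fs.cursor.S3FSCursor"),
]

for case in CASES:
    print(case.label, case.cursor_kwargs)
```

Keeping the matrix as strings (rather than imports) lets the harness skip cursor types that are unavailable in a given PyAthena install.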
### Metrics to Measure
| Category | Metrics |
|---|---|
| Speed | Query execution time, result set fetch time |
| Memory | Peak memory usage, memory with chunk options |
| Comparison | Side-by-side with AWS Wrangler |
| Scale | Small, medium, large dataset behavior |
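The Speed and Memory columns can be collected with a generic wrapper around each fetch. A minimal sketch using only the standard library; the `fetch` callable stands in for any cursor's fetch path and is an assumption, not PyAthena API:

```python
import time
import tracemalloc
from typing import Any, Callable

def measure(fetch: Callable[[], Any]) -> dict:
    """Run one fetch, returning wall-clock time and peak Python memory.

    Note: tracemalloc only sees Python-level allocations; native buffers
    held by pandas/Arrow/Polars also need an RSS-based measurement
    (e.g. memory_profiler or psutil) for a fair comparison.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = fetch()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "rows": len(result) if hasattr(result, "__len__") else None,
        "fetch_seconds": elapsed,
        "peak_mib": peak / 2**20,
    }

# Example with a stand-in fetch that materializes one million rows:
stats = measure(lambda: list(range(1_000_000)))
print(stats)
```

The same wrapper can time an `awswrangler.athena.read_sql_query` call to fill the side-by-side comparison row.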
### Data Source
Use the public PyPI download statistics dataset from BigQuery, which can be exported to S3 for Athena queries. This provides realistic, reproducible test data at various scales.
Reference: https://console.cloud.google.com/marketplace/product/gcp-public-data-pypi/pypi
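Once the PyPI data is exported to S3, the Scale tiers can be driven by templating the Athena query. A minimal sketch; the table name `pypi_downloads`, the column names, and the row counts are illustrative assumptions, not part of the actual export schema:

```python
# Scale tiers for the benchmark; row counts are illustrative assumptions.
SCALES = {"small": 10_000, "medium": 1_000_000, "large": 100_000_000}

def build_query(table: str, limit: int) -> str:
    """Build an Athena query that caps the result set at `limit` rows."""
    # Column names assumed for illustration; adjust to the exported schema.
    return f"SELECT project, timestamp FROM {table} LIMIT {limit}"

for name, limit in SCALES.items():
    print(name, build_query("pypi_downloads", limit))
```

Running every `BenchmarkCase` against every scale tier yields the full grid of measurements for the comparison tables.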
## Expected Deliverables
- Updated benchmark scripts in the benchmarks/ directory
- Documentation with:
  - Performance comparison tables
  - Memory usage guidance
  - Recommendations for cursor selection based on use case
- README explaining how to reproduce the benchmarks
## Notes
Data preparation is required before implementation. The benchmark results should help users choose the appropriate cursor for their use case and understand trade-offs.