Comprehensive cursor benchmark with memory and performance metrics #644

@laughingman7743

Summary

Create comprehensive benchmarks comparing all cursor types, with a focus on:

  • Result set fetching performance
  • Memory usage with different chunk options
  • Comparison with AWS Wrangler

Background

Related issues have highlighted the need for clearer performance guidance.

The existing benchmarks (benchmarks/20180915/, benchmarks/20220201/) are outdated and don't cover the full range of cursor options now available.

Benchmark Scope

Cursors to Test

  • Cursor / DictCursor
  • PandasCursor (with/without chunksize)
  • ArrowCursor (arraysize, unload options)
  • PolarsCursor (with/without chunksize)
  • S3FSCursor
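One way to keep this scope explicit is to write the test matrix down as data before any benchmark code exists. The sketch below is only an illustration: `BenchmarkCase` and `expand` are hypothetical helpers, and the option values (`100_000`, `1_000`, and so on) are placeholders, not recommended settings. Only the cursor and option names come from the list above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkCase:
    """One cursor configuration to benchmark (hypothetical structure)."""

    cursor: str
    options: tuple = ()  # (name, value) pairs, ordered by option name

    @property
    def label(self) -> str:
        opts = ", ".join(f"{name}={value}" for name, value in self.options)
        return f"{self.cursor}({opts})"


def expand(cursor: str, **choices) -> list[BenchmarkCase]:
    """Build one case per combination of the given option values."""
    names = sorted(choices)
    combos = product(*(choices[name] for name in names))
    return [BenchmarkCase(cursor, tuple(zip(names, combo))) for combo in combos]


# Option values below are placeholders chosen for illustration only.
CASES = (
    expand("Cursor")
    + expand("DictCursor")
    + expand("PandasCursor", chunksize=[None, 100_000])
    + expand("ArrowCursor", arraysize=[1_000, 10_000], unload=[False, True])
    + expand("PolarsCursor", chunksize=[None, 100_000])
    + expand("S3FSCursor")
)

for case in CASES:
    print(case.label)
```

Keeping the matrix separate from the measurement code would let each case label be reused as a row key in the final comparison tables.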

Metrics to Measure

| Category | Metrics |
| --- | --- |
| Speed | Query execution time, result set fetch time |
| Memory | Peak memory usage, memory with chunk options |
| Comparison | Side-by-side with AWS Wrangler |
| Scale | Small, medium, large dataset behavior |
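A minimal sketch of how the speed and memory metrics could be captured together, using only the standard library. `measure_fetch` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real cursor fetch so the example runs without Athena access. Note that `tracemalloc` only tracks allocations made through Python's allocator, so native buffers used by libraries such as Arrow may not be counted; a real benchmark would probably also need process-level memory.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Measurement:
    rows: int
    seconds: float
    peak_bytes: int


def measure_fetch(fetch: Callable[[], Iterable]) -> Measurement:
    """Time a fetch callable and record peak Python-level memory.

    `fetch` is a hypothetical zero-argument callable returning an
    iterable of rows; in a real benchmark it would wrap a cursor.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        rows = sum(1 for _ in fetch())  # consume every row
        seconds = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return Measurement(rows, seconds, peak)


def fake_fetch(n: int = 10_000):
    """Stand-in for a cursor: lazily yields n small rows."""
    return ([i, str(i)] for i in range(n))


# Streaming keeps one row alive at a time; materializing holds them all.
streamed = measure_fetch(fake_fetch)
materialized = measure_fetch(lambda: list(fake_fetch()))

print(streamed.rows, streamed.peak_bytes)
print(materialized.rows, materialized.peak_bytes)
```

The streamed versus materialized comparison mirrors the "memory with chunk options" metric: the same rows are read in both runs, but the peak differs depending on how many are held at once.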

Data Source

Use the public PyPI download statistics dataset from BigQuery, which can be exported to S3 for Athena queries. This provides realistic, reproducible test data at various scales.

Reference: https://console.cloud.google.com/marketplace/product/gcp-public-data-pypi/pypi

Expected Deliverables

  1. Updated benchmark scripts in benchmarks/ directory
  2. Documentation with:
    • Performance comparison tables
    • Memory usage guidance
    • Recommendations for cursor selection based on use case
  3. README explaining how to reproduce benchmarks
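The performance comparison tables could be generated directly from the collected measurements so that the documentation stays reproducible. The sketch below assumes a hypothetical result format (a list of dicts with `label`, `seconds`, and `peak_bytes` keys); the rows shown are placeholder values, not benchmark results.

```python
def format_table(rows: list[dict]) -> str:
    """Render measurements as a Markdown table, fastest case first."""
    lines = [
        "| Case | Fetch time (s) | Peak memory (MiB) |",
        "| --- | ---: | ---: |",
    ]
    for row in sorted(rows, key=lambda r: r["seconds"]):
        mib = row["peak_bytes"] / (1024 * 1024)
        lines.append(f"| {row['label']} | {row['seconds']:.2f} | {mib:.1f} |")
    return "\n".join(lines)


# Placeholder rows for illustration only; these are not measurements.
rows = [
    {"label": "example-b", "seconds": 2.5, "peak_bytes": 3 * 1024 * 1024},
    {"label": "example-a", "seconds": 1.25, "peak_bytes": 512 * 1024},
]

table = format_table(rows)
print(table)
```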

Notes

Data preparation is required before implementation. The benchmark results should help users choose the appropriate cursor for their use case and understand trade-offs.
