What happens?
`DuckDBPyRelation::ToArrowCapsule` consumes roughly 2x the memory of `ToArrowTableInternal`: it fully materializes the result before constructing the Arrow table, rather than leveraging a `PhysicalArrowCollector`.
Edit 1: the ADBC driver has the same behavior: `...adbc... cur.execute(query).fetch_arrow_table()` consumes more than 2x the memory of `conn.sql(query).fetch_arrow_table()`.
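For context, the two paths can be compared directly from Python. A minimal sketch (same query as the reproduction below): `pa.table(rel)` goes through the Arrow C stream capsule (`ToArrowCapsule`), while `rel.fetch_arrow_table()` goes through `ToArrowTableInternal`.

```python
import duckdb
import pyarrow as pa

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"

# Capsule path: pa.table() consumes rel.__arrow_c_stream__() -> ToArrowCapsule,
# which currently materializes the full result first (~2x peak memory).
rel = duckdb.sql(query)
table_via_capsule = pa.table(rel)

# Internal path: fetch_arrow_table() -> ToArrowTableInternal, which peaks at
# roughly the size of the resulting table.
table_via_fetch = duckdb.sql(query).fetch_arrow_table()
```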
Some rough test results (edit: for a 1GB Arrow Table):

| Test | Time | Peak | Table Size |
|------|------|------|------------|
| Unchanged - `ExecuteOrThrow();` | 5.818s | 2.36GB | 1.08GB |
| Streaming - `ExecuteOrThrow(true);` | 4.682s | 1.22GB | 1.08GB |
| PhysicalArrowCollector + Stream (code not shown) | 3.937s | 1.23GB | 1.08GB |
| FetchArrowTable (see `ToArrowTableInternal`) | 4.447s | 1.17GB | 1.08GB |
Streaming
The simplest change is to switch the materialization to a streaming result - `ExecuteOrThrow(true);` in duckdb-python/src/duckdb_py/pyrelation.cpp (line 967 in bbd4389). This alone halves the peak memory consumption.
PhysicalArrowCollector
A more complete change would be to refactor `ToArrowCapsule` and `ToArrowTableInternal` so that both use common `PhysicalArrowCollector` code. A light wrapper to construct an `ArrowArrayStream` would be needed in `DuckDBPyResult::FetchArrowCapsule`.
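For illustration, the end state this refactor targets can be sketched from the Python side: the stream handed out by `FetchArrowCapsule` should be consumable batch by batch without the full result ever being resident. A hedged sketch, assuming a recent pyarrow that exposes `RecordBatchReader.from_stream` for `__arrow_c_stream__` objects:

```python
import duckdb
import pyarrow as pa

rel = duckdb.sql("SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0")

# Wrap the relation's Arrow C stream capsule in a RecordBatchReader and consume
# it incrementally; with a streaming ToArrowCapsule only a handful of batches
# should be in memory at any point, instead of the fully materialized result.
reader = pa.RecordBatchReader.from_stream(rel)
for batch in reader:
    ...  # process each RecordBatch, then let it be freed
```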
Related optimization - ToPolars
I noticed this when looking at `ToPolars` memory consumption in duckdb/duckdb#19356. If the above issue is solved, Polars dataframe construction can be done directly, without the `ToArrowTableInternal` step at all.
A simple change like the one below (duckdb-python/src/duckdb_py/pyrelation.cpp, lines 976 to 979 in bbd4389) would significantly improve memory use and performance:
```cpp
if (!lazy) {
	auto arrow = ToArrowTableInternal(batch_size, true);
	return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}
```
=>
```cpp
auto polars_module = pybind11::module_::import("polars");
return py::cast<PolarsDataFrame>(polars_module.attr("DataFrame")(*this));
```
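At the Python level this corresponds to handing the relation straight to Polars, which can consume it through the Arrow PyCapsule interface. A hedged sketch, assuming a Polars version whose `DataFrame` constructor accepts `__arrow_c_stream__` objects:

```python
import duckdb
import polars as pl

rel = duckdb.sql("SELECT r, r||'foo' AS foo FROM range(1000000) t(r)")

# Polars reads the relation via rel.__arrow_c_stream__(), so no intermediate
# pyarrow.Table has to be built on the DuckDB side.
df = pl.DataFrame(rel)
```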
To Reproduce
```python
import os
import time
from resource import getrusage, RUSAGE_SELF

import duckdb
import pyarrow as pa

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"

start = time.perf_counter()
rel = duckdb.sql(query)
table = pa.table(rel)  # Direct import from Arrow C Stream capsule
elapsed = time.perf_counter() - start

peak_gb = getrusage(RUSAGE_SELF).ru_maxrss * 1024 / (1024**3)
table_gb = table.nbytes / (1024**3)
print(f"Time: {elapsed:.3f}s Peak: {peak_gb:.2f}GB Table: {table_gb:.2f}GB")
```
OS:
Linux
DuckDB Package Version:
1.4.1, 1.5.0+dev110
Python Version:
3.13
Full Name:
Paul T
Affiliation:
Iqmo
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a source build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?