Description
Is your feature request related to a problem or challenge?
While looking at traces for the morsel-driven scan, @Dandandan and I noticed that a substantial amount of work is potentially being done on other blocking threads just to read data.
I was thinking that for local files, we might do substantially better using mmap / the kernel page cache.
Describe the solution you'd like
Specifically, a new wrapper for ObjectStore that memory-maps files when they are opened, and then provides a way to read from those memory-mapped files (using zero-copy Bytes::slice).
I am not sure it would actually be faster -- so the first thing to do would be to code it up, try it out, and see how fast we can get it.
Describe alternatives you've considered
Use the memmap2 crate as shown in this example: https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs
Open the file, memory-map it, and then turn it into Bytes that are zero-copy sliced when a range is requested. We will eventually implement a cap / LRU for the number of open mmaps, but to start we can just keep all mmapped files open and see how that goes.
https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs#L46-L47
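As a rough illustration of the "keep everything open and hand out zero-copy slices" idea above, here is a std-only Rust sketch. It is not a memory map: `Arc<[u8]>` stands in for the mapped region and the hypothetical `SharedSlice` stands in for `Bytes`; a real version would presumably create the buffer with the memmap2 crate and wrap it in `Bytes`, as the linked example does. `FileBufferCache`, `SharedSlice`, and `get_range` are made-up names, and nothing is ever evicted, matching the "keep them all open" starting point.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::ops::Range;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// Cheap, cloneable view into a shared buffer (hypothetical stand-in for `bytes::Bytes`).
#[derive(Clone, Debug)]
pub struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    /// Borrow the viewed bytes without copying them.
    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

/// Hypothetical cache that keeps one shared buffer per opened file and never evicts.
#[derive(Default)]
pub struct FileBufferCache {
    files: Mutex<HashMap<PathBuf, Arc<[u8]>>>,
}

impl FileBufferCache {
    /// Return the shared buffer for `path`, loading it on first use.
    fn open(&self, path: &Path) -> io::Result<Arc<[u8]>> {
        // The lock is held while loading, which is simple but serializes first opens.
        let mut files = self.files.lock().unwrap();
        if let Some(buf) = files.get(path) {
            return Ok(Arc::clone(buf));
        }
        // Assumption: a real implementation would mmap the file here instead of
        // reading it into memory.
        let buf: Arc<[u8]> = fs::read(path)?.into();
        files.insert(path.to_path_buf(), Arc::clone(&buf));
        Ok(buf)
    }

    /// Return a view of `range` within the file; only the `Arc` is cloned.
    pub fn get_range(&self, path: &Path, range: Range<usize>) -> io::Result<SharedSlice> {
        let buf = self.open(path)?;
        if range.start > range.end || range.end > buf.len() {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "range out of bounds",
            ));
        }
        Ok(SharedSlice { buf, range })
    }
}
```

Repeated range reads of the same file only clone an `Arc` and never copy the data again, which is the property an mmap-backed store would rely on.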
Then we will wire this into datafusion-cli here so it is used for file URLs:
https://github.com/alamb/datafusion/blob/7ef62b988d19c75e737b57f1491cfc1cd9222466/datafusion-cli/src/object_storage.rs#L567-L566
I think the mmap file object store should not wrap another object store; instead, it could implement all functions directly, using LocalFileSystem as an example of how to implement the various methods if needed.
Additional context
No response