Change datafusion-cli file access to use mmap #21159

@alamb

Is your feature request related to a problem or challenge?

While looking at traces for the morsel-driven scan, @Dandandan and I noticed that a potentially substantial amount of work is done on other blocking threads just to read data.

I was thinking that for local files, we might do substantially better using mmap / the kernel page cache

Describe the solution you'd like

Specifically, a new wrapper for ObjectStore that will memory map files when they are opened, and then provide a way to read (using zero copy Bytes::slice) from those memory mapped files.
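
To make the zero-copy read path concrete, here is a rough std-only sketch, not the proposed implementation. `SharedBuf` is a hypothetical stand-in: it holds an `Arc<[u8]>` where a real version would presumably hold a memory map wrapped in `Bytes`, and its `slice` method mimics the behavior expected from `Bytes::slice` (a new view onto the same backing memory, with no bytes copied).

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a memory-mapped file: one shared, immutable
/// buffer plus the byte range this view exposes. A real implementation would
/// be backed by an mmap instead of a heap allocation.
#[derive(Clone, Debug)]
pub struct SharedBuf {
    data: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBuf {
    /// Take ownership of the file contents (copied once into the shared buffer).
    pub fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data: data.into(), range: 0..len }
    }

    /// Return a view of `range` (relative to this view) without copying any
    /// bytes. Returns `None` if the range is inverted or out of bounds.
    pub fn slice(&self, range: Range<usize>) -> Option<Self> {
        if range.start > range.end || range.end > self.range.len() {
            return None;
        }
        let start = self.range.start + range.start;
        let end = self.range.start + range.end;
        // Only the reference count is bumped; the bytes stay where they are.
        Some(Self { data: Arc::clone(&self.data), range: start..end })
    }

    /// Borrow the bytes covered by this view.
    pub fn as_bytes(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }

    /// True if both views point at the same backing allocation.
    pub fn shares_backing_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}
```

A range request from the object store would then map to one `slice` call on the view for that file. Whether this beats regular reads is exactly what the experiment would need to measure.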

I am not sure it would actually be faster -- so the first thing to do would be to code it up and try it out / see how fast we can get it

Describe alternatives you've considered

Use the memmap2 crate as shown in this example: https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs

Open the file, memory map it, and then turn it into Bytes that can be sliced without copying when a read is requested. At first, keep all mmapped files open. We will eventually implement a cap / LRU for the number of open mmaps, but to start we can just keep them all open and see how that goes.
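
One possible shape for the "keep everything open now, cap / LRU later" bookkeeping is sketched below using only the standard library. Everything here is an assumption for illustration: `MappingCache`, `get_or_open`, and the `open` callback are hypothetical names, and the cached value `V` stands in for whatever handle the real store keeps per file (for example a `Bytes` backed by an mmap). Passing `None` as the cap keeps every mapping, which matches the proposed starting point.

```rust
use std::collections::{HashMap, VecDeque};
use std::path::{Path, PathBuf};

/// Hypothetical cache of per-file mappings with an optional LRU cap.
pub struct MappingCache<V> {
    cap: Option<usize>,
    entries: HashMap<PathBuf, V>,
    /// Paths ordered by recency; the front is the least recently used.
    recency: VecDeque<PathBuf>,
}

impl<V: Clone> MappingCache<V> {
    /// `cap = None` keeps all mappings open; `Some(n)` keeps at most `n`.
    pub fn new(cap: Option<usize>) -> Self {
        Self { cap, entries: HashMap::new(), recency: VecDeque::new() }
    }

    /// Return the cached mapping for `path`, calling `open` only on a miss.
    pub fn get_or_open<E, F>(&mut self, path: &Path, open: F) -> Result<V, E>
    where
        F: FnOnce(&Path) -> Result<V, E>,
    {
        if let Some(v) = self.entries.get(path).cloned() {
            self.touch(path);
            return Ok(v);
        }
        let v = open(path)?;
        self.entries.insert(path.to_path_buf(), v.clone());
        self.recency.push_back(path.to_path_buf());
        // Evict least recently used entries until we are within the cap.
        if let Some(cap) = self.cap {
            while self.entries.len() > cap {
                match self.recency.pop_front() {
                    Some(old) => {
                        self.entries.remove(&old);
                    }
                    None => break,
                }
            }
        }
        Ok(v)
    }

    /// Mark `path` as the most recently used entry.
    fn touch(&mut self, path: &Path) {
        if let Some(pos) = self.recency.iter().position(|p| p.as_path() == path) {
            if let Some(p) = self.recency.remove(pos) {
                self.recency.push_back(p);
            }
        }
    }

    pub fn contains(&self, path: &Path) -> bool {
        self.entries.contains_key(path)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The linear scan in `touch` is fine for a small number of open files. Because callers receive a clone of the value, evicting an entry only drops the cache's handle, so a reference-counted mapping would stay valid for any reads still in flight.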

https://github.com/apache/arrow-rs/blob/main/arrow/examples/zero_copy_ipc.rs#L46-L47

Then we will wire this into datafusion-cli here so it is used for file URLs:

https://github.com/alamb/datafusion/blob/7ef62b988d19c75e737b57f1491cfc1cd9222466/datafusion-cli/src/object_storage.rs#L567-L566

I think the mmap file object store should not wrap another object store; instead, it could implement all functions directly. Use LocalFileSystem as an example of how to implement the various methods if needed.

Additional context

No response

Labels

enhancement