-
Notifications
You must be signed in to change notification settings - Fork 147
Start work on supporting filtering while parsing #656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…allow someone to conveniently know what is being deprecated, and once fixed, switch to CSV.File for a clean upgrade
Start work on supporting custom types
Codecov Report
@@ Coverage Diff @@
## master #656 +/- ##
==========================================
- Coverage 84.47% 82.95% -1.53%
==========================================
Files 8 8
Lines 1701 1748 +47
==========================================
+ Hits 1437 1450 +13
- Misses 264 298 +34
Continue to review full report at Codecov.
|
When parsing, CSV.File/CSV.Rows require an `AbstractVector{UInt8}`. For
`Cmd` and generic `IO` arguments, it's always been a little awkward
because we basically just tried to read the whole thing or had this
weird slurp function that tried to read it all in chunks. This led to a
few surprises for users when their system ran out of memory.
This PR deprecates passing `Cmd` and generic `IO` arguments as inputs,
stating the user should instead do `read(x)` first or find another way
to pass a filename in. I think this will overall lead to less surprises
for users and requiring them to do `read(x)` themselves isn't too
onerous and gives them the chance to read it themselves if that works
for their use-case or find a more efficient way to pass the data in.
This also deprecates the `use_mmap` keyword argument, since we'll mmap
every time a filename is passed in. You can "avoid" this by calling
`read(open(file))` yourself if so needed, but `CSV.File` doesn't hold on
to a reference of the mmapped file any more (unless `lazystrings=true`),
so we shouldn't run into some of the issues we have in the past.
We also cleanup some dependencies here by moving FilePathsBase.jl and
WeakRefStrings.jl to test-only dependencies, since they're not used
internally in CSV.jl anymore.
One last note is that `IOBuffer` is still allowed to be passed as an
input argument since it's basically just a byte buffer underneath
anyway, so we can use the data directly.
Deprecate CSV.read
Deprecate writeheader
Switch from Threads-at-threads to Threads.at-spawn. Fixes #657
Deprecate Cmd and generic IO inputs to CSV.File/CSV.Rows
Deprecate the categorical keyword argument
Looks like we accumulated some performance regressions for CSV.Rows; this fixes most of them. There's still some extra allocations that happen when passing custom types, but it's not too bad and can be solve another day.
Improve CSV.Rows performance
* Added some code comments to help clarify things * Update out-dated variable name usage (e.g. tapes) * Cleaned up dependencies (WeakRefStrings is test-only now) * Added lots of tests to increase coverage * Made multithreaded chunk identification more robust by checking we have correct # of columns for 5 consecutive rows instead of just 1 * Made sure we sync Int64 sentinels in multithreaded parsing * Removed some unused functions * Made sure we're testing type promoting when multithreaded parsing * Add a `tasks::Integer` keyword argument to allow controlling how many tasks will be spawned for multithreaded parsing * Clean up keyword arg docs
Lots of cleanup
|
@NHDaly, sorry for the slightly random ping, but I'm actually diving back in to the code here and want to make a push. would you mind taking a look here if you have a moment? Happy to jump on a quick call too if that'd be easier to do a quick review of what's going on here. |
No description provided.