Skip to content

Conversation

@quinnj
Copy link
Member

@quinnj quinnj commented Jun 25, 2020

No description provided.

quinnj added 30 commits August 29, 2018 06:45
…allow someone to conveniently know what is being deprecated, and once fixed, switch to CSV.File for a clean upgrade
)

* Fix #71 by adding a keyword argument to ignore repeated delimiters
Start work on supporting custom types
@codecov
Copy link

codecov bot commented Jun 26, 2020

Codecov Report

Merging #656 into master will decrease coverage by 1.52%.
The diff coverage is 65.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #656      +/-   ##
==========================================
- Coverage   84.47%   82.95%   -1.53%     
==========================================
  Files           8        8              
  Lines        1701     1748      +47     
==========================================
+ Hits         1437     1450      +13     
- Misses        264      298      +34     
Impacted Files Coverage Δ
src/utils.jl 84.50% <ø> (ø)
src/file.jl 90.56% <63.91%> (-5.75%) ⬇️
src/rows.jl 93.22% <100.00%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d74c2fc...bce612f. Read the comment docs.

quinnj added 9 commits June 25, 2020 21:56
When parsing, CSV.File/CSV.Rows require an `AbstractVector{UInt8}`. For
`Cmd` and generic `IO` arguments, it's always been a little awkward
because we basically just tried to read the whole thing or had this
weird slurp function that tried to read it all in chunks. This led to a
few surprises for users when their system ran out of memory.

This PR deprecates passing `Cmd` and generic `IO` arguments as inputs,
stating the user should instead do `read(x)` first or find another way
to pass a filename in. I think this will overall lead to less surprises
for users and requiring them to do `read(x)` themselves isn't too
onerous and gives them the chance to read it themselves if that works
for their use-case or find a more efficient way to pass the data in.

This also deprecates the `use_mmap` keyword argument, since we'll mmap
every time a filename is passed in. You can "avoid" this by calling
`read(open(file))` yourself if so needed, but `CSV.File` doesn't hold on
to a reference of the mmapped file any more (unless `lazystrings=true`),
so we shouldn't run into some of the issues we have in the past.

We also cleanup some dependencies here by moving FilePathsBase.jl and
WeakRefStrings.jl to test-only dependencies, since they're not used
internally in CSV.jl anymore.

One last note is that `IOBuffer` is still allowed to be passed as an
input argument since it's basically just a byte buffer underneath
anyway, so we can use the data directly.
quinnj added 14 commits June 26, 2020 00:52
Switch from Threads-at-threads to Threads.at-spawn. Fixes #657
Deprecate Cmd and generic IO inputs to CSV.File/CSV.Rows
Deprecate the categorical keyword argument
Looks like we accumulated some performance regressions for CSV.Rows;
this fixes most of them. There's still some extra allocations that
happen when passing custom types, but it's not too bad and can be solve
another day.
Improve CSV.Rows performance
* Added some code comments to help clarify things
* Update out-dated variable name usage (e.g. tapes)
* Cleaned up dependencies (WeakRefStrings is test-only now)
* Added lots of tests to increase coverage
* Made multithreaded chunk identification more robust by checking we
have correct # of columns for 5 consecutive rows instead of just 1
* Made sure we sync Int64 sentinels in multithreaded parsing
* Removed some unused functions
* Made sure we're testing type promoting when multithreaded parsing
* Add a `tasks::Integer` keyword argument to allow controlling how many
tasks will be spawned for multithreaded parsing
* Clean up keyword arg docs
@quinnj quinnj changed the base branch from master to main March 26, 2021 18:56
@quinnj
Copy link
Member Author

quinnj commented May 12, 2021

@NHDaly, sorry for the slightly random ping, but I'm actually diving back in to the code here and want to make a push. would you mind taking a look here if you have a moment? Happy to jump on a quick call too if that'd be easier to do a quick review of what's going on here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.