Start work on supporting filtering while parsing #656

quinnj · 2020-06-25T17:02:16Z

No description provided.

codecov · 2020-06-26T00:16:25Z

Codecov Report

Merging #656 into master will decrease coverage by 1.52%.
The diff coverage is 65.00%.

@@            Coverage Diff             @@
##           master     #656      +/-   ##
==========================================
- Coverage   84.47%   82.95%   -1.53%     
==========================================
  Files           8        8              
  Lines        1701     1748      +47     
==========================================
+ Hits         1437     1450      +13     
- Misses        264      298      +34

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/utils.jl | `84.50% <ø> (ø)` | |
| src/file.jl | `90.56% <63.91%> (-5.75%)` | ⬇️ |
| src/rows.jl | `93.22% <100.00%> (+0.05%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d74c2fc...bce612f. Read the comment docs.

When parsing, `CSV.File`/`CSV.Rows` require an `AbstractVector{UInt8}`. For `Cmd` and generic `IO` arguments, this has always been a little awkward: we basically just tried to read the whole thing, or used this weird slurp function that tried to read it all in chunks. This led to a few surprises for users when their system ran out of memory. This PR deprecates passing `Cmd` and generic `IO` arguments as inputs, stating that the user should instead call `read(x)` first or find another way to pass a filename in. I think this will overall lead to fewer surprises for users; requiring them to call `read(x)` themselves isn't too onerous, and it gives them the chance to read the data themselves if that works for their use case, or to find a more efficient way to pass the data in.

This also deprecates the `use_mmap` keyword argument, since we'll mmap every time a filename is passed in. You can "avoid" this by calling `read(open(file))` yourself if needed, but `CSV.File` doesn't hold on to a reference to the mmapped file any more (unless `lazystrings=true`), so we shouldn't run into some of the issues we have had in the past.

We also clean up some dependencies here by moving FilePathsBase.jl and WeakRefStrings.jl to test-only dependencies, since they're not used internally in CSV.jl anymore. One last note: `IOBuffer` is still allowed as an input argument since it's basically just a byte buffer underneath anyway, so we can use the data directly.
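For anyone migrating, here is a minimal sketch of the "call `read(x)` first" pattern described above, using only Base functions. The helper name `csv_bytes` is hypothetical and not part of CSV.jl; the final hand-off to `CSV.File` is left as a comment so the snippet runs without the package installed.

```julia
# Sketch only: `csv_bytes` is a hypothetical helper, not part of CSV.jl.
csv_bytes(cmd::Cmd) = read(cmd)  # runs the command, returns its stdout as Vector{UInt8}
csv_bytes(io::IO) = read(io)     # reads the rest of the stream into a Vector{UInt8}

# A generic IO source, stood in for here by an in-memory buffer.
bytes = csv_bytes(IOBuffer("a,b\n1,2\n3,4\n"))

# Reading a file eagerly gives an ordinary in-memory copy rather than an mmapped one.
path, io = mktemp()
write(io, "a,b\n1,2\n3,4\n")
close(io)
eager = read(path)

# With CSV.jl loaded, either byte vector would then be passed on, e.g. CSV.File(bytes).
```

Either way the caller decides when the whole input is pulled into memory, which is the point of the deprecation.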

Deprecate CSV.read

Deprecate writeheader

Switch from Threads-at-threads to Threads.at-spawn. Fixes #657

Deprecate Cmd and generic IO inputs to CSV.File/CSV.Rows

Deprecate the categorical keyword argument

Looks like we accumulated some performance regressions for CSV.Rows; this fixes most of them. There are still some extra allocations that happen when passing custom types, but it's not too bad and can be solved another day.

Improve CSV.Rows performance

* Added some code comments to help clarify things
* Update out-dated variable name usage (e.g. tapes)
* Cleaned up dependencies (WeakRefStrings is test-only now)
* Added lots of tests to increase coverage
* Made multithreaded chunk identification more robust by checking we have correct # of columns for 5 consecutive rows instead of just 1
* Made sure we sync Int64 sentinels in multithreaded parsing
* Removed some unused functions
* Made sure we're testing type promoting when multithreaded parsing
* Add a `tasks::Integer` keyword argument to allow controlling how many tasks will be spawned for multithreaded parsing
* Clean up keyword arg docs
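To illustrate what a `tasks::Integer` keyword can control, here is a rough sketch that splits a byte buffer into `tasks` chunks and processes each chunk in its own `Threads.@spawn` task. This is not CSV.jl's implementation: the function `count_newlines` and its chunking scheme are assumptions made purely for illustration, and real chunked parsing also has to locate valid row boundaries, which this sketch skips.

```julia
# Hypothetical sketch, not CSV.jl's implementation: count newline bytes by
# splitting the buffer into `tasks` chunks and spawning one task per chunk.
function count_newlines(buf::AbstractVector{UInt8}; tasks::Integer=Threads.nthreads())
    n = length(buf)
    ntasks = max(1, min(tasks, n))  # never more chunks than bytes, at least one
    # ntasks + 1 evenly spaced offsets from 0 to n mark the chunk boundaries
    bounds = [div(i * n, ntasks) for i in 0:ntasks]
    jobs = map(1:ntasks) do i
        chunk = view(buf, (bounds[i] + 1):bounds[i + 1])
        Threads.@spawn count(==(UInt8('\n')), chunk)
    end
    return sum(fetch, jobs)  # wait for every task and combine the partial counts
end

buf = Vector{UInt8}("a,b\n1,2\n3,4\n")
total = count_newlines(buf; tasks=2)
```

The number of spawned tasks is independent of `Threads.nthreads()`; the scheduler maps tasks onto whatever threads are available.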

Lots of cleanup

quinnj · 2021-05-12T04:34:22Z

@NHDaly, sorry for the slightly random ping, but I'm actually diving back into the code here and want to make a push. Would you mind taking a look here if you have a moment? Happy to jump on a quick call too if that'd be an easier way to do a quick review of what's going on here.

quinnj added 30 commits August 29, 2018 06:45

Try baking in a call to CSV.File take away overhead later

485add6

Put CSV.File initialization call in __init__

4249e2a

Put CSV.File initialization call in __init__

bbededa

Cleanup CSV.Source & CSV.read deprecations; the current setup should …

bd40d4d

…allow someone to conveniently know what is being deprecated, and once fixed, switch to CSV.File for a clean upgrade

A little more cleanup of deprecations

15d08ad

Add more tests around new functionality and some nice show methods

e488bbe

Add tests for CSV.write and fix a few things

706cd0d

Add tests for reading string delimiters

b6f8530

Actually add test/write.jl file

6a8a335

Bump Parsers in Manifest

f45a77d

Fix out of bounds index call

f593333

Fix 32-bit tests

415278c

Don't mmap by default on windows

74d6117

Throw on non-concrete types passed

cff0e18

Add skipto argument, which can be clearer sometimes than datarow. Fixes

beebac7

#182

Fix #249 by allowing symbol column names in types Dict

e12c333

Fix new skipto keyword argument

cbd15f1

Fix #247 by ensuring we can both append and write the header for csv …

156be65

…file

Update documentation

3cddfa3

Bump REQUIRE for Tables

1f49e29

Fix tests and docs

e12faf7

Take out rand call

08c313c

Fix Project for newly registered Tables

20586b2

Fixup Manifest

d04115b

Fix #251. Allow specifying Union types and fix Bool promotion with Mi…

b8777dc

…ssing

Explicitly implement Tables.jl interface

e62f9cc

Add support for skipping commented lines via the comment::String keyw…

a625c76

…ord argument

Track Tables.jl master for the moment

9d97d42

Update Manifest.toml

437791e

Fix #71 by adding a keyword argument to ignore repeated delimiters (#254

38f709e

) * Fix #71 by adding a keyword argument to ignore repeated delimiters

Merge pull request #649 from JuliaData/jq/lazystrings

aea79c6

Start work on supporting custom types

quinnj added 9 commits June 25, 2020 21:56

try to fix windows

4a5b69e

Deprecate CSV.read

18aef10

Clean up CSV.read deprecation

9e759f7

Deprecate writeheader

62bc3e1

fix tests

b14397a

Deprecate writeheader

7231418

Merge pull request #659 from JuliaData/jq/depread

32f58a8

Deprecate CSV.read

Merge pull request #660 from JuliaData/jq/depwriteheader

3f26b3c

Deprecate writeheader

quinnj mentioned this pull request Jun 26, 2020

Allow custom callbacks for invalid row/cell #618

Closed

quinnj added 14 commits June 26, 2020 00:52

Switch from Threads-at-threads to Threads.at-spawn. Fixes #657

352a2b1

Merge pull request #661 from JuliaData/jq/threads

039c056

Switch from Threads-at-threads to Threads.at-spawn. Fixes #657

Merge pull request #658 from JuliaData/jq/depio

95214ee

Deprecate Cmd and generic IO inputs to CSV.File/CSV.Rows

Deprecate the categorical keyword argument

91c9d57

Merge pull request #662 from JuliaData/jq/catg

b222c16

Deprecate the categorical keyword argument

Improve CSV.Rows performance

9c9088f

Looks like we accumulated some performance regressions for CSV.Rows; this fixes most of them. There are still some extra allocations that happen when passing custom types, but it's not too bad and can be solved another day.

Fix docs

2924126

Merge pull request #663 from JuliaData/jq/rowperf

3d2ad1f

Improve CSV.Rows performance

Merge pull request #664 from JuliaData/jq/cleanup

0075065

Lots of cleanup

Quick fix for testing file

d74c2fc

Start work on supporting filtering while parsing

72177e7

Wire up filtering

3d90903

fix rows

bce612f

quinnj force-pushed the jq/filter branch from 2ae06fe to bce612f Compare June 27, 2020 14:23

quinnj changed the base branch from master to main March 26, 2021 18:56

quinnj force-pushed the main branch from 04ec1cf to 4f8c505 Compare January 12, 2026 15:48

20 participants
