While working on large CSV aggregations (we're north of 1.8k input files at the moment, all grouped into the same destination dataset), it took me a while to realize that load_csv! raises at the first value that does not match the provided dtype, whereas calling cast silently converts such a value to nil and continues (which is also useful, of course).
Here is an up-to-date reproduction to showcase this:
Mix.install([
  {:explorer, "~> 0.11.1"}
])

ExUnit.start()

defmodule Repro do
  use ExUnit.Case
  alias Explorer.DataFrame, as: DF

  @incoming_data "field\nABC\n12.4"

  test "loading from CSV seems strict" do
    assert_raise RuntimeError, ~r/could not parse `ABC` as dtype `f64` at column 'field'/, fn ->
      DF.load_csv!(@incoming_data, dtypes: [{:field, {:f, 64}}])
    end
  end

  test "but casting from string, not strict" do
    result =
      @incoming_data
      |> DF.load_csv!(dtypes: [{:field, :string}])
      |> DF.mutate_with(fn df ->
        [field: Explorer.Series.cast(df[:field], {:f, 64})]
      end)
      |> Access.get(:field)
      |> Explorer.Series.to_list()

    # non-castable data has been translated to `nil`,
    # something which can catch you off guard quite a bit
    assert result == [nil, 12.4]
  end
end
Current notes from my exploration
- Polars has both strict & non-strict ways of casting
- the Explorer code-base uses strict_cast in at least 2 places
- but it does not currently expose strictness as an option to the end user
- I could not find (so far) any mention of this silent behaviour in the cast documentation
- my understanding is that exposing this could be a bit involved (not to mention defaulting to strict, if we wanted to)
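In the meantime, strictness can be approximated in user code. The following is only a minimal sketch, not Explorer's API: the StrictCast module and its cast!/2 function are hypothetical names I made up for illustration. It relies only on Explorer.Series.cast/2 and Explorer.Series.nil_count/1, and it assumes an eager series. Inside mutate_with the series is lazy, so I would not expect the count comparison to work there.

```elixir
Mix.install([
  {:explorer, "~> 0.11.1"}
])

defmodule StrictCast do
  # Hypothetical helper for illustration, not part of Explorer.
  alias Explorer.Series

  # Casts an eager series and raises if the cast introduced new nils.
  def cast!(series, dtype) do
    casted = Series.cast(series, dtype)

    # Values that were already nil before the cast are not counted as failures.
    introduced = Series.nil_count(casted) - Series.nil_count(series)

    if introduced > 0 do
      raise ArgumentError,
            "cast to #{inspect(dtype)} turned #{introduced} value(s) into nil"
    end

    casted
  end
end
```

With the data from the reproduction, calling StrictCast.cast!(df[:field], {:f, 64}) on the eagerly loaded string column should raise instead of returning [nil, 12.4]. The checked series can then be written back with DF.put/3. The cost is an extra pass over the column to count nils, which is why a native strict option would still be preferable.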
I thought it would be useful to open a discussion on this, since it could very much catch other people off guard (especially in Elixir, where things are usually stricter, and more typing is being introduced).