Problem
In PR #886, we added cuDF dtype coercion to safe_merge() in Engine.py:
if engine_concrete == Engine.CUDF and len(left) > 0:
merge_cols = []
if on is not None:
merge_cols = [on] if isinstance(on, str) else list(on)
elif left_on is not None:
left_cols = [left_on] if isinstance(left_on, str) else list(left_on)
right_cols = [right_on] if isinstance(right_on, str) else list(right_on)
merge_cols = list(zip(left_cols, right_cols))
for col_spec in merge_cols:
if isinstance(col_spec, tuple):
left_col, right_col = col_spec
else:
left_col = right_col = col_spec
if left_col in left.columns and right_col in right.columns:
left_dtype = left[left_col].dtype
right_dtype = right[right_col].dtype
# Cast right column to match left column type if they differ
if left_dtype \!= right_dtype:
try:
right[right_col] = right[right_col].astype(left_dtype)
except (ValueError, TypeError):
pass # Let the merge fail naturally if cast is impossible
This has several issues:
- Mutates input DataFrame -
right[right_col] = ... modifies the caller's DataFrame in place
- cuDF-only - pandas/dask/dask_cudf don't get this fix even though they may have similar issues
- Silent failure - the
pass on exception hides potential problems
Why we backed it out
The mutation bug is serious - callers don't expect safe_merge to modify their input DataFrames. This can cause subtle bugs where a DataFrame is reused after a merge and has unexpectedly changed dtypes.
Context
The fix was added to handle empty cuDF DataFrames that have float64 columns due to type inference issues. When merging with a non-empty DataFrame, the dtype mismatch causes cuDF to fail.
Proposed Fix
- Use
.copy() before modifying: right = right.copy() before the dtype coercion
- Consider if this should apply to all engines, not just cuDF
- Add logging/warning instead of silent
pass
- Add tests to verify the fix works and doesn't mutate inputs
Related
The fix was reverted in PR #886 to avoid shipping a mutation bug. cuDF tests may fail until this is properly addressed.
Problem
In PR #886, we added cuDF dtype coercion to
safe_merge()inEngine.py:This has several issues:
right[right_col] = ...modifies the caller's DataFrame in placepasson exception hides potential problemsWhy we backed it out
The mutation bug is serious - callers don't expect
safe_mergeto modify their input DataFrames. This can cause subtle bugs where a DataFrame is reused after a merge and has unexpectedly changed dtypes.Context
The fix was added to handle empty cuDF DataFrames that have
float64columns due to type inference issues. When merging with a non-empty DataFrame, the dtype mismatch causes cuDF to fail.Proposed Fix
.copy()before modifying:right = right.copy()before the dtype coercionpassRelated
The fix was reverted in PR #886 to avoid shipping a mutation bug. cuDF tests may fail until this is properly addressed.