Skip to content

Conversation

@garg-khushi
Copy link

This PR fixes an inconsistency where expand_dims created object-dtype
coordinates for string inputs instead of NumPy unicode dtype.

Changes:

  • Preserve NumPy unicode dtype for string coordinates created via expand_dims
  • Prevent object-dtype propagation through concat by improving
    PandasIndex coord dtype inference
  • Add regression tests covering both cases

Fixes #11061

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to work on this @garg-khushi and sorry it's taken a bit to review. I was wondering if there might be a simpler solution as mentioned in the comments. I tried out the change that I suggested locally and just had to tweak one test:

diff --git a/xarray/tests/test_dataset.py b/xarray/tests/test_dataset.py
index d25ef5a2..6f9edae4 100644
--- a/xarray/tests/test_dataset.py
+++ b/xarray/tests/test_dataset.py
@@ -3860,7 +3860,7 @@ class TestDataset:
         # Regression test for https://github.com/pydata/xarray/issues/7493#issuecomment-1953091000
         # todo: test still needed?
         ds = Dataset().expand_dims({"time": [np.datetime64("2018-01-01", "m")]})
-        assert ds.time.dtype == np.dtype("datetime64[s]")
+        assert ds.time.dtype == np.dtype("datetime64[m]")
 
     def test_set_index(self) -> None:
         expected = create_test_multiindex()

variables.update(name_and_new_1d_var)
coord_names.add(k)
dim[k] = variables[k].size

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need any changes in this file actually.

Comment on lines 675 to +685
coord_dtype = get_valid_numpy_dtype(index)
if coord_dtype == object and index.dtype == object:
inferred = getattr(index, "inferred_type", None)
if inferred in ("string", "unicode"):
coord_dtype = np.dtype(str)
else:
data = index.to_numpy(dtype=object, copy=False)
if data.size and all(
isinstance(x, (str, np.str_)) for x in data.ravel()
):
coord_dtype = np.asarray(data, dtype=str).dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need to be quite so precise in only targeting objects. This seems to work just as well:

Suggested change
coord_dtype = get_valid_numpy_dtype(index)
if coord_dtype == object and index.dtype == object:
inferred = getattr(index, "inferred_type", None)
if inferred in ("string", "unicode"):
coord_dtype = np.dtype(str)
else:
data = index.to_numpy(dtype=object, copy=False)
if data.size and all(
isinstance(x, (str, np.str_)) for x in data.ravel()
):
coord_dtype = np.asarray(data, dtype=str).dtype
coord_dtype = get_valid_numpy_dtype(np.asarray(array))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

expand_dims creates object dtype for string coordinates instead of inferring string dtype

2 participants