Description
Currently, xarray.Dataset is recognized as a valid tabular data type in some places. For example:
Lines 108 to 109 in dbbc168:

if hasattr(data, "data_vars") and len(data.data_vars) < 3:  # xr.Dataset
    raise GMTInvalidInput("data must provide x, y, and z columns.")
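Note that this check treats anything with a data_vars attribute as tabular. As a quick sketch (hypothetical data, not code from the test suite), even a purely gridded Dataset passes the hasattr test and is then rejected by the column-count check:
>>> import numpy as np
>>> import xarray as xr
>>> grid = xr.Dataset(data_vars=dict(z=(("y", "x"), np.zeros((4, 5)))))
>>> hasattr(grid, "data_vars")  # matched as "tabular" despite being gridded
True
>>> len(grid.data_vars)  # < 3, so the check above raises GMTInvalidInput
1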
Lines 1531 to 1547 in dbbc168:

>>> from pygmt.helpers import GMTTempFile
>>> import xarray as xr
>>> data = xr.Dataset(
...     coords=dict(index=[0, 1, 2]),
...     data_vars=dict(
...         x=("index", [9, 8, 7]),
...         y=("index", [6, 5, 4]),
...         z=("index", [3, 2, 1]),
...     ),
... )
>>> with Session() as ses:
...     with ses.virtualfile_in(check_kind="vector", data=data) as fin:
...         # Send the output to a file so that we can read it
...         with GMTTempFile() as fout:
...             ses.call_module("info", fin + " ->" + fout.name)
...             print(fout.read().strip())
<vector memory>: N = 3 <7/9> <4/6> <1/3>
But I think it should NOT be recognized as tabular data. Here are the reasons:
1. xarray.Dataset is more like a collection of xarray.DataArray objects than a pandas.DataFrame.
As the official docs say:
A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self describing dataset.
Dataset implements the mapping interface with keys given by variable names and values given by DataArray objects for each variable name.
xarray.Dataset can represent tabular data, but it's more commonly used as a data structure to hold multiple xarray.DataArray objects.
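Here is a minimal sketch (hypothetical data) of that mapping interface: the keys are variable names, the values are DataArray objects, and tabular use goes through an explicit conversion such as to_dataframe():
>>> import xarray as xr
>>> ds = xr.Dataset(
...     coords=dict(index=[0, 1, 2]),
...     data_vars=dict(x=("index", [9, 8, 7]), y=("index", [6, 5, 4])),
... )
>>> list(ds.data_vars)  # the mapping's keys
['x', 'y']
>>> isinstance(ds["x"], xr.DataArray)  # the mapping's values
True
>>> ds.to_dataframe()  # tabular use requires an explicit conversion
       x  y
index
0      9  6
1      8  5
2      7  4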
2. It's unclear what data are passed, and how.
Here is an example from the official documentation:
>>> import numpy as np
>>> import xarray as xr
>>> import pandas as pd
>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> reference_time = pd.Timestamp("2014-09-05")
>>> ds = xr.Dataset(
... data_vars=dict(
... temperature=(["x", "y", "time"], temperature),
... precipitation=(["x", "y", "time"], precipitation),
... ),
... coords=dict(
... lon=(["x", "y"], lon),
... lat=(["x", "y"], lat),
... time=time,
... reference_time=reference_time,
... ),
... attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset> Size: 288B
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lon (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 32B 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 96B 29.11 18.2 22.83 ... 16.15 26.63
precipitation (x, y, time) float64 96B 5.68 9.256 0.7104 ... 4.615 7.805
Attributes:
description: Weather related data.
Each data variable is an xarray.DataArray object:
>>> ds.temperature
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)> Size: 96B
array([[[29.11241877, 18.20125767, 22.82990387],
[32.92714559, 29.94046392, 7.18177696]],
[[22.60070734, 13.78914233, 14.17424919],
[18.28478802, 16.15234857, 26.63418806]]])
Coordinates:
lon (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 32B 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Then, in virtualfile_in, a list of multi-dimensional (3-D in this example) xarray.DataArray objects is passed to GMT modules that expect a list of 1-D arrays. It works without errors because Session.put_vector passes the raw pointer of the multi-dimensional array to the GMT C API function, but it likely won't work if the data is not C-contiguous (e.g., a slice of a dataset). So the actual behavior is not well defined.
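To illustrate the contiguity pitfall, here is a minimal sketch in plain NumPy (not PyGMT code): a strided slice shares memory with its parent array, so reading its data pointer as if it were a contiguous buffer returns values that are not in the slice at all:
>>> import ctypes
>>> import numpy as np
>>> arr = np.arange(12, dtype=np.float64).reshape(3, 4)
>>> view = arr[:, ::2]  # a strided slice sharing arr's buffer
>>> view.flags["C_CONTIGUOUS"]
False
>>> view
array([[ 0.,  2.],
       [ 4.,  6.],
       [ 8., 10.]])
>>> # Read view.size doubles straight from the data pointer, the way a
>>> # C API that assumes contiguous memory would:
>>> ptr = view.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
>>> [ptr[i] for i in range(view.size)]
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]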
So, I think xarray.Dataset should not be recognized as a valid tabular data type. This not only makes more sense but also simplifies our code and tests.