Description
Currently, xarray.Dataset is recognized as a valid tabular data type in some places. For example:
Lines 108 to 109 in dbbc168:

if hasattr(data, "data_vars") and len(data.data_vars) < 3:  # xr.Dataset
    raise GMTInvalidInput("data must provide x, y, and z columns.")
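Note that this check treats anything with a data_vars attribute as tabular. As a quick sketch (hypothetical data, not code from the test suite), even a purely gridded Dataset passes the hasattr test and is then rejected by the column-count check:
>>> import numpy as np
>>> import xarray as xr
>>> grid = xr.Dataset(data_vars=dict(z=(("y", "x"), np.zeros((4, 5)))))
>>> hasattr(grid, "data_vars")  # matched as "tabular" despite being gridded
True
>>> len(grid.data_vars)  # < 3, so the check above raises GMTInvalidInput
1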
Lines 1531 to 1547 in dbbc168:

>>> from pygmt.helpers import GMTTempFile
>>> import xarray as xr
>>> data = xr.Dataset(
...     coords=dict(index=[0, 1, 2]),
...     data_vars=dict(
...         x=("index", [9, 8, 7]),
...         y=("index", [6, 5, 4]),
...         z=("index", [3, 2, 1]),
...     ),
... )
>>> with Session() as ses:
...     with ses.virtualfile_in(check_kind="vector", data=data) as fin:
...         # Send the output to a file so that we can read it
...         with GMTTempFile() as fout:
...             ses.call_module("info", fin + " ->" + fout.name)
...             print(fout.read().strip())
<vector memory>: N = 3 <7/9> <4/6> <1/3>
But I think it should NOT be recognized as tabular data. Here are the reasons:
1. xarray.Dataset is more like a collection of xarray.DataArray objects than a pandas.DataFrame.
As the official docs say:
A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self describing dataset.
Dataset implements the mapping interface with keys given by variable names and values given by DataArray objects for each variable name.
xarray.Dataset can represent tabular data, but it's more commonly used as a data structure to hold multiple xarray.DataArray objects.
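Here is a minimal sketch (hypothetical data) of that mapping interface: the keys are variable names, the values are DataArray objects, and tabular use goes through an explicit conversion such as to_dataframe():
>>> import xarray as xr
>>> ds = xr.Dataset(
...     coords=dict(index=[0, 1, 2]),
...     data_vars=dict(x=("index", [9, 8, 7]), y=("index", [6, 5, 4])),
... )
>>> list(ds.data_vars)  # the mapping's keys
['x', 'y']
>>> isinstance(ds["x"], xr.DataArray)  # the mapping's values
True
>>> ds.to_dataframe()  # tabular use requires an explicit conversion
       x  y
index
0      9  6
1      8  5
2      7  4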
2. It's unclear what data are passed, and how.
Here is an example from the official documentation:
>>> import numpy as np
>>> import xarray as xr
>>> import pandas as pd
>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> reference_time = pd.Timestamp("2014-09-05")
>>> ds = xr.Dataset(
... data_vars=dict(
... temperature=(["x", "y", "time"], temperature),
... precipitation=(["x", "y", "time"], precipitation),
... ),
... coords=dict(
... lon=(["x", "y"], lon),
... lat=(["x", "y"], lat),
... time=time,
... reference_time=reference_time,
... ),
... attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset> Size: 288B
Dimensions: (x: 2, y: 2, time: 3)
Coordinates:
lon (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 32B 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Data variables:
temperature (x, y, time) float64 96B 29.11 18.2 22.83 ... 16.15 26.63
precipitation (x, y, time) float64 96B 5.68 9.256 0.7104 ... 4.615 7.805
Attributes:
description: Weather related data.
Each data variable is an xarray.DataArray object:
>>> ds.temperature
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)> Size: 96B
array([[[29.11241877, 18.20125767, 22.82990387],
[32.92714559, 29.94046392, 7.18177696]],
[[22.60070734, 13.78914233, 14.17424919],
[18.28478802, 16.15234857, 26.63418806]]])
Coordinates:
lon (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 32B 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Then, in virtualfile_in, a list of multi-dimensional (3-D in this example) xarray.DataArray objects is passed to GMT modules that expect a list of 1-D arrays. It works without errors because Session.put_vector passes the raw pointer of the multi-dimensional array to the GMT C API function, but it likely won't work if the data is not C-contiguous (e.g., a slice of a dataset). So the actual behavior is not well defined.
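To illustrate the contiguity pitfall, here is a minimal sketch in plain NumPy (not PyGMT code): a strided slice shares memory with its parent array, so reading its data pointer as if it were a contiguous buffer returns values that are not in the slice at all:
>>> import ctypes
>>> import numpy as np
>>> arr = np.arange(12, dtype=np.float64).reshape(3, 4)
>>> view = arr[:, ::2]  # a strided slice sharing arr's buffer
>>> view.flags["C_CONTIGUOUS"]
False
>>> view
array([[ 0.,  2.],
       [ 4.,  6.],
       [ 8., 10.]])
>>> # Read view.size doubles straight from the data pointer, the way a
>>> # C API that assumes contiguous memory would:
>>> ptr = view.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
>>> [ptr[i] for i in range(view.size)]
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]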
So, I think xarray.Dataset should not be recognized as a valid tabular data type. This not only makes more sense but also simplifies our code and tests.