Description
Here are comments from an email we received from an early alpha tester:
Metadata:
The first thing I wanted to do was to get to know the CMAP data. It’s easy enough to get the catalog and the variables (columns) in a table. But it would be nice to get the metadata for all the variables in a table.
This is what I came up with:
```r
queryMetadata = function(tab) {
  get_columns(tab) %>%
    unlist() %>%
    setdiff(c("lat", "lon", "time", "depth", "cruise")) %>%
    purrr::map(function(v) get_metadata(tab, v)) %>%
    bind_rows()
}

queryMetadata("tblSeaFlow")
```

Is there a faster way of doing this? For example, by combining all the queries into one, which would, I expect, lower the transaction cost / overhead of the data transfer and query?
The next thing I want to do is to do this for all tables, e.g.,

```r
catalog %>% select(Table_Name) %>% unique() %>% slice(5:6) %>%
  purrr::map(queryMetadata)
```

But when I do this I get the error “Unknown or uninitialised column: 'Dataset_ID'.” I haven’t figured out why yet. It’s also tremendously slow!
The code below does what I want and seems to (a) work and (b) be quite a bit faster. (It does the first 10 tables in just over half a minute. Tables 11–20 have 247 variables and take over a minute.) Not sure why this is faster or whether I am overlooking something.
```r
queryMetadata = function(tab) {
  get_columns(tab) %>%
    unlist() %>%
    setdiff(c("lat", "lon", "time", "depth", "cruise")) %>%
    purrr::map(function(v) my_get_metadata(v, tab)) %>%
    bind_rows()
}

catalog %>% select(Table_Name) %>% unique() %>% slice(1:10) %>% unlist() %>%
  purrr::map(queryMetadata) %>% bind_rows() -> items1_10
```

I see each request is a SQL query sent over https. Probably it’s not possible to write a “get_all_columns_from_all_tables” function that doesn’t amount to a whole series of separate queries. Maybe there could be a table of metadata in the database that had all this information? Or maybe I’m just making trouble. Or maybe it already exists and I just don’t know what it’s called.
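Since each request is an independent HTTPS round trip, one way to claw back time without any server-side changes is to issue the per-table queries concurrently. A sketch using base R's parallel package, with a short sleep standing in for the real network call (query_one and the table names are made up):

```r
library(parallel)

# Made-up stand-in for one per-table metadata query (network-bound work)
query_one <- function(tab) {
  Sys.sleep(0.1)
  paste0("meta:", tab)
}

tables <- c("tbl1", "tbl2", "tbl3", "tbl4")

# mclapply forks workers on Unix; on Windows it only supports mc.cores = 1
cores <- if (.Platform$OS.type == "windows") 1L else 2L
results <- mclapply(tables, query_one, mc.cores = cores)
```

Results come back in input order, so downstream bind_rows-style code is unaffected.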
Do you know a good way to cache results from expensive queries? Do you have experience with R.cache? It could be useful!
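I can't speak for the package, but a minimal in-memory memoiser is easy to sketch in base R; the memoise package provides the same idea ready-made, and R.cache adds on-disk persistence across sessions. Here expensive_query is a made-up stand-in for a real CMAP call:

```r
# Minimal in-memory cache keyed by a string argument
make_cached <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache)) {
      assign(key, f(key), envir = cache)
    }
    get(key, envir = cache)
  }
}

calls <- 0L
expensive_query <- function(tab) {
  calls <<- calls + 1L   # count how often the "server" is actually hit
  paste0("result:", tab)
}
cached_query <- make_cached(expensive_query)

cached_query("tblSeaFlow")  # first call hits the server
cached_query("tblSeaFlow")  # second call is served from the cache
```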
Search for variables
Ultimately I’d like to be able to search for a variable such as chlorophyll, or a unit such as cells per microliter (only possible if there is standardization in units, but maybe that can be achieved), across all tables. This will help with “discoverability”, I think.
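Assuming the per-table metadata frames are first bound into one data frame (as in the queryMetadata code above), that search could be a simple case-insensitive pattern match; the column names and contents below are illustrative, not CMAP's actual schema:

```r
# Made-up stand-in for the combined metadata of all tables
meta <- data.frame(
  Table_Name = c("tblSeaFlow", "tblCTD", "tblSat"),
  Variable   = c("chl", "chlorophyll_a", "sst"),
  Unit       = c("ug/L", "mg/m^3", "degC"),
  Long_Name  = c("Chlorophyll", "Chlorophyll-a", "Sea Surface Temperature"),
  stringsAsFactors = FALSE
)

# Case-insensitive search over variable names and long names
search_vars <- function(meta, pattern) {
  hit <- grepl(pattern, meta$Variable, ignore.case = TRUE) |
    grepl(pattern, meta$Long_Name, ignore.case = TRUE)
  meta[hit, ]
}

found <- search_vars(meta, "chloro")
```

Searching the Unit column the same way would, as you say, only work once units are standardized.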
Querying for actual data:
An important way I imagine using CMAP is to query several variables at specific
lat, lon, depth, time with specified tolerances, e.g., given a dataframe /
tibble with columns, lat, delta_lat, lon, delta_lon, depth, delta_depth, time, delta_time, I’d like to make one query and get all the observations in all those
boxes for a whole set of tables and variables. The “along_track” (maybe should
be get_along_track?) looks like what I want – if there is a way for a user to
supply a ship track? Or must the shiptrack be stored in the database already? I
think I recall discussion (or a request) at a previous meeting for user-supplied
ship tracks.
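The box-per-row idea can be sketched as below: each row of the tolerance tibble expands into the [min, max] bounds that a space-time query would take (depth and time would be handled the same way; all names here are illustrative, not the package's API):

```r
# Made-up track of query points with per-point tolerances
track <- data.frame(
  lat = c(47.6, 48.1), delta_lat = c(0.25, 0.25),
  lon = c(-122.3, -123.0), delta_lon = c(0.25, 0.25)
)

# Expand each row into the [min, max] box a space-time query would accept;
# one query per box would then be issued and the results bound together
boxes <- data.frame(
  lat1 = track$lat - track$delta_lat, lat2 = track$lat + track$delta_lat,
  lon1 = track$lon - track$delta_lon, lon2 = track$lon + track$delta_lon
)
```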
Querying is hard to understand
From the documentation it was not easy for me to understand get_section, get_spacetime, get_timeseries, and get_depthprofile. I think I have now figured it out: section returns the full 4-d array; spacetime and timeseries average over depth (spacetime) and depth (timeseries, but provides mean and sd). No, that’s not right. I don’t understand what these functions do.
Depth in get_spacetime()
Depth is optional in get_spacetime. What happens if it is not specified? What if it is specified? It looks like get_spacetime averages over depth, but that isn’t actually written in the help page, I think.
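To make the question concrete, here is what "averaging over depth" versus "averaging over time" would mean operationally, sketched with base R's aggregate() on made-up data:

```r
# Made-up observations: two depths at each of two time stamps
d <- data.frame(
  time  = c("t1", "t1", "t2", "t2"),
  depth = c(5, 10, 5, 10),
  chl   = c(1.0, 3.0, 2.0, 4.0)
)

# "Averages over depth": collapse the depth dimension, keep time
by_time <- aggregate(chl ~ time, data = d, FUN = mean)

# "Averages over time": collapse the time dimension, keep depth
by_depth <- aggregate(chl ~ depth, data = d, FUN = mean)
```

If get_spacetime does the first of these when depth is unspecified, the help page could simply say so.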
Hard to understand get_depthprofile() and get_timeseries()
I had a hard time understanding the description section of the get_depthprofile help page. From the output it looks to me as though the averaging is done over time, not depth, so “aggregated by depth” was confusing to me. Do you know what happens with NAs when averaging or computing the sd? Are they dropped?
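For base R at least, the NA behavior is unambiguous: mean() and sd() propagate NA unless na.rm = TRUE is given (SQL aggregates such as AVG, by contrast, typically skip NULLs), so it would be worth documenting which behavior these functions inherit. A quick check:

```r
x <- c(1, 2, NA, 4)

m_default <- mean(x)                # NA: missing values propagate by default
m_dropped <- mean(x, na.rm = TRUE)  # NAs removed before averaging
s_dropped <- sd(x, na.rm = TRUE)    # sd over the three non-missing values
```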