gingerarmbrust/Data_Submission_FileStruc_Doc

 
 

Simons CMAP Data Submission:
File Structure

Table of Contents

  1. Introduction

  2. Data Sheet

  3. Dataset Metadata

  4. Variable Metadata

  5. Bibliography








Introduction

This document describes the specifications of the data and metadata fields required for submitting datasets to the Simons CMAP database. As long as the required fields are provided, the submitted data can be in any file format, such as netCDF, plain text, CSV, or Excel. For simplicity, we have created an empty dataset template in Excel format, which can be found here. Please use this template to load and submit your dataset. The data and metadata field names (e.g. time, lat, lon, short_name, long_name, ...) used in the template file are inspired by the CF and COARDS naming conventions [1, 2, 3].

The CMAP data template consists of three sheets: data, dataset metadata, and variable metadata. Data is stored in the first sheet, called "data". The second sheet stores the dataset metadata and is called "dataset_meta_data". Metadata associated with the variables in the dataset is entered in the third sheet, "vars_meta_data". All columns are required unless specified otherwise. Below are a few example datasets that have been prepared using the specifications described in this document:








Data Sheet

          time                 lat  lon   depth [if exists]  var1   ...  varn
example:  2016-05-01T15:02:00  25   -158  5                  value  ...  value

All of the data points are stored in the "data" sheet. Every data point must have time and location information. The exact names and order of the time and location columns are shown in the table above. If a dataset does not have depth values (sea surface measurements), you may remove the depth column. The columns var1 ... varn represent the dataset variables (measurements). Please rename var1 ... varn to appropriate names. The formats of the "time", "lat", "lon", and "depth" columns are described in the following sections. Please review the example datasets listed in the introduction for more detailed information.
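The column layout described above can be checked before submission. The following is a minimal sketch, not part of the CMAP tooling: the function name `check_header` and the wording of the messages are assumptions made for illustration.

```python
import re

# Leading columns of the data sheet, in the order given by the template.
LEADING_COLUMNS = ["time", "lat", "lon"]


def check_header(columns):
    """Return a list of problems found in the data-sheet column names."""
    problems = []
    cols = [c.strip() for c in columns]
    if cols[:3] != LEADING_COLUMNS:
        problems.append("first three columns must be time, lat, lon (in that order)")
    if "depth" in cols and cols.index("depth") != 3:
        problems.append("depth, when present, must be the fourth column")
    # depth is optional, so the variables start at index 3 or 4.
    n_fixed = 4 if cols[3:4] == ["depth"] else 3
    variables = cols[n_fixed:]
    if not variables:
        problems.append("no variable columns found")
    # The template placeholders var1 ... varn are meant to be renamed.
    placeholders = [v for v in variables if re.fullmatch(r"var(\d+|n)", v)]
    if placeholders:
        problems.append("placeholder names not renamed: " + ", ".join(placeholders))
    return problems
```

For example, `check_header(["time", "lat", "lon", "depth", "temperature"])` returns an empty list, where `temperature` is a hypothetical variable name.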

  • time*
    This column holds datetime values with the following format: %Y-%m-%dT%H:%M:%S
    The date and time sections are separated by a "T" character.
    Example: 2010-02-09T18:15:00
    • Year (%Y) is a four-digit value: example 2010
    • Month (%m) is a two-digit value: example 02 (for February)
    • Day (%d) is a two-digit value: example 09
    • Hour (%H) is a two-digit value from 00 to 23: example 18
    • Minute (%M) is a two-digit value from 00 to 59: example 15
    • Second (%S) is a two-digit value from 00 to 59: example 00
    • Time zone: UTC
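The format string above uses the same directives as Python's standard `datetime` module, so time values can be produced and checked as sketched below. The helper names `to_cmap_time` and `is_cmap_time` are hypothetical; treating naive datetimes as already being in UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone

CMAP_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_cmap_time(dt):
    """Format a datetime for the time column.

    Timezone-aware values are converted to UTC first; naive values are
    assumed to be in UTC already.
    """
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.strftime(CMAP_TIME_FORMAT)


def is_cmap_time(text):
    """Return True if text matches the format with zero-padded fields."""
    try:
        parsed = datetime.strptime(text, CMAP_TIME_FORMAT)
    except ValueError:
        return False
    # strptime also accepts single-digit fields such as "2016-5-01";
    # formatting the parsed value back rejects those.
    return parsed.strftime(CMAP_TIME_FORMAT) == text
```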

  • lat*
    This column holds the latitude values with the following characteristics:
    • Type: Numeric values from -90 to 90
    • Format: Decimal (not military grid system)
    • Unit: degree North

  • lon*
    This column holds the longitude values with the following characteristics:
    • Type: Numeric values from -180 to 180
    • Format: Decimal (not military grid system)
    • Unit: degree East

  • depth
    This column holds the depth values with the following characteristics:
    • Type: Non-negative numeric values. Depth is 0 at the surface and increases with distance below the surface.
    • Format: Decimal
    • Unit: meter
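The ranges listed for lat, lon, and depth can be checked with a few comparisons. This is a minimal sketch with hypothetical function names. `normalize_lon` covers the common case of source data that uses a 0 to 360 longitude convention; that case is an assumption and is not something this document prescribes.

```python
def normalize_lon(lon):
    """Wrap a longitude in degrees East into the interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def check_position(lat, lon, depth=None):
    """Return a list of problems for one data point; empty means it passes."""
    problems = []
    if not -90 <= lat <= 90:
        problems.append(f"lat {lat} is outside [-90, 90]")
    if not -180 <= lon <= 180:
        problems.append(f"lon {lon} is outside [-180, 180]")
    # depth is optional (surface datasets may omit the column).
    if depth is not None and depth < 0:
        problems.append(f"depth {depth} is negative")
    return problems
```

Note that `normalize_lon` maps 180 to -180, which denotes the same meridian.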

  • var1 ... varn
    These columns represent the dataset variables (measurements). Please rename them to appropriate names. Please do not include units in these columns; units are recorded in the "vars_meta_data" sheet. For missing values, that is, instances where no measurement was taken, leave the cell empty. Please review the example datasets listed in the introduction for more information.
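When the data sheet is exported from a script, missing measurements are often held as `None` or NaN and need to end up as empty cells. The sketch below uses Python's standard `csv` module and assumes a CSV submission; the function name `rows_to_csv` and the variable name `temperature` are hypothetical.

```python
import csv
import io


def _cell(value):
    """Turn None and NaN into an empty cell; pass other values through."""
    if value is None:
        return ""
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return ""
    return value


def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text with empty missing cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([_cell(v) for v in row])
    return buffer.getvalue()


text = rows_to_csv(
    ["time", "lat", "lon", "depth", "temperature"],
    [["2016-05-01T15:02:00", 25, -158, 5, None]],
)
```

Without the NaN check, `csv.writer` would write the literal text `nan` into the cell.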







Dataset Metadata

This sheet holds a list of top-level attributes about the dataset, such as the dataset name and description. Below is the list of these attributes along with their descriptions. Please review the example datasets listed in the introduction for more information.

  • dataset_short_name*
    This name is meant to be used in code and scripts, so it should only contain a combination of letters, numbers, and underscores. Do not use spaces, dashes, or special characters such as <, +, %, etc. It must be shorter than 50 characters and is a required field.

    • Required: Yes
    • Constraint: Less than 50 characters
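The naming rule above can be expressed as a regular expression. This is a minimal sketch with a hypothetical function name; restricting "letters" to ASCII letters is an assumption, since this document does not say whether accented or non-Latin letters are accepted.

```python
import re

# Letters, digits, and underscores only (ASCII letters assumed).
_SHORT_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_short_name(name, max_len=50):
    """Return True if name uses only allowed characters and is shorter than max_len."""
    return len(name) < max_len and _SHORT_NAME.fullmatch(name) is not None
```

For example, `is_valid_short_name("seaflow_abundance")` is `True`, while names containing a space or a dash are rejected.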

  • dataset_long_name*
    A descriptive and human-readable name for the dataset. This name will represent your dataset in the CMAP catalog (Fig. 1) and in the visualization search dialog (Fig. 2). Any Unicode character can be used here, but please avoid names longer than 200 characters, as they might get trimmed when displayed on graphical interfaces. Please refer to dataset_description if you would like to add a full textual description (with no length limit) for your dataset.

    • Required: Yes
    • Constraint: Less than 200 characters

Dataset Long_name in Catalog

Figure 1. A sample dataset shown in the Simons CMAP catalog. The "dataset_long_name" is enclosed in the red rectangle.

Dataset Long_name in Visualization Page

Figure 2. The "dataset_long_name" appears in the visualization page search dialog.



  • dataset_version*
    Please assign a version number or an identifier to your dataset, such as "1.0.0" or "ver 1". Version identifiers help track the evolution of a dataset over time.

    • Required: Yes
    • Constraint: Less than 50 characters
    • Example: 1.0

  • dataset_release_date*
    Indicates the release date of the dataset. If your dataset has been previously published or released publicly, please specify that date. Otherwise, note the date the data was submitted to CMAP.

    • Required: Yes
    • Constraint: Less than 50 characters
    • Example: 2016-05-01

  • dataset_make*
    This is a required field specifying a broad production category for the dataset (also referred to as the dataset make). It can only take a single value from a fixed set of options (observation, model, assimilation, laboratory), which are described below. Categorizing datasets by their make class helps users find data in CMAP. Please contact us if you believe your dataset's make is not covered by any of the values below:

    • Observation: refers to any in-situ or remote sensing measurements, such as measurements made during a cruise expedition, data from an in-situ sensor, or satellite observations. Observations made in laboratory settings (culture experiments) have their own category (see Laboratory) and do not belong here.

    • Model: refers to the outputs of numerical simulations.

    • Assimilation: refers to the products that are a blend of observations and numerical models.

    • Laboratory: refers to the observations made in a laboratory setting such as culture experiment results.
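Because dataset_make accepts only one of four values, it is easy to check before submission. The sketch below uses a hypothetical function name; accepting any letter case and surrounding whitespace is an assumption, since this section spells the options both in lower case and capitalized.

```python
DATASET_MAKES = ("observation", "model", "assimilation", "laboratory")


def normalize_make(value):
    """Return the lower-case make, or raise ValueError if it is not an allowed option."""
    cleaned = value.strip().lower()
    if cleaned not in DATASET_MAKES:
        allowed = ", ".join(DATASET_MAKES)
        raise ValueError(f"dataset_make must be one of: {allowed}; got {value!r}")
    return cleaned
```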

  • dataset_source*
    Specifies the group and/or the institute name of the data owner(s). It can also include a link (such as a website) to the data producers. This information will be visible in the CMAP catalog (Fig. 3). In addition, dataset_source will be added as an annotation to any visualization made using the dataset (Fig. 4). This is a required field and its length must be less than 100 characters.

    • Required: Yes
    • Constraint: Less than 100 characters
    • Example: Armbrust Lab, University of Washington

Dataset Source in Catalog

Figure 3. A sample dataset shown in the Simons CMAP catalog. The "dataset_source" is enclosed in the red rectangle.

Dataset Source in Visualizations

Figure 4. The "dataset_source" appears in visualization made using the corresponding dataset (enclosed in the red rectangle).



  • dataset_distributor
    If your dataset has already been published by a data distributor, provide a link to the data distributor. Otherwise, leave this field empty. This is not a required field.

  • dataset_acknowledgement
    Specify how your dataset should be acknowledged. You may mention your funding agency or grant number, or you may ask CMAP users to acknowledge your dataset with a certain phrase. The dataset acknowledgement will be visible on the catalog page (Fig. 5). This is not a required field.

    • Required: No (optional)
    • Constraint: No length limits


Dataset Acknowledgment in Catalog

Figure 5. A sample dataset shown in the Simons CMAP catalog. The "dataset_acknowledgement" is enclosed in the red rectangle.



  • dataset_history
    Use this field in case your dataset has evolved over time and you wish to add notes about the history of your dataset. Otherwise, leave this field empty. This is not a required field.

  • dataset_description*
    Include any description that you think will help the reader better understand your dataset. This can include information about data acquisition, processing methods, figures, and links to external content. This field acts as the dataset documentation and is visible in the Simons CMAP catalog (Fig. 6). This field is required.

    • Required: Yes
    • Constraint: No length limits

Dataset description in Catalog

Figure 6. A sample dataset shown in the Simons CMAP catalog. The "dataset_description" is accessible using the "Dataset Details" button, enclosed in the red rectangle.



  • dataset_references
    List any publications or documentation that one may cite in reference to the dataset. If there is more than one reference, please put them in separate cells under the dataset_references column. Leave this field empty if there are no publications associated with this dataset. This is not a required field.

  • climatology
    This is a flag indicating whether the dataset represents a climatological product. If your dataset is a climatological product fill this field with "1". Otherwise, leave this field blank. This is not a required field.

  • cruise_names
    If your dataset represents measurements made during a cruise expedition (or expeditions), provide a list of official cruise names here. If your dataset is associated with more than one cruise, please put them in separate cells under the cruise_names column. If the cruises have any nicknames, please include them too. Leave this field blank if your dataset is not associated with a cruise expedition. This is not a required field.

    • Required: No (optional)
    • Constraint: No length limits
    • Example: KOK1606, Gradients 1
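The required flags and length limits listed in this section can be collected into a table and checked in one pass. This is a minimal sketch, not part of the CMAP tooling: the function name is hypothetical, each field is passed as a single string for simplicity (multi-cell fields such as dataset_references would need to be joined or checked per cell), and fields without a stated constraint are treated as having no length limit.

```python
# field name -> (required, exclusive upper bound on length, or None for no limit)
DATASET_FIELDS = {
    "dataset_short_name": (True, 50),
    "dataset_long_name": (True, 200),
    "dataset_version": (True, 50),
    "dataset_release_date": (True, 50),
    "dataset_make": (True, None),
    "dataset_source": (True, 100),
    "dataset_distributor": (False, None),
    "dataset_acknowledgement": (False, None),
    "dataset_history": (False, None),
    "dataset_description": (True, None),
    "dataset_references": (False, None),
    "climatology": (False, None),
    "cruise_names": (False, None),
}


def check_dataset_metadata(meta):
    """Return a list of problems in a dict of dataset metadata values."""
    problems = []
    for field, (required, limit) in DATASET_FIELDS.items():
        raw = meta.get(field)
        value = "" if raw is None else str(raw).strip()
        if required and not value:
            problems.append(f"{field}: required field is empty")
        elif limit is not None and len(value) >= limit:
            problems.append(f"{field}: must be shorter than {limit} characters")
    # Flag keys that are not attributes of the dataset_meta_data sheet.
    for field in sorted(set(meta) - set(DATASET_FIELDS)):
        problems.append(f"{field}: not a dataset metadata field")
    return problems
```

An empty list means every required field is present and within its length limit; it does not check the content of the fields, such as the allowed dataset_make values.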








Variable Metadata

A dataset can contain multiple different measurements (variables). This sheet (labeled "vars_meta_data") holds a list of top-level attributes about these variables, such as the variable name, unit, and description. Each variable, along with its attributes (metadata), is stored in its own row. Below is the list of these attributes along with their descriptions. Please review the example datasets listed in the introduction for more information.

  • var_short_name*
    This name is meant to be used in code and scripts, so it should only contain a combination of letters, numbers, and underscores. Do not use spaces, dashes, or special characters such as <, +, %, etc. var_short_name will be seen in the CMAP catalog (Fig. 7) and will appear as the title of the generated figures (Fig. 8). This is a required field and must be shorter than 50 characters.

    • Required: Yes
    • Constraint: Less than 50 characters

Variable short name in catalog

Figure 7. A sample dataset shown in the Simons CMAP catalog. The "var_short_name" is highlighted in the red rectangle.

Variable short name in a figure

Figure 8. A sample figure generated in the Simons CMAP catalog. The "var_short_name" appears in the figure's title and is highlighted in the red rectangle.



  • var_long_name*
    A descriptive and human-readable label for the variable, in accordance with the CF and COARDS conventions [1, 2, 3]. This name will represent your variable in the CMAP catalog (Fig. 9) and in the visualization search dialog (Fig. 10). var_long_name can contain any Unicode character, but please avoid names longer than 200 characters, as they might get trimmed when displayed on graphical interfaces. Please refer to var_comment if you would like to add a full textual description (with no length limit) for your variable.

    • Required: Yes
    • Constraint: Less than 200 characters

Variable long name in catalog

Figure 9. A sample dataset shown in the Simons CMAP catalog. The "var_long_name" is highlighted in the red rectangle.

var long name in visualization page

Figure 10. The "var_long_name" appears in the visualization page search dialog.



  • var_sensor*
    This is a required field that refers to the instrument used to produce the measurements, such as CTD, fluorometer, flow cytometer, sediment trap, etc. If your dataset is the result of a field expedition but you are not sure about the name of the instrument used for the measurements, use the term "in-situ" to fill out this field. This field helps users find and categorize data acquired using a similar class of instruments. var_sensor will be visible in the Simons CMAP catalog.



  • var_unit
    Specifies the variable's unit, if applicable. Leave this field blank if your variable is unitless (e.g. "station numbers" or "quality flags"). It may contain unicode characters such as subscripts and superscripts. var_unit will be visible in the Simons CMAP catalog (see Fig. 9) and in the generated visualizations (see Fig. 8). This field is not required.

    • Required: No (optional)
    • Constraint: Less than 50 characters
    • Example: ul L-1

  • var_spatial_res*
    Specifies the spatial resolution of the variable. Typically, gridded products have uniform spatial spacing (such as 0.25° X 0.25°) while field expeditions do not have a regular spatial resolution. If your variable does not have a regular spatial resolution, use the term "irregular" to fill out this field. Note that if samples are taken at a series of distinct but spatially-non-uniform stations, the spatial resolution is considered irregular. var_spatial_res may contain unicode characters such as degree symbol ( ° ) and will be visible in the Simons CMAP catalog (see Fig. 9). This field is required.

    • Required: Yes
    • Constraint: Less than 50 characters
    • Example: irregular

  • var_temporal_res*
    Specifies the temporal resolution between the subsequent measurements (such as daily, hourly, 3-minutes, etc). Typically field expedition measurements do not have a regular temporal spacing in which case you may use the term "irregular" to fill out this field. var_temporal_res will be visible in the Simons CMAP catalog (see Fig. 9). This field is required.

    • Required: Yes
    • Constraint: Less than 50 characters
    • Example: irregular

  • var_discipline*
    Indicates in which disciplines (such as Physics, Biology ...) this variable is commonly studied. You can specify more than one discipline. var_discipline will be visible in the Simons CMAP catalog (referred to as "Study Domain" in Fig. 9). This field is required.

    • Required: Yes
    • Constraint: Less than 100 characters
    • Example: BioGeoChemistry

  • visualize
    This is a flag field and can only be 0 or 1. Fill this field with 1 if you think this variable can be visualized by Simons CMAP. In principle, any variable with numeric values can be visualized, while variables with string values, station numbers, or quality flags may not be the best candidates for visualization in CMAP. Please consult with the data curation team if you have any questions. This is not a required field.



  • var_keywords*
    Every variable in CMAP is annotated with a range of semantically related keywords, which makes the variable much easier to find. For example, if one looks for "PO4", a list of all phosphate data is retrieved even if the variables are not named "PO4". Similarly, if one searches for "MIT", CMAP returns all variables generated by MIT groups, and if one looks for "model", it should only return model outputs. These "semantic" searches are made possible by the keywords that are added to each variable. We would like the keywords to cover all of the following areas (if applicable). Please keep in mind that you may add as many keywords as you wish to a variable; there is no limit to the number of keywords. The keywords are case-insensitive and you may add or remove them at any point (even after data ingestion). This is a required field.

    • Alternative names: other official, unofficial, abbreviated, or technical (jargon) names or notations associated with the variable.
      Examples: Nitrate, NO3, NO_3

    • Method and Instrument: Keywords related to the method and instruments used for the variable measurements.
      Examples: observation, in-situ, model, satellite, remote sensing, cruise, CTD, cytometry, ....

      Note these keywords are not mutually exclusive. For example, a CTD temperature measurement during a cruise can have all of the following keywords: observation, in-situ, cruise, CTD

    • Data Producers: Keywords associated with the lead scientist/lab name/institute name.
      Examples: UW, University of Washington, Virginia Armbrust, Ginger

    • Cruise: The official/unofficial name of the cruise(s) during which the variable has been measured, if applicable.
      Examples: KOK1606, Gradients_1, diel

    • Project name: If your data are in the context of a project, include the project name.
      Examples: HOT, Darwin, seaflow
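Since keywords are case-insensitive, entries gathered from the areas above can be merged into one list while dropping repeats that differ only in letter case. This is a minimal sketch with a hypothetical function name; how the final list is laid out in the var_keywords cell is not specified here and is left out.

```python
def merge_keywords(*groups):
    """Merge keyword lists, keeping the first spelling of each case-insensitive match."""
    seen = set()
    merged = []
    for group in groups:
        for keyword in group:
            cleaned = keyword.strip()
            key = cleaned.casefold()
            if cleaned and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


keywords = merge_keywords(
    ["Nitrate", "NO3", "NO_3"],                  # alternative names
    ["observation", "in-situ", "cruise", "CTD"],  # method and instrument
    ["no3", "nitrate "],                         # repeats in a different case
)
```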



  • var_comment
    Use this field to communicate any detailed information about this particular variable with the users. var_comment is visible in the Simons CMAP catalog (Fig. 11). This field is not required.

    • Required: No (optional)
    • Constraint: No length limits

Variable description in Catalog

Figure 11. A sample dataset shown in the Simons CMAP catalog. The "var_comment" is accessible using the "Comment" button, highlighted in the red rectangle.



Bibliography

References

  1. NetCDF Climate and Forecast (CF) Metadata Conventions
  2. Conventions for the standardization of NetCDF files
  3. COARDS NetCDF Conventions
