Skip to content

Conversation

@ehinman
Copy link
Collaborator

@ehinman ehinman commented Mar 27, 2025

This PR adds a function, documentation, testing, etc. for pulling USGS water quality data from the Samples database.

Additional features needed:

  • tests

Additional features that would be helpful:

@ehinman
Copy link
Collaborator Author

ehinman commented Mar 28, 2025

@ldecicco-USGS tagging you in case you have some time to take a look and offer feedback on this function. I used the dataRetrieval version as inspiration.

Copy link
Collaborator

@jzemmels jzemmels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me. The actual function code looks boilerplate (in a good way). I've added some comments/suggestions to the documentation that you might consider addressing/implementing.

The start date if using a date range. Takes the format YYYY-MM-DD.
The logic is inclusive, i.e. it will also return results that
match the date. If left as None, will pull all data on or before
activityStartDateUpper, of populated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this sentence -- "of populated" what?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! "if populated"!

Example: "Sample-Routine, regular".
characteristicGroup : string or list of strings, optional
Characteristic group is a broad category describing one or more
of results.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Suggested change
of results.
results.

Characteristic group is a broad category describing one or more
of results.
Example: "Organics, PFAS"
characteristc : string or list of strings, optional
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
characteristc : string or list of strings, optional
characteristic : string or list of strings, optional

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really was on a roll with these typos last week. Thanks for catching.

pointLocationLatitude and pointLocationLongitude
projectIdentifier : string or list of strings, optional
Designator used to uniquely id a data collection project in
organization context.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an "organization context?" Is there a good example?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this was lifted directly from the Samples API descriptions or dataRetrieval. I'm going to rewrite and let me see if I can add an example.

"actgroup", "count"
projects - "project", "projectmonitoringlocationweight"
organizations - "organization", "count"
activityMediaName : string or list of strings, optional
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda crazy to me that the only doc pages on this type of parameter are from dataRetrieval or related packages. The WQP documentation just says something to the effect of "look at the output for more information." Why include it as a query-able parameter if the parameter's possible values aren't communicated to the user? This isn't something to solve in this PR, or even in the package overall, but I feel like this is something seriously lacking in the API documentation we expect users to read.

I found the "sample media" endpoint through the codeservice link you provided that looks to provide the actual values for these more obscure parameters. But pinging one endpoint to learn about the parameter values of a different endpoint seems overly complicated to me. Non-rhetorically: are these parameters so volatile that we couldn't provide static documentation somewhere that describes the values they can take?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you make great points. I believe @ldecicco-USGS has incorporated some of these types of lookup tables to prevent users from passing invalid options to the API, and there are some basic lookup tables in dataRetrieval too. I don't see it as too much trouble to have a set of functions called "[input name]_lookup" that a user can run to return all viable options for Samples. Doubly useful is Laura's approach of checking a user's query before it is made. I will add this as an issue (if it isn't there already) and we can tackle this as a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! There's actually a web service to check available parameters:

https://github.com/DOI-USGS/dataRetrieval/blob/main/R/read_USGS_samples.R#L334

The main bit:

service_options <- c("characteristicgroup", "states", "counties",
                       "countries", "sitetype", "samplemedia",
                       "characteristics", "observedproperty")
  
  check_group_req <- httr2::request("https://api.waterdata.usgs.gov") |> 
    httr2::req_url_path_append("samples-data",
                               "codeservice",
                               service)

For instance:
So, https://api.waterdata.usgs.gov/samples-data/codeservice/samplemedia shows all the available sample media options. It's probably not described in the main documentation because (I think) it's dynamically populated from the values in the data base. So, if a data provider suddenly adds "Jello", the full documentation doesn't need a manual update. I think there's a codeservice Swagger somewhere.....

Here it is:
https://api.waterdata.usgs.gov/samples-data/codeservice/docs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Laura, I agree: while some of the options may not change very often, I still think it is a smart move to make the options a service rather than a "hard-coded" list....data managers are adding new PFAS compounds every day, for instance.

This PR links to that codeservice in the samples function documentation, but I do think Joe's point is valid that we can be nicer to the user in python and avoid sending them to the Swagger doc to find all available options. I bet it's a similar number of lines of code in python!

@ehinman ehinman marked this pull request as ready for review April 24, 2025 17:13
@thodson-usgs
Copy link
Collaborator

I made some suggestions in the following PR: ehinman#3
In general, please follow PEP8 conventions. While working on this, I also noticed that Jay's unit testing CI doesn't work correctly. We're failing on many tests, yet the action reports them as successful.

thodson-usgs and others added 3 commits April 29, 2025 15:35
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
@ehinman ehinman merged commit 4b3a3e8 into DOI-USGS:main May 6, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants