Merged
17 changes: 7 additions & 10 deletions dataretrieval/waterdata/api.py
@@ -165,8 +165,7 @@ def get_daily(
         if your internet connection is spotty. The default (NA) will set the
         limit to the maximum allowable limit for the service.
     convert_type : boolean, optional
-        If True, the function will convert the data to dates and qualifier to
-        string vector
+        If True, converts columns to appropriate types.
 
     Returns
     -------
@@ -475,6 +474,8 @@ def get_monitoring_locations(
         The returning object will be a data frame with no spatial information.
         Note that the USGS Water Data APIs use camelCase "skipGeometry" in
         CQL2 queries.
+    convert_type : boolean, optional
+        If True, converts columns to appropriate types.
 
     Returns
     -------
@@ -666,8 +667,7 @@ def get_time_series_metadata(
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
     convert_type : boolean, optional
-        If True, the function will convert the data to dates and qualifier to
-        string vector
+        If True, converts columns to appropriate types.
 
     Returns
     -------
@@ -842,8 +842,7 @@ def get_latest_continuous(
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
     convert_type : boolean, optional
-        If True, the function will convert the data to dates and qualifier to
-        string vector
+        If True, converts columns to appropriate types.
 
     Returns
     -------
@@ -1017,8 +1016,7 @@ def get_latest_daily(
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
     convert_type : boolean, optional
-        If True, the function will convert the data to dates and qualifier to
-        string vector
+        If True, converts columns to appropriate types.
 
     Returns
     -------
@@ -1183,8 +1181,7 @@ def get_field_measurements(
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
     convert_type : boolean, optional
-        If True, the function will convert the data to dates and qualifier to
-        string vector
+        If True, converts columns to appropriate types.
 
     Returns
     -------
48 changes: 33 additions & 15 deletions dataretrieval/waterdata/utils.py
@@ -668,32 +668,48 @@ def _arrange_cols(
     return df.rename(columns={"id": output_id})
 
 
-def _cleanup_cols(df: pd.DataFrame, service: str = "daily") -> pd.DataFrame:
+def _type_cols(df: pd.DataFrame) -> pd.DataFrame:
     """
-    Cleans and standardizes columns in a pandas DataFrame for water data endpoints.
+    Casts columns into appropriate types.
 
     Parameters
     ----------
     df : pd.DataFrame
         The input DataFrame containing water data.
-    service : str, optional
-        The type of water data service (default is "daily").
 
     Returns
     -------
     pd.DataFrame
-        The cleaned DataFrame with standardized columns.
+        The DataFrame with columns cast to appropriate types.
 
     Notes
     -----
-    - If the 'time' column exists and service is "daily", it is converted to date objects.
     - The 'value' and 'contributing_drainage_area' columns are coerced to numeric types.
     """
-    if "time" in df.columns and service == "daily":
-        df["time"] = pd.to_datetime(df["time"]).dt.date
Collaborator:
This piece of the function originally was just changing the "time" column to simply a date (not a timestamp) for the daily values endpoint only, so that the user wouldn't be confused about whether the value represents a daily aggregated value (min, max, mean, etc.) or a particular measurement. This logic was initially introduced to match what R dataRetrieval was doing: https://github.com/DOI-USGS/dataRetrieval/blob/develop/R/walk_pages.R#L141

Collaborator Author (@thodson-usgs, Dec 2, 2025):
Right. It was my recollection that dt.date lacks datetime functionality; furthermore, the parsing behavior of pd.to_datetime seems to have changed. By default, it correctly omits the time information. Maybe this was a pandas update, but in the current version, it seems correct to leave "time" as a datetime object.

Collaborator:
I am not seeing that, or I am misunderstanding. When I run your branch and pull from get_latest_daily, the "time" column shows up as "2025-12-01 00:00:00", whereas in the existing implementation, it shows up as "2025-12-01".

check, md = waterdata.get_latest_daily(
    monitoring_location_id="USGS-05129115",
    parameter_code="00060"
)

I like the existing implementation for daily summaries only, because the date cannot be confused with a singular measurement, and it indeed represents a "summary" value. However, if it causes problems by being inconsistent in the daily summary services, I'm open to applying a consistent rule.

Collaborator:
Ah, you're correct, dt.date does lack datetime functionality. It changes it to an object. Hm. It does make sense to give it a datetime type. Never mind. We might then just want to say that the additional "00:00:00" added to it doesn't represent a singular value.

Collaborator Author:
Did it display 00:00:00? It didn't for me, so this behavior was probably changed at some version of pandas.
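The dtype difference being debated can be reproduced offline; a minimal sketch, not tied to the dataretrieval API:

```python
import pandas as pd

# A raw "time" column as the service might return it (ISO date strings).
raw = pd.Series(["2025-12-01", "2025-12-02"])

# pd.to_datetime keeps datetime functionality (datetime64 dtype); individual
# elements carry a midnight time component, e.g. Timestamp("2025-12-01 00:00:00").
as_datetime = pd.to_datetime(raw)
print(as_datetime.dtype)  # datetime64[ns]

# .dt.date downgrades the column to plain Python date objects (object dtype),
# losing the .dt accessor and vectorized datetime operations.
as_date = as_datetime.dt.date
print(as_date.dtype)  # object
```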

-    for col in ["value", "contributing_drainage_area"]:
-        if col in df.columns:
-            df[col] = pd.to_numeric(df[col], errors="coerce")
+    cols = set(df.columns)
+    numerical_cols = [
+        "altitude",
+        "altitude_accuracy",
+        "contributing_drainage_area",
+        "drainage_area",
+        "hole_constructed_depth",
+        "value",
+        "well_constructed_depth",
+    ]
+    time_cols = [
+        "begin",
+        "begin_utc",
+        "construction_date",
+        "end",
+        "end_utc",
+        "datetime",  # unused
+        "last_modified",
+        "time",
+    ]
+
+    for col in cols.intersection(time_cols):
+        df[col] = pd.to_datetime(df[col], errors="coerce")
+
+    for col in cols.intersection(numerical_cols):
+        df[col] = pd.to_numeric(df[col], errors="coerce")
 
     return df
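The coercion pattern above can be exercised on a toy frame; this sketch re-implements the same logic with a trimmed column list (column names and values here are illustrative, not from the service):

```python
import pandas as pd

def type_cols_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Trimmed re-implementation of the _type_cols casting approach."""
    cols = set(df.columns)
    numerical_cols = ["value", "drainage_area"]
    time_cols = ["time", "last_modified"]

    # errors="coerce" turns unparseable entries into NaT/NaN
    # instead of raising.
    for col in cols.intersection(time_cols):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col in cols.intersection(numerical_cols):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

df = type_cols_sketch(pd.DataFrame({
    "time": ["2025-12-01", "not-a-date"],
    "value": ["3.14", "ice"],       # "ice" becomes NaN
    "qualifier": ["A", "P"],        # untouched: in neither list
}))
print(df.dtypes)
```

Using `set.intersection` means absent columns are simply skipped, so the same function works across endpoints that return different subsets of columns.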


@@ -749,8 +765,10 @@ def get_ogc_data(
     )
     # Manage some aspects of the returned dataset
     return_list = _deal_with_empty(return_list, properties, service)
+
     if convert_type:
-        return_list = _cleanup_cols(return_list, service=service)
+        return_list = _type_cols(return_list)
+
     return_list = _arrange_cols(return_list, properties, output_id)
     # Create metadata object from response
     metadata = BaseMetadata(response)
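The convert_type gate means callers can opt out of coercion and keep the raw string payload; a minimal sketch of the pattern (fetch_sketch is a simplified stand-in, not the actual get_ogc_data):

```python
import pandas as pd

def fetch_sketch(convert_type: bool = True) -> pd.DataFrame:
    # Stand-in for the raw, all-string frame parsed from an API response.
    df = pd.DataFrame({"time": ["2025-12-01"], "value": ["42.0"]})
    if convert_type:
        df["time"] = pd.to_datetime(df["time"], errors="coerce")
        df["value"] = pd.to_numeric(df["value"], errors="coerce")
    return df

typed = fetch_sketch()                   # value is float64, time is datetime64
raw = fetch_sketch(convert_type=False)   # both columns stay as strings
print(typed.dtypes["value"], raw.dtypes["value"])
```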