Access and cite point observation data

This notebook walks through examples of accessing and citing point observation data and site-level metadata using hf_hydrodata’s get_point_data and get_point_metadata functions. See the full point module documentation for details on the available data, our data collection process, and features in development. The Metadata Description page lists the fields returned by get_point_metadata.

[1]:
# Import packages
import sys
import os
import pandas as pd
from hf_hydrodata import register_api_pin, get_point_data, get_point_metadata, get_citations
[ ]:
# You need to register on https://hydrogen.princeton.edu/pin
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")
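
To avoid typing an email and PIN directly into a notebook, the credentials could be read from environment variables before calling register_api_pin. The following is a minimal sketch under that assumption; the variable names HF_HYDRODATA_EMAIL and HF_HYDRODATA_PIN and the helper load_credentials are hypothetical and not part of hf_hydrodata.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
EMAIL_VAR = "HF_HYDRODATA_EMAIL"
PIN_VAR = "HF_HYDRODATA_PIN"


def load_credentials(env=os.environ):
    """Return (email, pin) from the environment, or raise if either is missing."""
    missing = [name for name in (EMAIL_VAR, PIN_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")
    return env[EMAIL_VAR], env[PIN_VAR]


# Usage, once both variables are set in your shell:
# email, pin = load_credentials()
# register_api_pin(email, pin)
```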

Define input parameters

Note that get_point_data and get_point_metadata require the parameters dataset, variable, temporal_resolution, and aggregation (plus depth_level when requesting soil moisture data). See the documentation for the available point observation datasets and the parameters used to query them.

The hf_hydrodata API Reference includes information on what optional filtering parameters are available. These include filters for things like a geographic region or date range. Those parameters work cumulatively, so if state and site_ids are both supplied, for example, then only sites within site_ids that are also in state will be returned.
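
The cumulative behavior described above amounts to a logical AND across filters. The sketch below illustrates the idea on a small synthetic table with pandas; it does not call hf_hydrodata, and the site IDs are made up.

```python
import pandas as pd

# Synthetic site table for illustration only; these are not real query results.
sites = pd.DataFrame({
    "site_id": ["A1", "A2", "B1", "B2"],
    "state": ["NJ", "NJ", "CO", "CO"],
})

requested_ids = ["A1", "B1"]   # stands in for a site_ids filter
requested_state = "NJ"         # stands in for a state filter

# Each filter narrows the result further (logical AND).
in_ids = sites["site_id"].isin(requested_ids)
in_state = sites["state"] == requested_state
selected = sites[in_ids & in_state]

print(selected["site_id"].tolist())  # ['A1']
```

Site B1 is in the requested list but not in the requested state, so it is excluded.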

Example 1: Specify a date range and geographic bounding box

In this example, specific start and end dates are provided, along with a geographic domain. Start and end dates, if provided, must be in 'YYYY-MM-DD' format. If no start date is provided, data is returned from the earliest available date. Likewise, if no end date is provided, data is returned through the most recent available date.
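
Because dates must follow the 'YYYY-MM-DD' format, it can help to check date strings before sending a request. The helper below is a hypothetical sketch using only the standard library; it is not part of hf_hydrodata and says nothing about how the library itself validates dates.

```python
from datetime import datetime


def is_valid_date_string(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    fmt = "%Y-%m-%d"
    try:
        parsed = datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return False
    # Round-trip to reject unpadded forms such as '2002-1-5', which strptime accepts.
    return parsed.strftime(fmt) == value


print(is_valid_date_string("2002-01-05"))  # True
print(is_valid_date_string("01/05/2002"))  # False
print(is_valid_date_string("2002-02-30"))  # False
```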

Geographic domain specifications, if provided, can be in the form of latitude and/or longitude bounds, a two-letter state postal code (state='NJ'), a specific list of site IDs (see example 2 below), or a shapefile (see example notebook “How To Filter Sites by USGS HUC Boundaries”). If no geographic restriction is included, sites from the entire continental United States will be returned. In many cases, this exceeds a user’s single-request limit of 1 GB. Please add geography and/or date filters as needed to keep requests within this limit.
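
One possible way to keep individual requests small is to split a long date range into calendar-year pieces and request each piece separately. The helper below is a hypothetical sketch of that splitting step only, using the standard library; whether chunking by year is enough to stay under the limit depends on how many sites a request covers.

```python
from datetime import date


def split_by_year(date_start, date_end):
    """Split an inclusive YYYY-MM-DD range into per-calendar-year (start, end) pairs."""
    start = date.fromisoformat(date_start)
    end = date.fromisoformat(date_end)
    if start > end:
        raise ValueError("date_start must not be after date_end")
    chunks = []
    for year in range(start.year, end.year + 1):
        chunk_start = max(start, date(year, 1, 1))
        chunk_end = min(end, date(year, 12, 31))
        chunks.append((chunk_start.isoformat(), chunk_end.isoformat()))
    return chunks


print(split_by_year("2002-11-15", "2004-02-01"))
# [('2002-11-15', '2002-12-31'), ('2003-01-01', '2003-12-31'), ('2004-01-01', '2004-02-01')]
```

Each pair could then be passed as date_start and date_end, with the resulting DataFrames concatenated afterwards.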

[2]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Request point observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05", latitude_range=(45, 50), longitude_range=(-75, -50))

# View first five records
data_df.head(5)
[2]:
         date  01011000  01013500  01015800  01017000  01017550  01018000  01019000  01027200  01029200  ...  01046500  01129200  01010000  01010070  01010500  01014000  01018500  01021000  04264331  04294300
0  2002-01-01    9.7069   13.8104   12.9048   21.3099  0.013301       NaN    3.0847   1.98666   2.43663  ...    46.129   23.9984   11.9143   1.48292   24.0550    61.411    9.1126   21.9042    6084.5    0.2547
1  2002-01-02    9.5371   13.4142   12.0558   20.0364  0.012169       NaN    3.0564   1.91874   2.39135  ...    46.695   23.8286   11.6879   1.41500   23.4890    59.713    9.0277   21.9042    6056.2    0.2547
2  2002-01-03    9.3390   13.0746   11.5181   19.0742  0.011886       NaN    3.0281   1.88195   2.36305  ...    46.978   23.8286   11.5181   1.35840   23.0645    58.581    8.9145   21.9042    6084.5    0.2547
3  2002-01-04    9.1692   12.6501   11.0936   26.4322  0.011320       NaN    3.0564   1.83667   2.34890  ...    51.506   23.6305   11.2917   1.31312   22.6400    57.449    8.8579   21.9042    6056.2    0.2547
4  2002-01-05    8.9994   12.2822   10.6691   25.1870  0.010754       NaN    3.0281   1.79139   2.32060  ...    37.639   23.6022   11.0936   1.27633   22.2155    56.317    8.7447   21.9042    5546.8    0.2830

5 rows × 32 columns
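
The returned DataFrame is in wide format, with one column per site. For grouping or plotting it is sometimes easier to work with a long format that has one row per date and site. The sketch below shows one way to reshape with pandas on a small synthetic frame; the values are illustrative, not query results.

```python
import pandas as pd

# Synthetic frame with the same wide layout (values are illustrative only).
wide = pd.DataFrame({
    "date": ["2002-01-01", "2002-01-02"],
    "01011000": [9.7, 9.5],
    "01013500": [13.8, None],
})

# One row per (date, site) pair; drop pairs with no observation.
long = wide.melt(id_vars="date", var_name="site_id", value_name="value").dropna(subset=["value"])

print(long)
```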

[3]:
# Request site-level metadata for these sites (using the same filters)
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2002-01-01", date_end="2002-01-05", latitude_range=(45, 50), longitude_range=(-75, -50))

# View first five records
metadata_df.head(5)
[3]:
    site_id                             site_name     site_type  agency  state   latitude   longitude  first_date_data_available  last_date_data_available  record_count  ...   doi      huc8  conus1_x  conus1_y  conus2_x  conus2_y  gagesii_drainage_area  gagesii_class  gagesii_site_elevation  usgs_drainage_area
0  01011000   Allagash River near Allagash, Maine  stream gauge    USGS     ME  47.069722  -69.079444                 1910-07-01                2023-11-30         34028  ...  None  01010002       nan       nan      4210      2783              3186.8440        Non-ref                   187.0             1478.00
1  01013500      Fish River near Fort Kent, Maine  stream gauge    USGS     ME  47.237500  -68.582778                 1903-07-29                2023-12-01         36507  ...  None  01010003       nan       nan      4237      2810              2252.6960            Ref                   157.0              873.00
2  01015800  Aroostook River near Masardis, Maine  stream gauge    USGS     ME  46.523056  -68.371667                 1957-09-14                2023-12-01         24185  ...  None  01010004       nan       nan      4276      2747              2313.7550        Non-ref                   166.0              892.00
3  01017000    Aroostook River at Washburn, Maine  stream gauge    USGS     ME  46.777222  -68.157222                 1930-08-01                2023-12-01         34091  ...  None  01010004       nan       nan      4281      2773              4278.9070        Non-ref                   131.0             1654.00
4  01017550        Williams Brook at Phair, Maine  stream gauge    USGS     ME  46.628056  -67.953056                 1999-11-01                2023-12-01          8797  ...  None  01010005       nan       nan      4300      2762                10.0323            Ref                   176.0                3.82

5 rows × 23 columns

[4]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")
[4]:
'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'

Example 2: Specify a site ID or list of site IDs without a time restriction

Instead of latitude/longitude bounds, data for a specific stream gauge or groundwater well can be requested by site ID, with or without date bounds. Below, daily streamflow data is returned for a single site and then for a list of sites. These examples have no time restriction, so all available data for those sites is returned.

[5]:
# Request point observations data for a single site
data = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", site_ids="01013500")

# View first five rows
print("First five records: ")
print(data.head(5))

# View final five rows
print("\n Final five records: ")
print(data.tail(5))
First five records:
         date  01013500
0  1903-07-29   21.5646
1  1903-07-30   21.5646
2  1903-07-31   21.5646
3  1903-08-01   19.2723
4  1903-08-02   18.1686

 Final five records:
             date  01013500
36502  2023-11-27    30.281
36503  2023-11-28    31.413
36504  2023-11-29    30.564
36505  2023-11-30    30.281
36506  2023-12-01    29.715
[6]:
# Request the metadata for that site
metadata = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", site_ids="01013500")
metadata.head()
[6]:
    site_id                         site_name     site_type  agency  state  latitude   longitude  first_date_data_available  last_date_data_available  record_count  ...   doi      huc8  conus1_x  conus1_y  conus2_x  conus2_y  gagesii_drainage_area  gagesii_class  gagesii_site_elevation  usgs_drainage_area
0  01013500  Fish River near Fort Kent, Maine  stream gauge    USGS     ME   47.2375  -68.582778                 1903-07-29                2023-12-01         36507  ...  None  01010003       nan       nan      4237      2810               2252.696            Ref                   157.0               873.0

1 rows × 23 columns

[7]:
# Request point observations data for multiple sites
data = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                      site_ids=["01013500", "01011000", "01029500"])

# View first five rows
print("First five records: ")
print(data.head(5))

# View final five rows
print("\n Final five records: ")
print(data.tail(5))
First five records:
         date  01011000  01013500  01029500
0  1902-10-01       NaN       NaN    19.810
1  1902-10-02       NaN       NaN    19.810
2  1902-10-03       NaN       NaN    19.810
3  1902-10-04       NaN       NaN    18.678
4  1902-10-05       NaN       NaN    17.546

 Final five records:
             date  01011000  01013500  01029500
44252  2023-11-27       NaN    30.281    41.035
44253  2023-11-28       NaN    31.413       NaN
44254  2023-11-29       NaN    30.564       NaN
44255  2023-11-30       NaN    30.281       NaN
44256  2023-12-01       NaN    29.715       NaN
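
In the multi-site output above, NaN appears where a site has no value for a given date, so each site's record can start and end on different dates. The sketch below shows one way to summarize the first and last observed dates and the observation count per site with pandas. It uses a synthetic frame with made-up site names and values rather than the query result.

```python
import pandas as pd

# Synthetic wide frame; values are illustrative only.
data = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "site_a": [None, 1.0, 1.2, None],
    "site_b": [5.0, 5.1, None, 5.3],
})

values = data.set_index("date")

# For each site column: first and last dates with a value, and the number of values.
summary = pd.DataFrame({
    "first_obs": values.apply(lambda col: col.first_valid_index()),
    "last_obs": values.apply(lambda col: col.last_valid_index()),
    "n_obs": values.count(),
})

print(summary)
```
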
[8]:
# Request the site-level attributes for those sites
metadata = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                              site_ids=["01013500", "01011000", "01029500"])
metadata.head()
[8]:
    site_id                                         site_name     site_type  agency  state   latitude   longitude  first_date_data_available  last_date_data_available  record_count  ...   doi      huc8  conus1_x  conus1_y  conus2_x  conus2_y  gagesii_drainage_area  gagesii_class  gagesii_site_elevation  usgs_drainage_area
0  01011000               Allagash River near Allagash, Maine  stream gauge    USGS     ME  47.069722  -69.079444                 1910-07-01                2023-11-30         34028  ...  None  01010002       nan       nan      4210      2783               3186.844        Non-ref                   187.0              1478.0
1  01013500                  Fish River near Fort Kent, Maine  stream gauge    USGS     ME  47.237500  -68.582778                 1903-07-29                2023-12-01         36507  ...  None  01010003       nan       nan      4237      2810               2252.696            Ref                   157.0               873.0
2  01029500  East Branch Penobscot River at Grindstone, Maine  stream gauge    USGS     ME  45.730278  -68.589444                 1902-10-01                2023-12-01         37315  ...  None  01020002       nan       nan      4293      2656               2816.295        Non-ref                    93.0               837.0

3 rows × 23 columns

[9]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")
[9]:
'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'

Example 3: Add a restriction on the minimum number of observations per site within a requested time range

The min_num_obs parameter restricts results to sites that have at least the given number of observations within the specified time range (if one is provided).

The example below returns only sites that have valid streamflow data for every day of the requested calendar year (2005, which has 365 days).
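
Conceptually, a minimum-observation filter keeps a site only if its count of non-missing values in the time range meets the threshold. The sketch below illustrates that idea on a synthetic wide frame with pandas. It is an illustration of the concept under that assumption, not a description of how hf_hydrodata implements min_num_obs.

```python
import pandas as pd

# Synthetic wide frame; values are illustrative only.
data = pd.DataFrame({
    "date": ["2005-01-01", "2005-01-02", "2005-01-03"],
    "site_a": [1.0, 1.1, 1.2],
    "site_b": [2.0, None, 2.2],
})

min_num_obs = 3

# Count non-missing values per site column, then keep sites that meet the threshold.
obs_counts = data.drop(columns="date").count()
keep = obs_counts[obs_counts >= min_num_obs].index.tolist()
filtered = data[["date"] + keep]

print(filtered.columns.tolist())  # ['date', 'site_a']
```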

[10]:
# Request point observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2005-01-01", date_end="2005-12-31",
                         state="CO",
                         min_num_obs=365)

# View first five records
data_df.head(5)
[10]:
         date  06614800  06620000  06701500  06701900  06707500  06708800  06709000  06709530  06710150  ...  382628104493700  382629104493000  383619104520401  383637104531301  383944104474201  384037104472001  384047104510301  384048104504901  384220104503701  391504106225200
0  2005-01-01  0.013584    3.5375    1.9244   2.19325    5.2921  0.163574   0.52072   0.55751  0.056600  ...              0.0              0.0         0.002547              0.0              0.0         0.021508              0.0         0.008490              0.0         0.004245
1  2005-01-02  0.013301    3.3960    1.9244   2.14514    5.2072  0.144896   0.48110   0.53770  0.052355  ...              0.0              0.0         0.002547              0.0              0.0         0.024621              0.0         0.008207              0.0         0.004245
2  2005-01-03  0.013301    3.3111    1.9244   2.15080    5.1506  0.128765   0.49525   0.50374  0.058015  ...              0.0              0.0         0.002547              0.0              0.0         0.023772              0.0         0.007924              0.0         0.004245
3  2005-01-04  0.013301    3.3960    1.9244   2.15080    5.0091  0.119992   0.48110   0.48110  0.051506  ...              0.0              0.0         0.002547              0.0              0.0         0.025470              0.0         0.007924              0.0         0.004245
4  2005-01-05  0.013301    3.3960    1.9244   2.23853    4.1035  0.139236   0.41601   0.50374  0.046412  ...              0.0              0.0         0.002547              0.0              0.0         0.022923              0.0         0.007924              0.0         0.004245

5 rows × 269 columns

[11]:
# NOTE: Metadata access does not support the `min_num_obs` filter because it does not inspect the data contents for the sliced date range.
# Metadata access only filters on overall data availability to be within the specified range.

# The following is an example workflow for obtaining metadata for only those sites that
# additionally satisfy the `min_num_obs` filter
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2005-01-01", date_end="2005-12-31",
                                 state="CO")

# Site IDs are the data columns other than 'date'
site_ids = [col for col in data_df.columns if col != 'date']
filtered_site_list = pd.DataFrame(data=site_ids, columns=['site_id'])
filtered_metadata_df = pd.merge(filtered_site_list, metadata_df, on='site_id', how='left')
# Expect one metadata row per returned site
assert len(filtered_metadata_df) == len(site_ids)

# View first five records
filtered_metadata_df.head()
[11]:
    site_id                                          site_name     site_type  agency  state   latitude    longitude  first_date_data_available  last_date_data_available  record_count  ...   doi      huc8  conus1_x  conus1_y  conus2_x  conus2_y  gagesii_drainage_area  gagesii_class  gagesii_site_elevation  usgs_drainage_area
0  06614800               MICHIGAN RIVER NEAR CAMERON PASS, CO  stream gauge    USGS     CO  40.496094  -105.865012                 1973-10-01                2023-12-01         18322  ...  None  10180001      1054       818      1481      1764                 4.0284            Ref                  3188.0                1.54
1  06620000              NORTH PLATTE RIVER NEAR NORTHGATE, CO  stream gauge    USGS     CO  40.936639  -106.339194                 1904-06-01                2023-12-01         39782  ...  None  10180001      1020       870      1448      1817              3702.6370        Non-ref                  2388.0             1431.00
2  06701500         SOUTH PLATTE RIVER BELOW CHEESMAN LAKE, CO  stream gauge    USGS     CO  39.209157  -105.267773                 1924-10-01                2007-09-29         29217  ...  None  10190002      1091       671       nan       nan              4557.0680        Non-ref                  2081.0             1752.00
3  06701900  SOUTH PLATTE RIVER BLW BRUSH CRK NEAR TRUMBULL...  stream gauge    USGS     CO  39.259990  -105.221938                 2002-07-19                2023-12-01          7792  ...  None  10190002       nan       nan      1523      1627              5252.5570        Non-ref                  1990.0             2028.00
4  06707500             SOUTH PLATTE RIVER AT SOUTH PLATTE, CO  stream gauge    USGS     CO  39.409156  -105.169990                 1896-01-01                2007-09-29         32959  ...  None  10190002       nan       nan       nan       nan              6689.0300        Non-ref                  1901.0             2579.00

5 rows × 23 columns
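
An alternative to the merge in the cell above is to filter the metadata rows directly with isin and then check that every returned site has a metadata row. The sketch below shows this on synthetic stand-ins for data_df and metadata_df; the site IDs s1, s2, and s3 are made up.

```python
import pandas as pd

# Synthetic stand-ins for data_df and metadata_df; values are illustrative only.
data_df = pd.DataFrame({"date": ["2005-01-01"], "s1": [0.5], "s3": [1.5]})
metadata_df = pd.DataFrame({
    "site_id": ["s1", "s2", "s3"],
    "state": ["CO", "CO", "CO"],
})

site_ids = [col for col in data_df.columns if col != "date"]

# Keep only metadata rows whose site appears in the data columns.
filtered_metadata_df = metadata_df[metadata_df["site_id"].isin(site_ids)].reset_index(drop=True)

# Check that no returned site lacks a metadata row.
missing = set(site_ids) - set(filtered_metadata_df["site_id"])
assert not missing, f"Sites without metadata: {sorted(missing)}"

print(filtered_metadata_df["site_id"].tolist())  # ['s1', 's3']
```

Unlike a left merge, this keeps the row order of the metadata table and never produces rows of missing values for a site without metadata, so the explicit check is what reports such a site.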

[12]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")
[12]:
'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'