Access and cite point observation data
This notebook walks through examples of accessing and citing point observation data and site-level metadata using hf_hydrodata’s get_point_data and get_point_metadata functions. Please see the full point module documentation for information on what data is available, our data collection process, and new features we are working on. Our Metadata Description page itemizes the fields returned by get_point_metadata.
[1]:
# Import packages
import sys
import os
import pandas as pd
from hf_hydrodata import register_api_pin, get_point_data, get_point_metadata, get_citations
[ ]:
# You need to register on https://hydrogen.princeton.edu/pin
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")
Define input parameters
Note that get_point_data and get_point_metadata require the parameters dataset, variable, temporal_resolution, and aggregation (plus depth_level when requesting soil moisture data). Please see the documentation for information about what point observation datasets are available and the parameters used to query them.
The hf_hydrodata API Reference includes information on what optional filtering parameters are available. These include filters for things like a geographic region or date range. Those parameters work cumulatively, so if state and site_ids are both supplied, for example, then only sites within site_ids that are also in state will be returned.
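The cumulative behavior described above can be pictured as a logical AND across the supplied conditions. The following is a minimal local sketch on a small hand-built table, not a call to the hf_hydrodata service. The helper `filter_sites` is hypothetical and only mimics the described semantics; the site IDs and states are taken from the example outputs in this notebook.

```python
import pandas as pd

# Toy site table; IDs and states are copied from outputs shown in this notebook.
sites = pd.DataFrame({
    "site_id": ["01011000", "01013500", "06614800", "06620000"],
    "state": ["ME", "ME", "CO", "CO"],
})


def filter_sites(df, state=None, site_ids=None):
    """Hypothetical helper: apply every supplied filter (logical AND)."""
    mask = pd.Series(True, index=df.index)
    if state is not None:
        mask &= df["state"] == state
    if site_ids is not None:
        mask &= df["site_id"].isin(site_ids)
    return df[mask].reset_index(drop=True)


# Only sites that are in the list AND in Maine survive both filters.
result = filter_sites(sites, state="ME", site_ids=["01013500", "06614800"])
print(result)
```

Here site 06614800 is in the requested list but is located in Colorado, so it is excluded once the state filter is also applied.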
Example 1: Specify a date range and geographic bounding box
In this example, a specific start and end date are provided, along with a geographic domain. Start and end dates, if provided, must be in ‘YYYY-MM-DD’ format. If a start date is not provided, data is returned from the earliest date available. Likewise, if an end date is not provided, data is returned through the most recent date available.
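Because dates must follow the YYYY-MM-DD pattern, it can help to check them locally before sending a request. The sketch below uses only the Python standard library. The function name `is_valid_date` is an assumption made for illustration and is not part of hf_hydrodata.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def is_valid_date(text):
    """Hypothetical helper: True only for zero-padded YYYY-MM-DD calendar dates."""
    try:
        parsed = datetime.strptime(text, DATE_FORMAT)
    except (ValueError, TypeError):
        return False
    # strptime tolerates unpadded fields such as "2002-1-5", so round-trip
    # the parsed value to insist on the exact zero-padded form.
    return parsed.strftime(DATE_FORMAT) == text


for candidate in ["2002-01-05", "01/05/2002", "2002-1-5", "2002-02-30"]:
    print(candidate, is_valid_date(candidate))
```

The round-trip comparison is a deliberate choice: it rejects strings that parse successfully but are not written in the exact form the text above asks for.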
Geographic domain specifications, if provided, can take the form of latitude and/or longitude bounds, a two-letter state postal code (state="NJ"), a specific list of site IDs (see Example 2 below), or a shapefile (see the example notebook “How To Filter Sites by USGS HUC Boundaries”). If no geographic restriction is included, sites from the entire continental United States are returned. In many cases, this exceeds a user’s single-request limit of 1 GB. Please add geography and/or date filters as needed to keep requests within this limit.
[2]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.
# Request point observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05", latitude_range=(45, 50), longitude_range=(-75, -50))
# View first five records
data_df.head(5)
[2]:
| | date | 01011000 | 01013500 | 01015800 | 01017000 | 01017550 | 01018000 | 01019000 | 01027200 | 01029200 | ... | 01046500 | 01129200 | 01010000 | 01010070 | 01010500 | 01014000 | 01018500 | 01021000 | 04264331 | 04294300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-01-01 | 9.7069 | 13.8104 | 12.9048 | 21.3099 | 0.013301 | NaN | 3.0847 | 1.98666 | 2.43663 | ... | 46.129 | 23.9984 | 11.9143 | 1.48292 | 24.0550 | 61.411 | 9.1126 | 21.9042 | 6084.5 | 0.2547 |
| 1 | 2002-01-02 | 9.5371 | 13.4142 | 12.0558 | 20.0364 | 0.012169 | NaN | 3.0564 | 1.91874 | 2.39135 | ... | 46.695 | 23.8286 | 11.6879 | 1.41500 | 23.4890 | 59.713 | 9.0277 | 21.9042 | 6056.2 | 0.2547 |
| 2 | 2002-01-03 | 9.3390 | 13.0746 | 11.5181 | 19.0742 | 0.011886 | NaN | 3.0281 | 1.88195 | 2.36305 | ... | 46.978 | 23.8286 | 11.5181 | 1.35840 | 23.0645 | 58.581 | 8.9145 | 21.9042 | 6084.5 | 0.2547 |
| 3 | 2002-01-04 | 9.1692 | 12.6501 | 11.0936 | 26.4322 | 0.011320 | NaN | 3.0564 | 1.83667 | 2.34890 | ... | 51.506 | 23.6305 | 11.2917 | 1.31312 | 22.6400 | 57.449 | 8.8579 | 21.9042 | 6056.2 | 0.2547 |
| 4 | 2002-01-05 | 8.9994 | 12.2822 | 10.6691 | 25.1870 | 0.010754 | NaN | 3.0281 | 1.79139 | 2.32060 | ... | 37.639 | 23.6022 | 11.0936 | 1.27633 | 22.2155 | 56.317 | 8.7447 | 21.9042 | 5546.8 | 0.2830 |
5 rows × 32 columns
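The output above is in wide format: one `date` column plus one column per site ID. Some analyses are easier with one row per site and date. The following is a small sketch of that reshaping using pandas `melt` on a hand-built frame whose values are copied from the first rows above. The column name `streamflow` is chosen here for illustration only.

```python
import pandas as pd

# Two dates and two sites copied from the wide output above;
# site 01018000 has no data in this window.
wide = pd.DataFrame({
    "date": ["2002-01-01", "2002-01-02"],
    "01011000": [9.7069, 9.5371],
    "01018000": [float("nan"), float("nan")],
})

# One row per (date, site_id) pair, then drop missing observations.
long_df = wide.melt(id_vars="date", var_name="site_id", value_name="streamflow")
long_df = long_df.dropna(subset=["streamflow"]).reset_index(drop=True)
print(long_df)
```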
[3]:
# Request site-level metadata for these sites (using the same filters)
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05", latitude_range=(45, 50), longitude_range=(-75, -50))
# View first five records
metadata_df.head(5)
[3]:
| | site_id | site_name | site_type | agency | state | latitude | longitude | first_date_data_available | last_date_data_available | record_count | ... | doi | huc8 | conus1_x | conus1_y | conus2_x | conus2_y | gagesii_drainage_area | gagesii_class | gagesii_site_elevation | usgs_drainage_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01011000 | Allagash River near Allagash, Maine | stream gauge | USGS | ME | 47.069722 | -69.079444 | 1910-07-01 | 2023-11-30 | 34028 | ... | None | 01010002 | nan | nan | 4210 | 2783 | 3186.8440 | Non-ref | 187.0 | 1478.00 |
| 1 | 01013500 | Fish River near Fort Kent, Maine | stream gauge | USGS | ME | 47.237500 | -68.582778 | 1903-07-29 | 2023-12-01 | 36507 | ... | None | 01010003 | nan | nan | 4237 | 2810 | 2252.6960 | Ref | 157.0 | 873.00 |
| 2 | 01015800 | Aroostook River near Masardis, Maine | stream gauge | USGS | ME | 46.523056 | -68.371667 | 1957-09-14 | 2023-12-01 | 24185 | ... | None | 01010004 | nan | nan | 4276 | 2747 | 2313.7550 | Non-ref | 166.0 | 892.00 |
| 3 | 01017000 | Aroostook River at Washburn, Maine | stream gauge | USGS | ME | 46.777222 | -68.157222 | 1930-08-01 | 2023-12-01 | 34091 | ... | None | 01010004 | nan | nan | 4281 | 2773 | 4278.9070 | Non-ref | 131.0 | 1654.00 |
| 4 | 01017550 | Williams Brook at Phair, Maine | stream gauge | USGS | ME | 46.628056 | -67.953056 | 1999-11-01 | 2023-12-01 | 8797 | ... | None | 01010005 | nan | nan | 4300 | 2762 | 10.0323 | Ref | 176.0 | 3.82 |
5 rows × 23 columns
[4]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")
[4]:
'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'
Example 2: Specify a site ID or list of site IDs without a time restriction
Instead of latitude/longitude bounds, data for a specific stream gauge or groundwater well can be returned with or without a date bound. Below, daily streamflow data is returned for a single site and then a select list of sites. There is no time restriction in these examples, so all data available in-house is included.
[5]:
# Request point observations data for a single site
data = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", site_ids="01013500")
# View first five rows
print("First five records: ")
print(data.head(5))
# View final five rows
print("\n Final five records: ")
print(data.tail(5))
First five records:
date 01013500
0 1903-07-29 21.5646
1 1903-07-30 21.5646
2 1903-07-31 21.5646
3 1903-08-01 19.2723
4 1903-08-02 18.1686
Final five records:
date 01013500
36502 2023-11-27 30.281
36503 2023-11-28 31.413
36504 2023-11-29 30.564
36505 2023-11-30 30.281
36506 2023-12-01 29.715
[6]:
# Request the metadata for that site
metadata = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean", site_ids="01013500")
metadata.head()
[6]:
| | site_id | site_name | site_type | agency | state | latitude | longitude | first_date_data_available | last_date_data_available | record_count | ... | doi | huc8 | conus1_x | conus1_y | conus2_x | conus2_y | gagesii_drainage_area | gagesii_class | gagesii_site_elevation | usgs_drainage_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01013500 | Fish River near Fort Kent, Maine | stream gauge | USGS | ME | 47.2375 | -68.582778 | 1903-07-29 | 2023-12-01 | 36507 | ... | None | 01010003 | nan | nan | 4237 | 2810 | 2252.696 | Ref | 157.0 | 873.0 |
1 rows × 23 columns
[7]:
# Request point observations data for multiple sites
data = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
site_ids=["01013500", "01011000", "01029500"])
# View first five rows
print("First five records: ")
print(data.head(5))
# View final five rows
print("\n Final five records: ")
print(data.tail(5))
First five records:
date 01011000 01013500 01029500
0 1902-10-01 NaN NaN 19.810
1 1902-10-02 NaN NaN 19.810
2 1902-10-03 NaN NaN 19.810
3 1902-10-04 NaN NaN 18.678
4 1902-10-05 NaN NaN 17.546
Final five records:
date 01011000 01013500 01029500
44252 2023-11-27 NaN 30.281 41.035
44253 2023-11-28 NaN 31.413 NaN
44254 2023-11-29 NaN 30.564 NaN
44255 2023-11-30 NaN 30.281 NaN
44256 2023-12-01 NaN 29.715 NaN
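When several sites are requested together, the rows span the union of all their dates, so a site shows NaN wherever it has no observation. One way to summarize each site's coverage is sketched below on a hand-built frame that copies three rows from the output above. The summary column names (`first_obs`, `last_obs`, `n_obs`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Three of the final rows shown above, for two of the requested sites.
wide = pd.DataFrame({
    "date": ["2023-11-27", "2023-11-28", "2023-11-29"],
    "01013500": [30.281, 31.413, 30.564],
    "01029500": [41.035, np.nan, np.nan],
})

indexed = wide.set_index("date")

# Per-site first and last date with a non-missing value, plus the count.
coverage = pd.DataFrame({
    "first_obs": indexed.apply(lambda s: s.first_valid_index()),
    "last_obs": indexed.apply(lambda s: s.last_valid_index()),
    "n_obs": indexed.notna().sum(),
})
print(coverage)
```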
[8]:
# Request the site-level attributes for those sites
metadata = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
site_ids=["01013500", "01011000", "01029500"])
metadata.head()
[8]:
| | site_id | site_name | site_type | agency | state | latitude | longitude | first_date_data_available | last_date_data_available | record_count | ... | doi | huc8 | conus1_x | conus1_y | conus2_x | conus2_y | gagesii_drainage_area | gagesii_class | gagesii_site_elevation | usgs_drainage_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01011000 | Allagash River near Allagash, Maine | stream gauge | USGS | ME | 47.069722 | -69.079444 | 1910-07-01 | 2023-11-30 | 34028 | ... | None | 01010002 | nan | nan | 4210 | 2783 | 3186.844 | Non-ref | 187.0 | 1478.0 |
| 1 | 01013500 | Fish River near Fort Kent, Maine | stream gauge | USGS | ME | 47.237500 | -68.582778 | 1903-07-29 | 2023-12-01 | 36507 | ... | None | 01010003 | nan | nan | 4237 | 2810 | 2252.696 | Ref | 157.0 | 873.0 |
| 2 | 01029500 | East Branch Penobscot River at Grindstone, Maine | stream gauge | USGS | ME | 45.730278 | -68.589444 | 1902-10-01 | 2023-12-01 | 37315 | ... | None | 01020002 | nan | nan | 4293 | 2656 | 2816.295 | Non-ref | 93.0 | 837.0 |
3 rows × 23 columns
[9]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")
[9]:
'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'
Example 3: Add a restriction on the minimum number of observations per site within a requested time range
The parameter min_num_obs allows the user to further specify that a site must have a minimum number of observations within the specified time range (if one is provided).
The example below ensures that only sites that have valid streamflow data for every day of the calendar year requested get returned.
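As a rough local illustration of what a minimum-observation threshold means (not a description of how hf_hydrodata implements min_num_obs), the sketch below counts non-missing values per site in a small wide-format frame. The site names `site_a` and `site_b` and the helper `sites_meeting_minimum` are hypothetical.

```python
import numpy as np
import pandas as pd

# Five days of toy data; site_b is missing two observations.
wide = pd.DataFrame({
    "date": ["2005-01-01", "2005-01-02", "2005-01-03", "2005-01-04", "2005-01-05"],
    "site_a": [1.0, 1.1, 1.2, 1.3, 1.4],
    "site_b": [2.0, np.nan, 2.2, np.nan, 2.4],
})


def sites_meeting_minimum(df, min_num_obs):
    """Hypothetical helper: site columns with at least min_num_obs non-NaN values."""
    counts = df.drop(columns="date").notna().sum()
    return counts[counts >= min_num_obs].index.tolist()


# Requiring an observation on every one of the five days keeps only site_a.
kept = sites_meeting_minimum(wide, min_num_obs=5)
print(kept)
```

Setting the threshold equal to the number of days in the requested range, as the example below does with 365 for the non-leap year 2005, keeps only sites with complete records.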
[10]:
# Request point observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2005-01-01", date_end="2005-12-31",
state="CO",
min_num_obs=365)
# View first five records
data_df.head(5)
[10]:
| | date | 06614800 | 06620000 | 06701500 | 06701900 | 06707500 | 06708800 | 06709000 | 06709530 | 06710150 | ... | 382628104493700 | 382629104493000 | 383619104520401 | 383637104531301 | 383944104474201 | 384037104472001 | 384047104510301 | 384048104504901 | 384220104503701 | 391504106225200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-01-01 | 0.013584 | 3.5375 | 1.9244 | 2.19325 | 5.2921 | 0.163574 | 0.52072 | 0.55751 | 0.056600 | ... | 0.0 | 0.0 | 0.002547 | 0.0 | 0.0 | 0.021508 | 0.0 | 0.008490 | 0.0 | 0.004245 |
| 1 | 2005-01-02 | 0.013301 | 3.3960 | 1.9244 | 2.14514 | 5.2072 | 0.144896 | 0.48110 | 0.53770 | 0.052355 | ... | 0.0 | 0.0 | 0.002547 | 0.0 | 0.0 | 0.024621 | 0.0 | 0.008207 | 0.0 | 0.004245 |
| 2 | 2005-01-03 | 0.013301 | 3.3111 | 1.9244 | 2.15080 | 5.1506 | 0.128765 | 0.49525 | 0.50374 | 0.058015 | ... | 0.0 | 0.0 | 0.002547 | 0.0 | 0.0 | 0.023772 | 0.0 | 0.007924 | 0.0 | 0.004245 |
| 3 | 2005-01-04 | 0.013301 | 3.3960 | 1.9244 | 2.15080 | 5.0091 | 0.119992 | 0.48110 | 0.48110 | 0.051506 | ... | 0.0 | 0.0 | 0.002547 | 0.0 | 0.0 | 0.025470 | 0.0 | 0.007924 | 0.0 | 0.004245 |
| 4 | 2005-01-05 | 0.013301 | 3.3960 | 1.9244 | 2.23853 | 4.1035 | 0.139236 | 0.41601 | 0.50374 | 0.046412 | ... | 0.0 | 0.0 | 0.002547 | 0.0 | 0.0 | 0.022923 | 0.0 | 0.007924 | 0.0 | 0.004245 |
5 rows × 269 columns
[11]:
# NOTE: Metadata access does not support the `min_num_obs` filter because it does not inspect the data contents for the sliced date range.
# Metadata access only filters on overall data availability to be within the specified range.
# The following is an example workflow for obtaining metadata for only those sites that
# additionally satisfy the `min_num_obs` filter
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2005-01-01", date_end="2005-12-31",
state="CO")
c = list(data_df.columns)
c.remove('date')
filtered_site_list = pd.DataFrame(data=c, columns=['site_id'])
filtered_metadata_df = pd.merge(filtered_site_list, metadata_df, on='site_id', how='left')
assert len(filtered_metadata_df) == data_df.shape[1]-1
# View first five records
filtered_metadata_df.head()
[11]:
| | site_id | site_name | site_type | agency | state | latitude | longitude | first_date_data_available | last_date_data_available | record_count | ... | doi | huc8 | conus1_x | conus1_y | conus2_x | conus2_y | gagesii_drainage_area | gagesii_class | gagesii_site_elevation | usgs_drainage_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 06614800 | MICHIGAN RIVER NEAR CAMERON PASS, CO | stream gauge | USGS | CO | 40.496094 | -105.865012 | 1973-10-01 | 2023-12-01 | 18322 | ... | None | 10180001 | 1054 | 818 | 1481 | 1764 | 4.0284 | Ref | 3188.0 | 1.54 |
| 1 | 06620000 | NORTH PLATTE RIVER NEAR NORTHGATE, CO | stream gauge | USGS | CO | 40.936639 | -106.339194 | 1904-06-01 | 2023-12-01 | 39782 | ... | None | 10180001 | 1020 | 870 | 1448 | 1817 | 3702.6370 | Non-ref | 2388.0 | 1431.00 |
| 2 | 06701500 | SOUTH PLATTE RIVER BELOW CHEESMAN LAKE, CO | stream gauge | USGS | CO | 39.209157 | -105.267773 | 1924-10-01 | 2007-09-29 | 29217 | ... | None | 10190002 | 1091 | 671 | nan | nan | 4557.0680 | Non-ref | 2081.0 | 1752.00 |
| 3 | 06701900 | SOUTH PLATTE RIVER BLW BRUSH CRK NEAR TRUMBULL... | stream gauge | USGS | CO | 39.259990 | -105.221938 | 2002-07-19 | 2023-12-01 | 7792 | ... | None | 10190002 | nan | nan | 1523 | 1627 | 5252.5570 | Non-ref | 1990.0 | 2028.00 |
| 4 | 06707500 | SOUTH PLATTE RIVER AT SOUTH PLATTE, CO | stream gauge | USGS | CO | 39.409156 | -105.169990 | 1896-01-01 | 2007-09-29 | 32959 | ... | None | 10190002 | nan | nan | nan | nan | 6689.0300 | Non-ref | 1901.0 | 2579.00 |
5 rows × 23 columns
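An alternative to the left merge in the workflow above is a boolean filter with `isin`, sketched here on small hand-built frames so it runs without the service. The frame names `toy_data_df` and `toy_metadata_df` are hypothetical stand-ins for the real query results; the site IDs are copied from the output above.

```python
import pandas as pd

# Stand-in for the filtered data result: 'date' plus one column per site.
toy_data_df = pd.DataFrame({
    "date": ["2005-01-01"],
    "06614800": [0.013584],
    "06620000": [3.5375],
})

# Stand-in for the unfiltered metadata result, which has one extra site.
toy_metadata_df = pd.DataFrame({
    "site_id": ["06614800", "06620000", "06701500"],
    "state": ["CO", "CO", "CO"],
})

# Every data column except 'date' is a site ID.
site_ids = [col for col in toy_data_df.columns if col != "date"]

# Keep only metadata rows whose site appears in the data.
filtered = toy_metadata_df[toy_metadata_df["site_id"].isin(site_ids)].reset_index(drop=True)

# Report any data sites that have no metadata row at all.
missing = sorted(set(site_ids) - set(toy_metadata_df["site_id"]))
print(filtered)
print("Sites without metadata:", missing)
```

The two approaches differ when a site is absent from the metadata: the left merge keeps that site as a row of missing values, while `isin` silently drops it, which is why this sketch reports the `missing` list explicitly.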
[12]:
# See how to cite the use of this data
get_citations(dataset="usgs_nwis")
[12]:
'Most U.S. Geological Survey (USGS) information resides in Public Domain and may be used without restriction, though they do ask that proper credit be given. An example credit statement would be: "(Product or data name) courtesy of the U.S. Geological Survey". Source: https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs'