Save and wrangle point observations data
To launch this notebook interactively in a Jupyter notebook-like browser interface, please click the “Launch Binder” button below. Note that Binder may take several minutes to launch.
The hf_hydrodata get_point_data and get_point_metadata functions return data in pandas DataFrames. This notebook goes through some common tasks using pandas, such as saving to a .csv file, saving to a NetCDF file, creating a new variable, and slicing out a particular value. For a more comprehensive introduction to working with data in pandas, please see their 10 minutes to
pandas introduction or Coming From.. documentation to see comparisons to working in R, SQL, Excel, Stata, or SAS.
Please see the hf_hydrodata documentation for information on what data is available, our data collection process, and new features we are working on! Our Metadata Description page itemizes the fields that get returned from get_point_metadata.
[1]:
# Import packages
import pandas as pd
import xarray as xr
import numpy as np
from hf_hydrodata import register_api_pin, get_point_data, get_point_metadata
[ ]:
# You need to register on https://hydrogen.princeton.edu/pin
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")
Example 1: Working with pandas DataFrames
In this first example, we will showcase several common pandas commands that can be used to inspect a DataFrame.
Note that get_point_data and get_point_metadata require mandatory parameters of dataset, variable, temporal_resolution, and aggregation (and depth_level if asking for soil moisture data). Please see the documentation for information about what point observation datasets are available and the parameters used to query them.
The hf_hydrodata API Reference includes information on what optional filtering parameters are available. These include filters for things like a geographic region or date range. Those parameters work cumulatively, so if state and site_ids are both supplied, for example, then only sites within site_ids that are also in state will be returned.
[2]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.
# Get observations data and site-level metadata
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05",
latitude_range=(45, 50), longitude_range=(-75, -50))
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05",
latitude_range=(45, 50), longitude_range=(-75, -50))
First we will explore pandas’ head method for DataFrames. head will display the first n rows of the DataFrame, with the default to show the first 5 rows.
[3]:
# The default is to show the first 5 rows.
metadata_df.head()
[3]:
| site_id | site_name | site_type | agency | state | latitude | longitude | first_date_data_available | last_date_data_available | record_count | ... | doi | huc8 | conus1_x | conus1_y | conus2_x | conus2_y | gagesii_drainage_area | gagesii_class | gagesii_site_elevation | usgs_drainage_area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01011000 | Allagash River near Allagash, Maine | stream gauge | USGS | ME | 47.069722 | -69.079444 | 1910-07-01 | 2023-11-30 | 34028 | ... | None | 01010002 | nan | nan | 4210 | 2783 | 3186.8440 | Non-ref | 187.0 | 1478.00 |
| 1 | 01013500 | Fish River near Fort Kent, Maine | stream gauge | USGS | ME | 47.237500 | -68.582778 | 1903-07-29 | 2023-12-01 | 36507 | ... | None | 01010003 | nan | nan | 4237 | 2810 | 2252.6960 | Ref | 157.0 | 873.00 |
| 2 | 01015800 | Aroostook River near Masardis, Maine | stream gauge | USGS | ME | 46.523056 | -68.371667 | 1957-09-14 | 2023-12-01 | 24185 | ... | None | 01010004 | nan | nan | 4276 | 2747 | 2313.7550 | Non-ref | 166.0 | 892.00 |
| 3 | 01017000 | Aroostook River at Washburn, Maine | stream gauge | USGS | ME | 46.777222 | -68.157222 | 1930-08-01 | 2023-12-01 | 34091 | ... | None | 01010004 | nan | nan | 4281 | 2773 | 4278.9070 | Non-ref | 131.0 | 1654.00 |
| 4 | 01017550 | Williams Brook at Phair, Maine | stream gauge | USGS | ME | 46.628056 | -67.953056 | 1999-11-01 | 2023-12-01 | 8797 | ... | None | 01010005 | nan | nan | 4300 | 2762 | 10.0323 | Ref | 176.0 | 3.82 |
5 rows × 23 columns
[4]:
# However, a user can specify the number of rows they'd like to see.
# Here we are showing the first three records.
metadata_df.head(3)
[4]:
| site_id | site_name | site_type | agency | state | latitude | longitude | first_date_data_available | last_date_data_available | record_count | ... | doi | huc8 | conus1_x | conus1_y | conus2_x | conus2_y | gagesii_drainage_area | gagesii_class | gagesii_site_elevation | usgs_drainage_area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01011000 | Allagash River near Allagash, Maine | stream gauge | USGS | ME | 47.069722 | -69.079444 | 1910-07-01 | 2023-11-30 | 34028 | ... | None | 01010002 | nan | nan | 4210 | 2783 | 3186.844 | Non-ref | 187.0 | 1478.0 |
| 1 | 01013500 | Fish River near Fort Kent, Maine | stream gauge | USGS | ME | 47.237500 | -68.582778 | 1903-07-29 | 2023-12-01 | 36507 | ... | None | 01010003 | nan | nan | 4237 | 2810 | 2252.696 | Ref | 157.0 | 873.0 |
| 2 | 01015800 | Aroostook River near Masardis, Maine | stream gauge | USGS | ME | 46.523056 | -68.371667 | 1957-09-14 | 2023-12-01 | 24185 | ... | None | 01010004 | nan | nan | 4276 | 2747 | 2313.755 | Non-ref | 166.0 | 892.0 |
3 rows × 23 columns
Next, we will use the shape method to get information on the dimensions of a DataFrame. We can use this to answer questions like: how many sites were returned from my query parameters?
[5]:
print(f"DataFrame dimensions for metadata_df: {metadata_df.shape}")
print(f"Number of rows (sites) in metadata_df: {metadata_df.shape[0]}")
print(f"Number of columns (attributes) in metadata_df: {metadata_df.shape[1]}")
DataFrame dimensions for metadata_df: (31, 23)
Number of rows (sites) in metadata_df: 31
Number of columns (attributes) in metadata_df: 23
DataFrames have a method called columns that contains a list of all of the columns names. Notice how in metadata_df there is a set of ellipses (…) when we use the head method. These represent a whole set of columns that are not explicitly named or previewed. We can use columns to see a list of all of the columns that are in a given DataFrame.
[6]:
print(list(metadata_df.columns))
['site_id', 'site_name', 'site_type', 'agency', 'state', 'latitude', 'longitude', 'first_date_data_available', 'last_date_data_available', 'record_count', 'site_query_url', 'date_metadata_last_updated', 'tz_cd', 'doi', 'huc8', 'conus1_x', 'conus1_y', 'conus2_x', 'conus2_y', 'gagesii_drainage_area', 'gagesii_class', 'gagesii_site_elevation', 'usgs_drainage_area']
Finally, let’s extract one of the columns from the DataFrame returned by get_point_metadata. We’ll use the .loc() method for this.
[7]:
# Let's extract just the site ID and USGS drainage area field.
metadata_df.loc[:, ['site_id', 'usgs_drainage_area']]
[7]:
| site_id | usgs_drainage_area | |
|---|---|---|
| 0 | 01011000 | 1478.00 |
| 1 | 01013500 | 873.00 |
| 2 | 01015800 | 892.00 |
| 3 | 01017000 | 1654.00 |
| 4 | 01017550 | 3.82 |
| 5 | 01018000 | 175.00 |
| 6 | 01019000 | 228.30 |
| 7 | 01027200 | 232.00 |
| 8 | 01029200 | 173.00 |
| 9 | 01029500 | 837.00 |
| 10 | 01030500 | 1418.00 |
| 11 | 01031300 | 118.00 |
| 12 | 01031450 | 95.40 |
| 13 | 01031500 | 298.00 |
| 14 | 01033000 | 326.00 |
| 15 | 01034000 | 1162.00 |
| 16 | 01034500 | 6422.00 |
| 17 | 01042500 | 1590.00 |
| 18 | 01043500 | 516.00 |
| 19 | 01044550 | 193.00 |
| 20 | 01046000 | 90.00 |
| 21 | 01046500 | 2715.00 |
| 22 | 01129200 | 254.00 |
| 23 | 01010000 | 1341.00 |
| 24 | 01010070 | 171.00 |
| 25 | 01010500 | 2680.00 |
| 26 | 01014000 | 5929.00 |
| 27 | 01018500 | 413.00 |
| 28 | 01021000 | 1374.00 |
| 29 | 04264331 | 298800.00 |
| 30 | 04294300 | 34.50 |
Example 2: Save data to .csv
In this example, we will show how to use pandas’ to_csv method to save a DataFrame into a .csv file.
[8]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.
# Get observations data and site-level metadata
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05",
latitude_range=(45, 50), longitude_range=(-75, -50))
metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05",
latitude_range=(45, 50), longitude_range=(-75, -50))
[9]:
# The following saves these files as .csv files in the same directory as this notebook.
# You may provide a full file path to save the file(s) elsewhere in your local directory.
# We will set index=False to not have the DataFrame index written to a column.
data_df.to_csv('streamflow_observations.csv', index=False)
metadata_df.to_csv('streamflow_site_metadata.csv', index=False)
Example 3: Save data to NetCDF
Using the same query parameters as in the prior examples, here we will use xarray to construct an xarray Dataset from our point observations pandas DataFrame. We will then use xarray’s built-in to_netcdf method to save the Dataset into a NetCDF file.
[10]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.
# Get observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
date_start="2002-01-01", date_end="2002-01-05",
latitude_range=(45, 50), longitude_range=(-75, -50))
# Inspect first five rows
data_df.head(5)
[10]:
| date | 01011000 | 01013500 | 01015800 | 01017000 | 01017550 | 01018000 | 01019000 | 01027200 | 01029200 | ... | 01046500 | 01129200 | 01010000 | 01010070 | 01010500 | 01014000 | 01018500 | 01021000 | 04264331 | 04294300 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-01-01 | 9.7069 | 13.8104 | 12.9048 | 21.3099 | 0.013301 | NaN | 3.0847 | 1.98666 | 2.43663 | ... | 46.129 | 23.9984 | 11.9143 | 1.48292 | 24.0550 | 61.411 | 9.1126 | 21.9042 | 6084.5 | 0.2547 |
| 1 | 2002-01-02 | 9.5371 | 13.4142 | 12.0558 | 20.0364 | 0.012169 | NaN | 3.0564 | 1.91874 | 2.39135 | ... | 46.695 | 23.8286 | 11.6879 | 1.41500 | 23.4890 | 59.713 | 9.0277 | 21.9042 | 6056.2 | 0.2547 |
| 2 | 2002-01-03 | 9.3390 | 13.0746 | 11.5181 | 19.0742 | 0.011886 | NaN | 3.0281 | 1.88195 | 2.36305 | ... | 46.978 | 23.8286 | 11.5181 | 1.35840 | 23.0645 | 58.581 | 8.9145 | 21.9042 | 6084.5 | 0.2547 |
| 3 | 2002-01-04 | 9.1692 | 12.6501 | 11.0936 | 26.4322 | 0.011320 | NaN | 3.0564 | 1.83667 | 2.34890 | ... | 51.506 | 23.6305 | 11.2917 | 1.31312 | 22.6400 | 57.449 | 8.8579 | 21.9042 | 6056.2 | 0.2547 |
| 4 | 2002-01-05 | 8.9994 | 12.2822 | 10.6691 | 25.1870 | 0.010754 | NaN | 3.0281 | 1.79139 | 2.32060 | ... | 37.639 | 23.6022 | 11.0936 | 1.27633 | 22.2155 | 56.317 | 8.7447 | 21.9042 | 5546.8 | 0.2830 |
5 rows × 32 columns
[11]:
# Construct xarray Dataset from information in pandas DataFrame
xr_data = np.array(data_df.iloc[:, 1:]) # remove 'date' column
site_list = list(data_df.columns)[1:]
date_list = list(data_df['date'])
data_ds = xr.Dataset(data_vars=dict(streamflow=(['date', 'site'], xr_data)),
coords=dict(site=site_list,
date=date_list))
data_ds
[11]:
<xarray.Dataset>
Dimensions: (date: 5, site: 31)
Coordinates:
* site (site) <U8 '01011000' '01013500' ... '04264331' '04294300'
* date (date) <U10 '2002-01-01' '2002-01-02' ... '2002-01-05'
Data variables:
streamflow (date, site) float64 9.707 13.81 12.9 ... 21.9 5.547e+03 0.283[12]:
# The following saves this file as a .nc files in the same directory as this notebook.
# You may provide a full file path to save the file(s) elsewhere in your local directory.
data_ds.to_netcdf('streamflow_observations.nc')