Save and wrangle point observations data

To launch this notebook interactively in a Jupyter notebook-like browser interface, please click the “Launch Binder” button below. Note that Binder may take several minutes to launch.

Binder

The hf_hydrodata get_point_data and get_point_metadata functions return data in pandas DataFrames. This notebook goes through some common tasks using pandas, such as saving to a .csv file, saving to a NetCDF file, creating a new variable, and slicing out a particular value. For a more comprehensive introduction to working with data in pandas, please see their 10 minutes to pandas introduction or Coming From.. documentation to see comparisons to working in R, SQL, Excel, Stata, or SAS.

Please see the hf_hydrodata documentation for information on what data is available, our data collection process, and new features we are working on! Our Metadata Description page itemizes the fields that get returned from get_point_metadata.

[1]:
# Import packages
import pandas as pd
import xarray as xr
import numpy as np
from hf_hydrodata import register_api_pin, get_point_data, get_point_metadata
[ ]:
# You need to register on https://hydrogen.princeton.edu/pin
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")

Example 1: Working with pandas DataFrames

In this first example, we will showcase several common pandas commands that can be used to inspect a DataFrame.

Note that get_point_data and get_point_metadata require mandatory parameters of dataset, variable, temporal_resolution, and aggregation (and depth_level if asking for soil moisture data). Please see the documentation for information about what point observation datasets are available and the parameters used to query them.

The hf_hydrodata API Reference includes information on what optional filtering parameters are available. These include filters for things like a geographic region or date range. Those parameters work cumulatively, so if state and site_ids are both supplied, for example, then only sites within site_ids that are also in state will be returned.

[2]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Get observations data and site-level metadata
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05",
                         latitude_range=(45, 50), longitude_range=(-75, -50))

metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2002-01-01", date_end="2002-01-05",
                                 latitude_range=(45, 50), longitude_range=(-75, -50))

First we will explore pandas’ head method for DataFrames. head will display the first n rows of the DataFrame, with the default to show the first 5 rows.

[3]:
# The default is to show the first 5 rows.
metadata_df.head()
[3]:
site_id site_name site_type agency state latitude longitude first_date_data_available last_date_data_available record_count ... doi huc8 conus1_x conus1_y conus2_x conus2_y gagesii_drainage_area gagesii_class gagesii_site_elevation usgs_drainage_area
0 01011000 Allagash River near Allagash, Maine stream gauge USGS ME 47.069722 -69.079444 1910-07-01 2023-11-30 34028 ... None 01010002 nan nan 4210 2783 3186.8440 Non-ref 187.0 1478.00
1 01013500 Fish River near Fort Kent, Maine stream gauge USGS ME 47.237500 -68.582778 1903-07-29 2023-12-01 36507 ... None 01010003 nan nan 4237 2810 2252.6960 Ref 157.0 873.00
2 01015800 Aroostook River near Masardis, Maine stream gauge USGS ME 46.523056 -68.371667 1957-09-14 2023-12-01 24185 ... None 01010004 nan nan 4276 2747 2313.7550 Non-ref 166.0 892.00
3 01017000 Aroostook River at Washburn, Maine stream gauge USGS ME 46.777222 -68.157222 1930-08-01 2023-12-01 34091 ... None 01010004 nan nan 4281 2773 4278.9070 Non-ref 131.0 1654.00
4 01017550 Williams Brook at Phair, Maine stream gauge USGS ME 46.628056 -67.953056 1999-11-01 2023-12-01 8797 ... None 01010005 nan nan 4300 2762 10.0323 Ref 176.0 3.82

5 rows × 23 columns

[4]:
# However, a user can specify the number of rows they'd like to see.
# Here we are showing the first three records.
metadata_df.head(3)
[4]:
site_id site_name site_type agency state latitude longitude first_date_data_available last_date_data_available record_count ... doi huc8 conus1_x conus1_y conus2_x conus2_y gagesii_drainage_area gagesii_class gagesii_site_elevation usgs_drainage_area
0 01011000 Allagash River near Allagash, Maine stream gauge USGS ME 47.069722 -69.079444 1910-07-01 2023-11-30 34028 ... None 01010002 nan nan 4210 2783 3186.844 Non-ref 187.0 1478.0
1 01013500 Fish River near Fort Kent, Maine stream gauge USGS ME 47.237500 -68.582778 1903-07-29 2023-12-01 36507 ... None 01010003 nan nan 4237 2810 2252.696 Ref 157.0 873.0
2 01015800 Aroostook River near Masardis, Maine stream gauge USGS ME 46.523056 -68.371667 1957-09-14 2023-12-01 24185 ... None 01010004 nan nan 4276 2747 2313.755 Non-ref 166.0 892.0

3 rows × 23 columns

Next, we will use the shape method to get information on the dimensions of a DataFrame. We can use this to answer questions like: how many sites were returned from my query parameters?

[5]:
print(f"DataFrame dimensions for metadata_df: {metadata_df.shape}")
print(f"Number of rows (sites) in metadata_df: {metadata_df.shape[0]}")
print(f"Number of columns (attributes) in metadata_df: {metadata_df.shape[1]}")
DataFrame dimensions for metadata_df: (31, 23)
Number of rows (sites) in metadata_df: 31
Number of columns (attributes) in metadata_df: 23

DataFrames have a method called columns that contains a list of all of the columns names. Notice how in metadata_df there is a set of ellipses (…) when we use the head method. These represent a whole set of columns that are not explicitly named or previewed. We can use columns to see a list of all of the columns that are in a given DataFrame.

[6]:
print(list(metadata_df.columns))
['site_id', 'site_name', 'site_type', 'agency', 'state', 'latitude', 'longitude', 'first_date_data_available', 'last_date_data_available', 'record_count', 'site_query_url', 'date_metadata_last_updated', 'tz_cd', 'doi', 'huc8', 'conus1_x', 'conus1_y', 'conus2_x', 'conus2_y', 'gagesii_drainage_area', 'gagesii_class', 'gagesii_site_elevation', 'usgs_drainage_area']

Finally, let’s extract one of the columns from the DataFrame returned by get_point_metadata. We’ll use the .loc() method for this.

[7]:
# Let's extract just the site ID and USGS drainage area field.
metadata_df.loc[:, ['site_id', 'usgs_drainage_area']]
[7]:
site_id usgs_drainage_area
0 01011000 1478.00
1 01013500 873.00
2 01015800 892.00
3 01017000 1654.00
4 01017550 3.82
5 01018000 175.00
6 01019000 228.30
7 01027200 232.00
8 01029200 173.00
9 01029500 837.00
10 01030500 1418.00
11 01031300 118.00
12 01031450 95.40
13 01031500 298.00
14 01033000 326.00
15 01034000 1162.00
16 01034500 6422.00
17 01042500 1590.00
18 01043500 516.00
19 01044550 193.00
20 01046000 90.00
21 01046500 2715.00
22 01129200 254.00
23 01010000 1341.00
24 01010070 171.00
25 01010500 2680.00
26 01014000 5929.00
27 01018500 413.00
28 01021000 1374.00
29 04264331 298800.00
30 04294300 34.50

Example 2: Save data to .csv

In this example, we will show how to use pandas’ to_csv method to save a DataFrame into a .csv file.

[8]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Get observations data and site-level metadata
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05",
                         latitude_range=(45, 50), longitude_range=(-75, -50))

metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2002-01-01", date_end="2002-01-05",
                                 latitude_range=(45, 50), longitude_range=(-75, -50))
[9]:
# The following saves these files as .csv files in the same directory as this notebook.
# You may provide a full file path to save the file(s) elsewhere in your local directory.
# We will set index=False to not have the DataFrame index written to a column.
data_df.to_csv('streamflow_observations.csv', index=False)
metadata_df.to_csv('streamflow_site_metadata.csv', index=False)

Example 3: Save data to NetCDF

Using the same query parameters as in the prior examples, here we will use xarray to construct an xarray Dataset from our point observations pandas DataFrame. We will then use xarray’s built-in to_netcdf method to save the Dataset into a NetCDF file.

[10]:
# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Get observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05",
                         latitude_range=(45, 50), longitude_range=(-75, -50))

# Inspect first five rows
data_df.head(5)
[10]:
date 01011000 01013500 01015800 01017000 01017550 01018000 01019000 01027200 01029200 ... 01046500 01129200 01010000 01010070 01010500 01014000 01018500 01021000 04264331 04294300
0 2002-01-01 9.7069 13.8104 12.9048 21.3099 0.013301 NaN 3.0847 1.98666 2.43663 ... 46.129 23.9984 11.9143 1.48292 24.0550 61.411 9.1126 21.9042 6084.5 0.2547
1 2002-01-02 9.5371 13.4142 12.0558 20.0364 0.012169 NaN 3.0564 1.91874 2.39135 ... 46.695 23.8286 11.6879 1.41500 23.4890 59.713 9.0277 21.9042 6056.2 0.2547
2 2002-01-03 9.3390 13.0746 11.5181 19.0742 0.011886 NaN 3.0281 1.88195 2.36305 ... 46.978 23.8286 11.5181 1.35840 23.0645 58.581 8.9145 21.9042 6084.5 0.2547
3 2002-01-04 9.1692 12.6501 11.0936 26.4322 0.011320 NaN 3.0564 1.83667 2.34890 ... 51.506 23.6305 11.2917 1.31312 22.6400 57.449 8.8579 21.9042 6056.2 0.2547
4 2002-01-05 8.9994 12.2822 10.6691 25.1870 0.010754 NaN 3.0281 1.79139 2.32060 ... 37.639 23.6022 11.0936 1.27633 22.2155 56.317 8.7447 21.9042 5546.8 0.2830

5 rows × 32 columns

[11]:
# Construct xarray Dataset from information in pandas DataFrame
xr_data = np.array(data_df.iloc[:, 1:]) # remove 'date' column
site_list = list(data_df.columns)[1:]
date_list = list(data_df['date'])

data_ds = xr.Dataset(data_vars=dict(streamflow=(['date', 'site'], xr_data)),
                     coords=dict(site=site_list,
                                 date=date_list))

data_ds
[11]:
<xarray.Dataset>
Dimensions:     (date: 5, site: 31)
Coordinates:
  * site        (site) <U8 '01011000' '01013500' ... '04264331' '04294300'
  * date        (date) <U10 '2002-01-01' '2002-01-02' ... '2002-01-05'
Data variables:
    streamflow  (date, site) float64 9.707 13.81 12.9 ... 21.9 5.547e+03 0.283
[12]:
# The following saves this file as a .nc files in the same directory as this notebook.
# You may provide a full file path to save the file(s) elsewhere in your local directory.
data_ds.to_netcdf('streamflow_observations.nc')