Save and wrangle point observations data

To launch this notebook interactively in a Jupyter notebook-like browser interface, please click the “Launch Binder” button below. Note that Binder may take several minutes to launch.

The hf_hydrodata get_point_data and get_point_metadata functions return data in pandas DataFrames. This notebook goes through some common tasks using pandas, such as saving to a .csv file, saving to a NetCDF file, creating a new variable, and slicing out a particular value. For a more comprehensive introduction to working with data in pandas, please see their 10 minutes to pandas introduction or Coming From.. documentation to see comparisons to working in R, SQL, Excel, Stata, or SAS.

Please see the hf_hydrodata documentation for information on what data is available, our data collection process, and new features we are working on! Our Metadata Description page itemizes the fields that get returned from get_point_metadata.

[1]:

# Import packages
import pandas as pd
import xarray as xr
import numpy as np
from hf_hydrodata import register_api_pin, get_point_data, get_point_metadata

[ ]:

# You need to register on https://hydrogen.princeton.edu/pin
# and run the following with your registered information
# before you can use the hydrodata utilities
register_api_pin("your_email", "your_pin")

Example 1: Working with pandas DataFrames

In this first example, we will showcase several common pandas commands that can be used to inspect a DataFrame.

Note that get_point_data and get_point_metadata require mandatory parameters of dataset, variable, temporal_resolution, and aggregation (and depth_level if asking for soil moisture data). Please see the documentation for information about what point observation datasets are available and the parameters used to query them.

The hf_hydrodata API Reference includes information on what optional filtering parameters are available. These include filters for things like a geographic region or date range. Those parameters work cumulatively, so if state and site_ids are both supplied, for example, then only sites within site_ids that are also in state will be returned.

[2]:

# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Get observations data and site-level metadata
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05",
                         latitude_range=(45, 50), longitude_range=(-75, -50))

metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2002-01-01", date_end="2002-01-05",
                                 latitude_range=(45, 50), longitude_range=(-75, -50))

First we will explore pandas’ head method for DataFrames. head will display the first n rows of the DataFrame, with the default to show the first 5 rows.

[3]:

# The default is to show the first 5 rows.
metadata_df.head()

[3]:

	site_id	site_name	site_type	agency	state	latitude	longitude	first_date_data_available	last_date_data_available	record_count	...	doi	huc8	conus1_x	conus1_y	conus2_x	conus2_y	gagesii_drainage_area	gagesii_class	gagesii_site_elevation	usgs_drainage_area
0	01011000	Allagash River near Allagash, Maine	stream gauge	USGS	ME	47.069722	-69.079444	1910-07-01	2023-11-30	34028	...	None	01010002	nan	nan	4210	2783	3186.8440	Non-ref	187.0	1478.00
1	01013500	Fish River near Fort Kent, Maine	stream gauge	USGS	ME	47.237500	-68.582778	1903-07-29	2023-12-01	36507	...	None	01010003	nan	nan	4237	2810	2252.6960	Ref	157.0	873.00
2	01015800	Aroostook River near Masardis, Maine	stream gauge	USGS	ME	46.523056	-68.371667	1957-09-14	2023-12-01	24185	...	None	01010004	nan	nan	4276	2747	2313.7550	Non-ref	166.0	892.00
3	01017000	Aroostook River at Washburn, Maine	stream gauge	USGS	ME	46.777222	-68.157222	1930-08-01	2023-12-01	34091	...	None	01010004	nan	nan	4281	2773	4278.9070	Non-ref	131.0	1654.00
4	01017550	Williams Brook at Phair, Maine	stream gauge	USGS	ME	46.628056	-67.953056	1999-11-01	2023-12-01	8797	...	None	01010005	nan	nan	4300	2762	10.0323	Ref	176.0	3.82

5 rows × 23 columns

[4]:

# However, a user can specify the number of rows they'd like to see.
# Here we are showing the first three records.
metadata_df.head(3)

[4]:

	site_id	site_name	site_type	agency	state	latitude	longitude	first_date_data_available	last_date_data_available	record_count	...	doi	huc8	conus1_x	conus1_y	conus2_x	conus2_y	gagesii_drainage_area	gagesii_class	gagesii_site_elevation	usgs_drainage_area
0	01011000	Allagash River near Allagash, Maine	stream gauge	USGS	ME	47.069722	-69.079444	1910-07-01	2023-11-30	34028	...	None	01010002	nan	nan	4210	2783	3186.844	Non-ref	187.0	1478.0
1	01013500	Fish River near Fort Kent, Maine	stream gauge	USGS	ME	47.237500	-68.582778	1903-07-29	2023-12-01	36507	...	None	01010003	nan	nan	4237	2810	2252.696	Ref	157.0	873.0
2	01015800	Aroostook River near Masardis, Maine	stream gauge	USGS	ME	46.523056	-68.371667	1957-09-14	2023-12-01	24185	...	None	01010004	nan	nan	4276	2747	2313.755	Non-ref	166.0	892.0

3 rows × 23 columns

Next, we will use the shape method to get information on the dimensions of a DataFrame. We can use this to answer questions like: how many sites were returned from my query parameters?

[5]:

print(f"DataFrame dimensions for metadata_df: {metadata_df.shape}")
print(f"Number of rows (sites) in metadata_df: {metadata_df.shape[0]}")
print(f"Number of columns (attributes) in metadata_df: {metadata_df.shape[1]}")

DataFrame dimensions for metadata_df: (31, 23)
Number of rows (sites) in metadata_df: 31
Number of columns (attributes) in metadata_df: 23

DataFrames have a method called columns that contains a list of all of the columns names. Notice how in metadata_df there is a set of ellipses (…) when we use the head method. These represent a whole set of columns that are not explicitly named or previewed. We can use columns to see a list of all of the columns that are in a given DataFrame.

[6]:

print(list(metadata_df.columns))

['site_id', 'site_name', 'site_type', 'agency', 'state', 'latitude', 'longitude', 'first_date_data_available', 'last_date_data_available', 'record_count', 'site_query_url', 'date_metadata_last_updated', 'tz_cd', 'doi', 'huc8', 'conus1_x', 'conus1_y', 'conus2_x', 'conus2_y', 'gagesii_drainage_area', 'gagesii_class', 'gagesii_site_elevation', 'usgs_drainage_area']

Finally, let’s extract one of the columns from the DataFrame returned by get_point_metadata. We’ll use the .loc() method for this.

[7]:

# Let's extract just the site ID and USGS drainage area field.
metadata_df.loc[:, ['site_id', 'usgs_drainage_area']]

[7]:

	site_id	usgs_drainage_area
0	01011000	1478.00
1	01013500	873.00
2	01015800	892.00
3	01017000	1654.00
4	01017550	3.82
5	01018000	175.00
6	01019000	228.30
7	01027200	232.00
8	01029200	173.00
9	01029500	837.00
10	01030500	1418.00
11	01031300	118.00
12	01031450	95.40
13	01031500	298.00
14	01033000	326.00
15	01034000	1162.00
16	01034500	6422.00
17	01042500	1590.00
18	01043500	516.00
19	01044550	193.00
20	01046000	90.00
21	01046500	2715.00
22	01129200	254.00
23	01010000	1341.00
24	01010070	171.00
25	01010500	2680.00
26	01014000	5929.00
27	01018500	413.00
28	01021000	1374.00
29	04264331	298800.00
30	04294300	34.50

Example 2: Save data to .csv

In this example, we will show how to use pandas’ to_csv method to save a DataFrame into a .csv file.

[8]:

# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Get observations data and site-level metadata
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05",
                         latitude_range=(45, 50), longitude_range=(-75, -50))

metadata_df = get_point_metadata(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                                 date_start="2002-01-01", date_end="2002-01-05",
                                 latitude_range=(45, 50), longitude_range=(-75, -50))

[9]:

# The following saves these files as .csv files in the same directory as this notebook.
# You may provide a full file path to save the file(s) elsewhere in your local directory.
# We will set index=False to not have the DataFrame index written to a column.
data_df.to_csv('streamflow_observations.csv', index=False)
metadata_df.to_csv('streamflow_site_metadata.csv', index=False)

Example 3: Save data to NetCDF

Using the same query parameters as in the prior examples, here we will use xarray to construct an xarray Dataset from our point observations pandas DataFrame. We will then use xarray’s built-in to_netcdf method to save the Dataset into a NetCDF file.

[10]:

# Let's explore daily streamflow data with optional filters for a date range and bounding box.

# Get observations data
data_df = get_point_data(dataset="usgs_nwis", variable="streamflow", temporal_resolution="daily", aggregation="mean",
                         date_start="2002-01-01", date_end="2002-01-05",
                         latitude_range=(45, 50), longitude_range=(-75, -50))

# Inspect first five rows
data_df.head(5)

[10]:

	date	01011000	01013500	01015800	01017000	01017550	01018000	01019000	01027200	01029200	...	01046500	01129200	01010000	01010070	01010500	01014000	01018500	01021000	04264331	04294300
0	2002-01-01	9.7069	13.8104	12.9048	21.3099	0.013301	NaN	3.0847	1.98666	2.43663	...	46.129	23.9984	11.9143	1.48292	24.0550	61.411	9.1126	21.9042	6084.5	0.2547
1	2002-01-02	9.5371	13.4142	12.0558	20.0364	0.012169	NaN	3.0564	1.91874	2.39135	...	46.695	23.8286	11.6879	1.41500	23.4890	59.713	9.0277	21.9042	6056.2	0.2547
2	2002-01-03	9.3390	13.0746	11.5181	19.0742	0.011886	NaN	3.0281	1.88195	2.36305	...	46.978	23.8286	11.5181	1.35840	23.0645	58.581	8.9145	21.9042	6084.5	0.2547
3	2002-01-04	9.1692	12.6501	11.0936	26.4322	0.011320	NaN	3.0564	1.83667	2.34890	...	51.506	23.6305	11.2917	1.31312	22.6400	57.449	8.8579	21.9042	6056.2	0.2547
4	2002-01-05	8.9994	12.2822	10.6691	25.1870	0.010754	NaN	3.0281	1.79139	2.32060	...	37.639	23.6022	11.0936	1.27633	22.2155	56.317	8.7447	21.9042	5546.8	0.2830

5 rows × 32 columns

[11]:

# Construct xarray Dataset from information in pandas DataFrame
xr_data = np.array(data_df.iloc[:, 1:]) # remove 'date' column
site_list = list(data_df.columns)[1:]
date_list = list(data_df['date'])

data_ds = xr.Dataset(data_vars=dict(streamflow=(['date', 'site'], xr_data)),
                     coords=dict(site=site_list,
                                 date=date_list))

data_ds

[12]:

# The following saves this file as a .nc files in the same directory as this notebook.
# You may provide a full file path to save the file(s) elsewhere in your local directory.
data_ds.to_netcdf('streamflow_observations.nc')