Introducing vector data cubes !¶

{note}
note to self
- 2 inline images not working, need to fix
- example section is a little disjointed
- might still be too long
- maybe don't use dTdt in example
- too many .explore() outputs at end + fix diff output sizes

The Xarray ecosystem now supports vector data cubes :tada. Just like raster data such as satellite imagery and modeled climate data, geospatial datasets representing real-world features as points, lines, and polygons are increasingly large, complex, and multidimensional, pushing the limitations of existing tools for working with these data types. We've seen growing interest in multidimensional representations of vector data across the geospatial scientific community and are excited to announce the extension of the Xarray data model to support vector geometries. Vector data cubes enable new ways of representing and interacting with geospatial data, extending powerful Xarray features like label-based indexing, selection, and computation to geometries.

This blog post is geared toward Xarray users and analysts working with geospatial datasets. We'll first introduce the concept of vector data cubes, discuss how they differ from existing tooling, and the analytical opportunities they create. Next, we will demonstrate how vector data cubes can be used to streamline and augment scientific workflows by showing how to construct a vector data cube from ERA5 reanalysis data sampled at a set of geometries representing points of interest.

[this will come later, but maybe add a short para here thanking/shouting out all of the people/groups who made this possible?/introduce pt2 here?]

{note}
The R programming language also has strong support for vector data cubes, which inspired much of this work. For information about vector data cube software in R, check out the [`stars`](https://r-spatial.github.io/stars/) package, and for a very detailed background on vector data cubes, check out this [post](https://r-spatial.org/r/2022/09/12/vdc.html) by Edzer Pebesma, a `stars` developer.

Background¶

Raster data cubes, vector data frames¶

The Python ecosystem has robust tools built to work with both raster and vector data. Geospatial raster data are typically stored as cubes, with dimensions such as [x, y] or [longitude, latitude] representing the spatial domain of the dataset. In addition to spatial dimensions, raster data cubes can have additional dimensions such as time and level (this commonly refers to the vertical domain of a model above earth's surface). These data are typically stored on disk in array-based formats such as NetCDF or Zarr, and software like Xarray offer powerful tools for processing and analyzing multi-dimensional array data in Python. In contrast, geospatial vector data, which model real-world features like rivers, counties, and buildings as geometry objects, are typically stored as tables and formatted on disk in a vector-specific container like GeoJSON or Shapefile. The GeoPandas library in Python extends the Pandas framework, representing vector data as data frames that have additional functionality related to spatial operations and handling coordinate reference system information.

This brief overview highlights a key difference between the traditional models of raster and vector data: raster data is viewed as a cube, while vector data is discussed as a data frame. Both of these are powerful and functional formats, but they are optimized for different types of operations. For this reason, framing raster data as cubes and vector data as data frames can influence both how we, as analysts, think about our data and the analytical questions we ask with it.

As an example, let's consider what a data frame means in the context of spatial data. A data frame is a 'structure that organizes data into a 2-dimensional table of rows and columns' (cite). In many cases, this is an ideal format for vector data, and tools like GeoPandas enable a range of spatial operations that are built on top of this model (note:redundant with sentence in paragraph above?). geopandas.GeoDataFrames are similar to pandas.DataFrames: they have an index and a number of data columns, but they also have a geometry column which store shapely geometry objects. Users can efficiently store vector data with easily accessible attribute data and perform operations that generate knowledge about the relationships between spatial features and associated attribute data.

geodataframe

Pushing the limits of data frames¶

What about situations where the data frame format can be limiting (note: reword)? Imagine we have a collection of weather stations located throughout a river basin. Each weather station records data such as temperature, windspeed, and relative humidity. These are stored in the data columns of a geopandas.GeoDataFrame, while the coordinates of each weather station are stored as Point geometries in the geometry column. From this table, we can quickly access a lot of information and ask questions such as how do temperatures vary across the elevation range covered by the weather stations, and where are windspeeds highest? But, each time the weather station records a measurement, we get a new set of data for each variable. Within the structure of the GeoDataFrame, how should that new data be incorporated? There are a number of strategies for organizing information in data frames such as wide form (the columns of this data frame are the cross-join of the implicit dimensions of the dataset) and long form (the number of columns is minimized while a row is added for every combination of dimensions) data frames [Pebesma, 2022]. These let the user store multidimensional data in a way that makes it easy to query and subset. However, the data frame column structure is still fundamentally one-dimensional and these strategies involve duplicating data along either the column or row dimension.

Returning to the weather station example, when we stack multiple temporal observations from weather stations into a data frame, we've added a dimension to the data. By adding columns or adding rows to incorporate the new data, we are essentially looking for ways to represent additional dimensions within the structure of a data frame. Vector datasets are frequently treated as 'flat' or where the spatial dimension is the only required functional dimension; but what happens when vector datasets contain additional dimensions like time? Contrast this to raster data cubes, where data is explicitly represented as multi-dimensional, and popular tools and data models reflect this fundamental shape. What would it look like, and how would our workflows change if vector data were represented by cube structures similarly to raster data? [reword]

Figure 1: Caption for both images

Image 1 Image 2

Example of a raster data cube (left) and vector data cube (right). The raster data cube has a gridded structure with latitude, longitude and time dimensions. The vector data cube has a dimension of geometries, a dimension of data variables and a time dimension.

What are vector data cubes?¶

Vector data cubes are n-dimensional arrays with at least one dimension that is an array of geometries (Pebesma, 2022). Where the spatial domain of a raster object is a gridded extent (eg. rdc = {'x': [0,1,2,3,4,5,6,7,8,9], 'y':[0,1,2,3,4,5,6,7,8,9]}), the spatial domain of a vector data cube is an array of points, lines or polygons representing real-world features:

vdc = {'counties':
[<POLYGON ((-95.343 48.547, -95.341  48.715, -95.095 48.912, ...>,
<POLYGON ((-118.851 47.95, -118.847 48.479, -118.87 48.647, ...>,
<POLYGON ((-117.438 48.044,-117.541 47.79,  -117.607 47.798,...>,

With vector data cubes, we can now work with n-dimensional geometric datasets! :start-struck

What can we do with vector data cubes?¶

Multi-dimensional representations of vector data open new directions of scientific inquiry and analysis. As discussed above, this data model streamlines analyses such as examining attribute data (such as temperature and relative humidity) variability over time (or another dimension). Similarly, one could observe how geospatial features change over time: tracking areal extents such as forest boundaries and agricultural parcels, or non-stationary monitoring systems such as ocean data buoys would be simplified by vector data cube representations.

A particularly exciting use-case is sampling gridded raster data at a number of points of interest, going from a raster data cube covering a large spatial extent to a vector data cube containing only the spatial areas of interest. The following example will demonstrate how to construct a vector data cube by sampling ERA5 climate model reanalysis data at a number of point locations representing cities across Europe. We will show both how to assemble this data into a cube and how the cube structure abridges analytical workflows aimed at answering a set of general scientific questions about the data.

Understanding climate variability in population centers across Europe¶

This example will demonstrate how the process of wrangling and querying data to answer a scientific question becomes more efficient and intuitive when we structure the data as a vector data cube. For this example, we are interested in understanding climatic conditions in cities across Europe using ERA5, a cliamte model reanalysis dataset. Climate models produced massive datasets with a number of physical variables that expand our understanding of the climate system beyond what is possible through direct monitoring and observation. ERA5 is a dataset that provides estimates for 32 climate variables for the entire globe for every 6-hour window from 1959 to 2022.

These are big files! Thankfully, we can access it from Google Cloud Storage without having to download the data to our local computer.

Assemble vector data cube¶

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#imports 
import fsspec
import xarray as xr
from datasets import load_dataset
import pandas as pd
import geopandas as gpd
import cf_xarray
import xvec
import numpy as np
import shapely
import matplotlib.pyplot as plt
1
fs = fsspec.filesystem('gs')

Read raster data¶

We will use an 'analysis-ready' dataset, which means that a lot of work preparing the dataset has been done for us. Read more about analysis-ready data and ERA5 processing here.

The ERA5 dataset is stored in Google Cloud Storage as a Zarr datacube. Use Xarray to read it into python:

1
2
3
4
5
era5_ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2',
    chunks={'time': 48},
    consolidated=True,
).isel(level=0)

This is a ton of data! :dizzy_face

1
print(f'size: {era5_ds.nbytes / (1024 ** 4)} TiB')
size: 6.257914529454865 TiB
1
era5_ds.isel(time=-2)['2m_temperature'].plot(robust=True, figsize=(10,6));
No description has been provided for this image

Let's extract a shorter time series that will be easier to work with and specify only the data variables we'd like to use in our analysis.

1
era5_ds_sub = era5_ds.sel(time=slice('2017-01','2022-01'))[['2m_temperature','u_component_of_wind']]

Read vector data¶

ERA5 outputs are gridded datasets covering the entire globe. We only need data about cities in Europe, so continuing to store the global dataset is a pain and unnecessary. The following cell reads a vector dataset from Hugging Face and formats it as a gpd.GeoDataFrame containing data covering the European continent.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
cities = load_dataset('jamescalam/world-cities-geo',split='train')
cities_df = pd.DataFrame({'city':cities['city'],
                          'country':cities['country'],
                          'region':cities['region'],
                          'latitude':cities['latitude'],
                          'longitude':cities['longitude'],
                          'continent':cities['continent'],
                         'x':cities['x'], 'y':cities['y'], 'z':cities['z']})

def lon_360(row):
    new_lon = ((360 + (row['longitude'] % 360)) % 360)
    return new_lon

cities_df['lon_360'] = cities_df.apply(lon_360, axis=1)
cities_eur = cities_df.loc[cities_df['continent'] == 'Europe']
cities_eur = gpd.GeoDataFrame(
    cities_eur,
    geometry = gpd.points_from_xy(cities_eur.longitude, cities_eur.latitude),
    crs = 'EPSG:4326'
)
Repo card metadata block was not found. Setting CardData to empty.

Now we can visualize and query the spatial components of the vector dataset.

1
cities_eur.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook

We can also take a look at the information stored in the attributes of the GeoDataFrame; it is all scalar, meaning that a dataframe is an ideal representation of this data.

1
cities_eur.head()
city country region latitude longitude continent x y z lon_360 geometry
84 Tirana Albania Southern Europe 41.327500 19.818890 Europe 4500.907258 1622.102741 4207.167403 19.818890 POINT (19.81889 41.32750)
85 Durres Albania Southern Europe 41.323056 19.441389 Europe 4511.804644 1592.521568 4206.796276 19.441389 POINT (19.44139 41.32306)
86 Elbasan Albania Southern Europe 41.112500 20.082222 Europe 4508.200233 1648.181036 4189.184997 20.082222 POINT (20.08222 41.11250)
87 Vlore Albania Southern Europe 40.466667 19.489721 Europe 4569.228960 1617.126398 4134.814376 19.489721 POINT (19.48972 40.46667)
88 Shkoder Albania Southern Europe 42.068283 19.512577 Europe 4457.868549 1579.715398 4268.670549 19.512577 POINT (19.51258 42.06828)

Create a vector data cube with 'cities' data¶

xr.DataArray from 'cities' gpd.GeoDataFrame

To answer our question, we need ERA5 data at the locations in the GeoDataFrame above. We also want to know how the variables change over time, so the new object should have a time dimension. As a first step, construct a xr.DataArray from the cities gpd.GeoDataFrame; this will create a data cube with a geometry dimension.

1
2
3
4
5
6
7
8
europe_ds = xr.Dataset(
    {
        'city': ('geometry', cities_eur['city']),
        'country': ('geometry', cities_eur['country']),
        'lat': ('geometry', cities_eur['latitude']),
        'lon': ('geometry', cities_eur['lon_360']),
    },
    coords= {'geometry': cities_eur['geometry']})
1
europe_ds
<xarray.Dataset> Size: 114kB
Dimensions:   (geometry: 2844)
Coordinates:
  * geometry  (geometry) object 23kB POINT (19.8188896 41.3275) ... POINT (30...
Data variables:
    city      (geometry) object 23kB 'Tirana' 'Durres' ... 'Shepetivka' 'Bucha'
    country   (geometry) object 23kB 'Albania' 'Albania' ... 'Ukraine' 'Ukraine'
    lat       (geometry) float64 23kB 41.33 41.32 41.11 ... 50.75 50.18 50.57
    lon       (geometry) float64 23kB 19.82 19.44 20.08 ... 33.47 27.07 30.22
xarray.Dataset
    • geometry: 2844
    • geometry
      (geometry)
      object
      POINT (19.8188896 41.3275) ... P...
      array([<POINT (19.819 41.328)>, <POINT (19.441 41.323)>,
             <POINT (20.082 41.112)>, ..., <POINT (33.467 50.75)>,
             <POINT (27.067 50.183)>, <POINT (30.217 50.567)>], dtype=object)
    • city
      (geometry)
      object
      'Tirana' 'Durres' ... 'Bucha'
      array(['Tirana', 'Durres', 'Elbasan', ..., 'Romny', 'Shepetivka', 'Bucha'],
            dtype=object)
    • country
      (geometry)
      object
      'Albania' 'Albania' ... 'Ukraine'
      array(['Albania', 'Albania', 'Albania', ..., 'Ukraine', 'Ukraine',
             'Ukraine'], dtype=object)
    • lat
      (geometry)
      float64
      41.33 41.32 41.11 ... 50.18 50.57
      array([41.3275   , 41.3230556, 41.1125   , ..., 50.75     , 50.1833333,
             50.5666667])
    • lon
      (geometry)
      float64
      19.82 19.44 20.08 ... 27.07 30.22
      array([19.8188896, 19.4413891, 20.082222 , ..., 33.4666672, 27.0666676,
             30.2166672])
    • geometry
      PandasIndex
      PandasIndex(Index([   POINT (19.8188896 41.3275), POINT (19.4413891 41.3230556),
                 POINT (20.082222 41.1125), POINT (19.4897213 40.4666667),
             POINT (19.5125771 42.0682829), POINT (19.5666676 40.7166667),
             POINT (20.7808342 40.6186111), POINT (19.9522228 40.7058333),
             POINT (19.7049999 40.9419444), POINT (19.5569439 41.1855556),
             ...
             POINT (38.6616669 48.5069444), POINT (33.6454582 49.0097264),
             POINT (33.5077782 48.3436111),  POINT (29.916666 50.0833333),
             POINT (33.3802795 46.7508333),  POINT (34.894165 50.3086111),
                 POINT (39.7400017 48.295),      POINT (33.4666672 50.75),
             POINT (27.0666676 50.1833333), POINT (30.2166672 50.5666667)],
            dtype='object', name='geometry', length=2844))

However, taking a look at this object, you'll see that its index is a PandasIndex, meaning that Xarray doesn't understand that the geom dimension represents spatial data. This is where Xvec comes in - Xvec is a library for working with vector data cubes in Xarray. It adds extensions to Xarray objects that facilitate the features (such as spatially-aware indexes, handling projections, and interoperability with GeoPandas) that make vector data cubes in Xarray possible.

Convert the PandasIndex above to a GeometryIndex using Xvec's set_geom_indexes() method. GeometryIndexes maintain the functionality of PandasIndexes but they also can perform indexing and filtering based on geometries and have attributes specifying a coordinate reference system.

1
europe_ds = europe_ds.xvec.set_geom_indexes('geometry', crs=4326)
1
europe_ds.xindexes
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)

We now have a vector datacube :confetti-ball ! Remember that a vector data cube is any n-dimensional array with a spatial dimension mapping to a set of vector geometries. The europe_ds object is currently a 1-dimensional vector data cube, meaning it is the simplest form of a vector data cube, and is functionally the same as a data frame. Let's incorporate additional data into this object to see the utility of representing it as a data cube.

Sample raster data cube with geometries from vector data cube¶

We've created an 'empty' vector data cube (europe_ds) storing data about European cities. This object has the geographic information needed to communicate with the xr.Dataset storing ERA5 data (era5_ds_sub). We now need to interpolate the raster data cube onto the coordinates of European cities. The ERA5 dataset is indexed along 'latitude' and 'longitude' dimensions, while the vector data cube is indexed along a 'geometry' dimension. To interpolate ERA5 data onto the coordinate grid of cities, we need to pass era5_ds_sub information it understands about the locations of the cities (ie. the latitude and longitude variables of europe_ds not the geom coordinate dimension). (note: last sentence redundant?)

1
era5_ds_sub
<xarray.Dataset> Size: 61GB
Dimensions:              (time: 7304, latitude: 721, longitude: 1440)
Coordinates:
  * latitude             (latitude) float32 3kB 90.0 89.75 89.5 ... -89.75 -90.0
    level                int64 8B 94256640801872
  * longitude            (longitude) float32 6kB 0.0 0.25 0.5 ... 359.5 359.8
  * time                 (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
Data variables:
    2m_temperature       (time, latitude, longitude) float32 30GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
    u_component_of_wind  (time, latitude, longitude) float32 30GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
xarray.Dataset
    • time: 7304
    • latitude: 721
    • longitude: 1440
    • latitude
      (latitude)
      float32
      90.0 89.75 89.5 ... -89.75 -90.0
      long_name :
      latitude
      units :
      degrees_north
      array([ 90.  ,  89.75,  89.5 , ..., -89.5 , -89.75, -90.  ], dtype=float32)
    • level
      ()
      int64
      94256640801872
      array(94256640801872)
    • longitude
      (longitude)
      float32
      0.0 0.25 0.5 ... 359.2 359.5 359.8
      long_name :
      longitude
      units :
      degrees_east
      array([0.0000e+00, 2.5000e-01, 5.0000e-01, ..., 3.5925e+02, 3.5950e+02,
             3.5975e+02], dtype=float32)
    • time
      (time)
      datetime64[ns]
      2017-01-01 ... 2021-12-31T18:00:00
      array(['2017-01-01T00:00:00.000000000', '2017-01-01T06:00:00.000000000',
             '2017-01-01T12:00:00.000000000', ..., '2021-12-31T06:00:00.000000000',
             '2021-12-31T12:00:00.000000000', '2021-12-31T18:00:00.000000000'],
            dtype='datetime64[ns]')
    • 2m_temperature
      (time, latitude, longitude)
      float32
      dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
      long_name :
      2 metre temperature
      short_name :
      t2m
      units :
      K
      Array Chunk
      Bytes 28.25 GiB 190.11 MiB
      Shape (7304, 721, 1440) (48, 721, 1440)
      Dask graph 153 chunks in 3 graph layers
      Data type float32 numpy.ndarray
      1440 721 7304
    • u_component_of_wind
      (time, latitude, longitude)
      float32
      dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
      long_name :
      U component of wind
      short_name :
      u
      standard_name :
      eastward_wind
      units :
      m s**-1
      Array Chunk
      Bytes 28.25 GiB 190.11 MiB
      Shape (7304, 721, 1440) (48, 721, 1440)
      Dask graph 153 chunks in 4 graph layers
      Data type float32 numpy.ndarray
      1440 721 7304
    • latitude
      PandasIndex
      PandasIndex(Index([  90.0,  89.75,   89.5,  89.25,   89.0,  88.75,   88.5,  88.25,   88.0,
              87.75,
             ...
             -87.75,  -88.0, -88.25,  -88.5, -88.75,  -89.0, -89.25,  -89.5, -89.75,
              -90.0],
            dtype='float32', name='latitude', length=721))
    • longitude
      PandasIndex
      PandasIndex(Index([   0.0,   0.25,    0.5,   0.75,    1.0,   1.25,    1.5,   1.75,    2.0,
               2.25,
             ...
              357.5, 357.75,  358.0, 358.25,  358.5, 358.75,  359.0, 359.25,  359.5,
             359.75],
            dtype='float32', name='longitude', length=1440))
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 06:00:00',
                     '2017-01-01 12:00:00', '2017-01-01 18:00:00',
                     '2017-01-02 00:00:00', '2017-01-02 06:00:00',
                     '2017-01-02 12:00:00', '2017-01-02 18:00:00',
                     '2017-01-03 00:00:00', '2017-01-03 06:00:00',
                     ...
                     '2021-12-29 12:00:00', '2021-12-29 18:00:00',
                     '2021-12-30 00:00:00', '2021-12-30 06:00:00',
                     '2021-12-30 12:00:00', '2021-12-30 18:00:00',
                     '2021-12-31 00:00:00', '2021-12-31 06:00:00',
                     '2021-12-31 12:00:00', '2021-12-31 18:00:00'],
                    dtype='datetime64[ns]', name='time', length=7304, freq=None))
1
europe_ds
<xarray.Dataset> Size: 114kB
Dimensions:   (geometry: 2844)
Coordinates:
  * geometry  (geometry) object 23kB POINT (19.8188896 41.3275) ... POINT (30...
Data variables:
    city      (geometry) object 23kB 'Tirana' 'Durres' ... 'Shepetivka' 'Bucha'
    country   (geometry) object 23kB 'Albania' 'Albania' ... 'Ukraine' 'Ukraine'
    lat       (geometry) float64 23kB 41.33 41.32 41.11 ... 50.75 50.18 50.57
    lon       (geometry) float64 23kB 19.82 19.44 20.08 ... 33.47 27.07 30.22
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
xarray.Dataset
    • geometry: 2844
    • geometry
      (geometry)
      object
      POINT (19.8188896 41.3275) ... P...
      crs :
      EPSG:4326
      array([<POINT (19.819 41.328)>, <POINT (19.441 41.323)>,
             <POINT (20.082 41.112)>, ..., <POINT (33.467 50.75)>,
             <POINT (27.067 50.183)>, <POINT (30.217 50.567)>], dtype=object)
    • city
      (geometry)
      object
      'Tirana' 'Durres' ... 'Bucha'
      array(['Tirana', 'Durres', 'Elbasan', ..., 'Romny', 'Shepetivka', 'Bucha'],
            dtype=object)
    • country
      (geometry)
      object
      'Albania' 'Albania' ... 'Ukraine'
      array(['Albania', 'Albania', 'Albania', ..., 'Ukraine', 'Ukraine',
             'Ukraine'], dtype=object)
    • lat
      (geometry)
      float64
      41.33 41.32 41.11 ... 50.18 50.57
      array([41.3275   , 41.3230556, 41.1125   , ..., 50.75     , 50.1833333,
             50.5666667])
    • lon
      (geometry)
      float64
      19.82 19.44 20.08 ... 27.07 30.22
      array([19.8188896, 19.4413891, 20.082222 , ..., 33.4666672, 27.0666676,
             30.2166672])
    • geometry
      GeometryIndex (crs=EPSG:4326)
      <xvec.index.GeometryIndex object at 0x7f85b8f66210>
1
2
era5_europe_cities = era5_ds_sub.xvec.extract_points(europe_ds.geometry.data, x_coords = 'longitude',
                                                     y_coords='latitude')

Cool, now we have a 2-dimensional vector data cube! We went from a 3-d raster datacube with latitude and longitude dimensions to a 2-d datacube where the only spatial dimension is 'geometry' (each element of the geometry dimension is a different city from our original dataset). Our dataset is also much smaller now :thumbsup

1
print(f'size: {era5_europe_cities.nbytes / (1024 ** 3)} GB')
size: 0.15484336763620377 GB
1
era5_europe_cities
<xarray.Dataset> Size: 166MB
Dimensions:              (time: 7304, geometry: 2844)
Coordinates:
    level                int64 8B 94256640801872
  * time                 (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
    2m_temperature       (time, geometry) float32 83MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
    u_component_of_wind  (time, geometry) float32 83MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
Indexes:
    geometry  GeometryIndex (crs=None)
xarray.Dataset
    • time: 7304
    • geometry: 2844
    • level
      ()
      int64
      94256640801872
      array(94256640801872)
    • time
      (time)
      datetime64[ns]
      2017-01-01 ... 2021-12-31T18:00:00
      array(['2017-01-01T00:00:00.000000000', '2017-01-01T06:00:00.000000000',
             '2017-01-01T12:00:00.000000000', ..., '2021-12-31T06:00:00.000000000',
             '2021-12-31T12:00:00.000000000', '2021-12-31T18:00:00.000000000'],
            dtype='datetime64[ns]')
    • geometry
      (geometry)
      object
      POINT (19.8188896 41.3275) ... P...
      array([<POINT (19.819 41.328)>, <POINT (19.441 41.323)>,
             <POINT (20.082 41.112)>, ..., <POINT (33.467 50.75)>,
             <POINT (27.067 50.183)>, <POINT (30.217 50.567)>], dtype=object)
    • 2m_temperature
      (time, geometry)
      float32
      dask.array<chunksize=(28, 2844), meta=np.ndarray>
      long_name :
      2 metre temperature
      short_name :
      t2m
      units :
      K
      Array Chunk
      Bytes 79.24 MiB 533.25 kiB
      Shape (7304, 2844) (48, 2844)
      Dask graph 153 chunks in 5 graph layers
      Data type float32 numpy.ndarray
      2844 7304
    • u_component_of_wind
      (time, geometry)
      float32
      dask.array<chunksize=(28, 2844), meta=np.ndarray>
      long_name :
      U component of wind
      short_name :
      u
      standard_name :
      eastward_wind
      units :
      m s**-1
      Array Chunk
      Bytes 79.24 MiB 533.25 kiB
      Shape (7304, 2844) (48, 2844)
      Dask graph 153 chunks in 6 graph layers
      Data type float32 numpy.ndarray
      2844 7304
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 06:00:00',
                     '2017-01-01 12:00:00', '2017-01-01 18:00:00',
                     '2017-01-02 00:00:00', '2017-01-02 06:00:00',
                     '2017-01-02 12:00:00', '2017-01-02 18:00:00',
                     '2017-01-03 00:00:00', '2017-01-03 06:00:00',
                     ...
                     '2021-12-29 12:00:00', '2021-12-29 18:00:00',
                     '2021-12-30 00:00:00', '2021-12-30 06:00:00',
                     '2021-12-30 12:00:00', '2021-12-30 18:00:00',
                     '2021-12-31 00:00:00', '2021-12-31 06:00:00',
                     '2021-12-31 12:00:00', '2021-12-31 18:00:00'],
                    dtype='datetime64[ns]', name='time', length=7304, freq=None))
    • geometry
      GeometryIndex (crs=None)
      <xvec.index.GeometryIndex object at 0x7f85b9d71340>

The above operation interpolated the ERA5 data onto the coordinates from europe_ds but in the process we lost the data variables describing the name and country of each city. Add those onto the interpolated vector data cube and drop the level coordinate variable, which we don't need.

1
2
3
era5_europe_cities['city'] = europe_ds['city']
era5_europe_cities['country'] = europe_ds['country']
era5_europe_cities = era5_europe_cities.drop_vars('level')

Up until now, we have been performing lazy operations, meaning that we haven't actually loaded the ERA5 data we accessed from Google Cloud Storage into memory on our local machine. We'll do this now so that we can perform computations that require the data in memory.

1
from dask.diagnostics import ProgressBar 
1
2
with ProgressBar(): 
    era5_europe_cities = era5_europe_cities.load() 
[########################################] | 100% Completed | 15m 58s
1
era5_europe_cities
<xarray.Dataset> Size: 166MB
Dimensions:              (time: 7304, geometry: 2844)
Coordinates:
  * time                 (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
    2m_temperature       (time, geometry) float32 83MB 271.9 278.4 ... nan nan
    u_component_of_wind  (time, geometry) float32 83MB 112.1 112.7 ... 41.82
    city                 (geometry) object 23kB 'Tirana' 'Durres' ... 'Bucha'
    country              (geometry) object 23kB 'Albania' ... 'Ukraine'
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
xarray.Dataset
    • time: 7304
    • geometry: 2844
    • time
      (time)
      datetime64[ns]
      2017-01-01 ... 2021-12-31T18:00:00
      array(['2017-01-01T00:00:00.000000000', '2017-01-01T06:00:00.000000000',
             '2017-01-01T12:00:00.000000000', ..., '2021-12-31T06:00:00.000000000',
             '2021-12-31T12:00:00.000000000', '2021-12-31T18:00:00.000000000'],
            dtype='datetime64[ns]')
    • geometry
      (geometry)
      object
      POINT (19.8188896 41.3275) ... P...
      crs :
      EPSG:4326
      array([<POINT (19.819 41.328)>, <POINT (19.441 41.323)>,
             <POINT (20.082 41.112)>, ..., <POINT (33.467 50.75)>,
             <POINT (27.067 50.183)>, <POINT (30.217 50.567)>], dtype=object)
    • 2m_temperature
      (time, geometry)
      float32
      271.9 278.4 269.9 ... nan nan nan
      long_name :
      2 metre temperature
      short_name :
      t2m
      units :
      K
      array([[271.87958, 278.35577, 269.85394, ..., 269.92792, 271.24445,
              272.31277],
             [270.30252, 277.62662, 267.76663, ..., 271.04095, 271.48956,
              272.74133],
             [281.74112, 282.03555, 281.34186, ..., 272.17865, 274.12875,
              273.7341 ],
             ...,
             [278.5096 , 282.59393, 277.4078 , ..., 267.58942, 274.69095,
              271.959  ],
             [287.78763, 286.99976, 287.64877, ..., 270.57343, 276.1852 ,
              275.06375],
             [      nan,       nan,       nan, ...,       nan,       nan,
                    nan]], dtype=float32)
    • u_component_of_wind
      (time, geometry)
      float32
      112.1 112.7 110.0 ... 33.92 41.82
      long_name :
      U component of wind
      short_name :
      u
      standard_name :
      eastward_wind
      units :
      m s**-1
      array([[112.14053  , 112.654465 , 110.033875 , ..., 139.53139  ,
              138.74988  , 141.79944  ],
             [109.49022  , 109.24812  , 108.42415  , ..., 113.78     ,
              127.885284 , 129.67764  ],
             [116.612946 , 116.66391  , 114.61247  , ..., 117.72574  ,
              131.63141  , 124.44496  ],
             ...,
             [-24.250381 , -23.992752 , -25.307579 , ...,  12.396057 ,
               -9.374149 ,   0.5137253],
             [-18.93332  , -18.946648 , -18.782288 , ...,  12.840256 ,
                8.216141 ,   8.793602 ],
             [-16.161514 , -15.952744 , -17.574066 , ...,  50.081924 ,
               33.921955 ,  41.82426  ]], dtype=float32)
    • city
      (geometry)
      object
      'Tirana' 'Durres' ... 'Bucha'
      array(['Tirana', 'Durres', 'Elbasan', ..., 'Romny', 'Shepetivka', 'Bucha'],
            dtype=object)
    • country
      (geometry)
      object
      'Albania' 'Albania' ... 'Ukraine'
      array(['Albania', 'Albania', 'Albania', ..., 'Ukraine', 'Ukraine',
             'Ukraine'], dtype=object)
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 06:00:00',
                     '2017-01-01 12:00:00', '2017-01-01 18:00:00',
                     '2017-01-02 00:00:00', '2017-01-02 06:00:00',
                     '2017-01-02 12:00:00', '2017-01-02 18:00:00',
                     '2017-01-03 00:00:00', '2017-01-03 06:00:00',
                     ...
                     '2021-12-29 12:00:00', '2021-12-29 18:00:00',
                     '2021-12-30 00:00:00', '2021-12-30 06:00:00',
                     '2021-12-30 12:00:00', '2021-12-30 18:00:00',
                     '2021-12-31 00:00:00', '2021-12-31 06:00:00',
                     '2021-12-31 12:00:00', '2021-12-31 18:00:00'],
                    dtype='datetime64[ns]', name='time', length=7304, freq=None))
    • geometry
      GeometryIndex (crs=EPSG:4326)
      <xvec.index.GeometryIndex object at 0x7fe863ce1160>
1
2
3
era5_europe_cities = era5_europe_cities.assign_coords({'city':era5_europe_cities['city'],
                                                       'country':era5_europe_cities['country']
                                                      })
1
era5_europe_cities
<xarray.Dataset> Size: 166MB
Dimensions:              (time: 7304, geometry: 2844)
Coordinates:
  * time                 (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
    city                 (geometry) object 23kB 'Tirana' 'Durres' ... 'Bucha'
    country              (geometry) object 23kB 'Albania' ... 'Ukraine'
Data variables:
    2m_temperature       (time, geometry) float32 83MB 271.9 278.4 ... nan nan
    u_component_of_wind  (time, geometry) float32 83MB 112.1 112.7 ... 41.82
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
xarray.Dataset
    • time: 7304
    • geometry: 2844
    • time
      (time)
      datetime64[ns]
      2017-01-01 ... 2021-12-31T18:00:00
      array(['2017-01-01T00:00:00.000000000', '2017-01-01T06:00:00.000000000',
             '2017-01-01T12:00:00.000000000', ..., '2021-12-31T06:00:00.000000000',
             '2021-12-31T12:00:00.000000000', '2021-12-31T18:00:00.000000000'],
            dtype='datetime64[ns]')
    • geometry
      (geometry)
      object
      POINT (19.8188896 41.3275) ... P...
      crs :
      EPSG:4326
      array([<POINT (19.819 41.328)>, <POINT (19.441 41.323)>,
             <POINT (20.082 41.112)>, ..., <POINT (33.467 50.75)>,
             <POINT (27.067 50.183)>, <POINT (30.217 50.567)>], dtype=object)
    • city
      (geometry)
      object
      'Tirana' 'Durres' ... 'Bucha'
      array(['Tirana', 'Durres', 'Elbasan', ..., 'Romny', 'Shepetivka', 'Bucha'],
            dtype=object)
    • country
      (geometry)
      object
      'Albania' 'Albania' ... 'Ukraine'
      array(['Albania', 'Albania', 'Albania', ..., 'Ukraine', 'Ukraine',
             'Ukraine'], dtype=object)
    • 2m_temperature
      (time, geometry)
      float32
      271.9 278.4 269.9 ... nan nan nan
      long_name :
      2 metre temperature
      short_name :
      t2m
      units :
      K
      array([[271.87958, 278.35577, 269.85394, ..., 269.92792, 271.24445,
              272.31277],
             [270.30252, 277.62662, 267.76663, ..., 271.04095, 271.48956,
              272.74133],
             [281.74112, 282.03555, 281.34186, ..., 272.17865, 274.12875,
              273.7341 ],
             ...,
             [278.5096 , 282.59393, 277.4078 , ..., 267.58942, 274.69095,
              271.959  ],
             [287.78763, 286.99976, 287.64877, ..., 270.57343, 276.1852 ,
              275.06375],
             [      nan,       nan,       nan, ...,       nan,       nan,
                    nan]], dtype=float32)
    • u_component_of_wind
      (time, geometry)
      float32
      112.1 112.7 110.0 ... 33.92 41.82
      long_name :
      U component of wind
      short_name :
      u
      standard_name :
      eastward_wind
      units :
      m s**-1
      array([[112.14053  , 112.654465 , 110.033875 , ..., 139.53139  ,
              138.74988  , 141.79944  ],
             [109.49022  , 109.24812  , 108.42415  , ..., 113.78     ,
              127.885284 , 129.67764  ],
             [116.612946 , 116.66391  , 114.61247  , ..., 117.72574  ,
              131.63141  , 124.44496  ],
             ...,
             [-24.250381 , -23.992752 , -25.307579 , ...,  12.396057 ,
               -9.374149 ,   0.5137253],
             [-18.93332  , -18.946648 , -18.782288 , ...,  12.840256 ,
                8.216141 ,   8.793602 ],
             [-16.161514 , -15.952744 , -17.574066 , ...,  50.081924 ,
               33.921955 ,  41.82426  ]], dtype=float32)
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 06:00:00',
                     '2017-01-01 12:00:00', '2017-01-01 18:00:00',
                     '2017-01-02 00:00:00', '2017-01-02 06:00:00',
                     '2017-01-02 12:00:00', '2017-01-02 18:00:00',
                     '2017-01-03 00:00:00', '2017-01-03 06:00:00',
                     ...
                     '2021-12-29 12:00:00', '2021-12-29 18:00:00',
                     '2021-12-30 00:00:00', '2021-12-30 06:00:00',
                     '2021-12-30 12:00:00', '2021-12-30 18:00:00',
                     '2021-12-31 00:00:00', '2021-12-31 06:00:00',
                     '2021-12-31 12:00:00', '2021-12-31 18:00:00'],
                    dtype='datetime64[ns]', name='time', length=7304, freq=None))
    • geometry
      GeometryIndex (crs=EPSG:4326)
      <xvec.index.GeometryIndex object at 0x7f85bb9bb410>

Using a vector data cube¶

We've shown how to combine raster and vector data to create a vector data cube with two indexes (geometry and time). Now, we will demonstrate how to use this object in a simple example of a scientific analysis. This example focuses on operations that are likely found across a range of workflows such as:

  • Using Xarray functionality to compute and reduce along geometric and non-geometric dimensions,
  • Indexing and selection using geometric dimensions,
  • Adjusting vector data cubes to use Xarray plotting methods, (take out)
  • Converting vector data cubes to Geopandas GeoDataFrames for geographic visualizations.

1. Computation and grouping along a time dimension¶

We can use Xarray to group observations by season and compute seasonal mean estimates. This allows us to smoothly change the resolution of the time dimension while preserving the geometry dimension.

1
era5_europe_cities_seasons = era5_europe_cities.groupby(era5_europe_cities['time'].dt.season).mean()
1
era5_europe_cities_seasons
<xarray.Dataset> Size: 159kB
Dimensions:              (season: 4, geometry: 2844)
Coordinates:
  * geometry             (geometry) object 23kB POINT (19.8188896 41.3275) .....
    city                 (geometry) object 23kB 'Tirana' 'Durres' ... 'Bucha'
    country              (geometry) object 23kB 'Albania' ... 'Ukraine'
  * season               (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
    2m_temperature       (season, geometry) float32 46kB 280.6 283.1 ... 282.7
    u_component_of_wind  (season, geometry) float32 46kB 55.58 55.63 ... 47.79
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
xarray.Dataset
    • season: 4
    • geometry: 2844
    • geometry
      (geometry)
      object
      POINT (19.8188896 41.3275) ... P...
      crs :
      EPSG:4326
      array([<POINT (19.819 41.328)>, <POINT (19.441 41.323)>,
             <POINT (20.082 41.112)>, ..., <POINT (33.467 50.75)>,
             <POINT (27.067 50.183)>, <POINT (30.217 50.567)>], dtype=object)
    • city
      (geometry)
      object
      'Tirana' 'Durres' ... 'Bucha'
      array(['Tirana', 'Durres', 'Elbasan', ..., 'Romny', 'Shepetivka', 'Bucha'],
            dtype=object)
    • country
      (geometry)
      object
      'Albania' 'Albania' ... 'Ukraine'
      array(['Albania', 'Albania', 'Albania', ..., 'Ukraine', 'Ukraine',
             'Ukraine'], dtype=object)
    • season
      (season)
      object
      'DJF' 'JJA' 'MAM' 'SON'
      array(['DJF', 'JJA', 'MAM', 'SON'], dtype=object)
    • 2m_temperature
      (season, geometry)
      float32
      280.6 283.1 279.2 ... 282.6 282.7
      long_name :
      2 metre temperature
      short_name :
      t2m
      units :
      K
      array([[280.62354, 283.11908, 279.23575, ..., 270.3793 , 271.60138,
              271.44788],
             [297.49365, 297.15274, 297.14392, ..., 293.97604, 292.93857,
              294.0415 ],
             [287.5308 , 288.08292, 286.8431 , ..., 282.1084 , 282.28137,
              282.68982],
             [289.8003 , 291.75275, 288.78796, ..., 282.25952, 282.63968,
              282.73175]], dtype=float32)
    • u_component_of_wind
      (season, geometry)
      float32
      55.58 55.63 54.57 ... 47.58 47.79
      long_name :
      U component of wind
      short_name :
      u
      standard_name :
      eastward_wind
      units :
      m s**-1
      array([[ 55.575176,  55.630436,  54.574   , ...,  74.21834 ,  76.10485 ,
               75.07553 ],
             [-34.584896, -34.58256 , -34.817337, ..., -26.087788, -26.422586,
              -26.269844],
             [ 15.728514,  15.73295 ,  15.545878, ...,  17.109795,  17.090252,
               17.18853 ],
             [ 37.93523 ,  37.957623,  37.490055, ...,  47.977142,  47.582752,
               47.79004 ]], dtype=float32)
    • geometry
      GeometryIndex (crs=EPSG:4326)
      <xvec.index.GeometryIndex object at 0x7f858594aea0>
    • season
      PandasIndex
      PandasIndex(Index(['DJF', 'JJA', 'MAM', 'SON'], dtype='object', name='season'))

2. Spatial indexing¶

We can use the vector data cube's spatially-aware indexes to select points based on geometry and proximity to other points. As an example, find all cities within 0.5 degree of the city with the highest single temperature estimate:

1
max_temp_ob = era5_europe_cities.where(era5_europe_cities['2m_temperature'] == era5_europe_cities['2m_temperature'].max(), drop=True) 
1
2
3
4
era5_europe_cities.xvec.query(
    'geometry', 
    max_temp_ob.geometry.data[0].buffer(0.5)
)
<xarray.Dataset> Size: 175kB
Dimensions:              (time: 7304, geometry: 2)
Coordinates:
  * time                 (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
  * geometry             (geometry) object 16B POINT (-16.2546158 28.4682386)...
    city                 (geometry) object 16B 'Santa Cruz de Tenerife' 'La L...
    country              (geometry) object 16B 'Spain' 'Spain'
Data variables:
    2m_temperature       (time, geometry) float32 58kB 280.3 280.3 ... nan nan
    u_component_of_wind  (time, geometry) float32 58kB 20.11 20.11 ... -31.75
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
xarray.Dataset
    • time: 7304
    • geometry: 2
    • time
      (time)
      datetime64[ns]
      2017-01-01 ... 2021-12-31T18:00:00
      array(['2017-01-01T00:00:00.000000000', '2017-01-01T06:00:00.000000000',
             '2017-01-01T12:00:00.000000000', ..., '2021-12-31T06:00:00.000000000',
             '2021-12-31T12:00:00.000000000', '2021-12-31T18:00:00.000000000'],
            dtype='datetime64[ns]')
    • geometry
      (geometry)
      object
      POINT (-16.2546158 28.4682386) P...
      crs :
      EPSG:4326
      array([<POINT (-16.255 28.468)>, <POINT (-16.317 28.483)>], dtype=object)
    • city
      (geometry)
      object
      'Santa Cruz de Tenerife' 'La Lag...
      array(['Santa Cruz de Tenerife', 'La Laguna'], dtype=object)
    • country
      (geometry)
      object
      'Spain' 'Spain'
      array(['Spain', 'Spain'], dtype=object)
    • 2m_temperature
      (time, geometry)
      float32
      280.3 280.3 276.4 ... 294.3 nan nan
      long_name :
      2 metre temperature
      short_name :
      t2m
      units :
      K
      array([[280.3136 , 280.3136 ],
             [276.42108, 276.42108],
             [287.49585, 287.49585],
             ...,
             [281.391  , 281.391  ],
             [294.3005 , 294.3005 ],
             [      nan,       nan]], dtype=float32)
    • u_component_of_wind
      (time, geometry)
      float32
      20.11 20.11 18.9 ... -31.75 -31.75
      long_name :
      U component of wind
      short_name :
      u
      standard_name :
      eastward_wind
      units :
      m s**-1
      array([[ 20.114271,  20.114271],
             [ 18.899544,  18.899544],
             [ 10.494137,  10.494137],
             ...,
             [-44.22158 , -44.22158 ],
             [-34.52471 , -34.52471 ],
             [-31.752907, -31.752907]], dtype=float32)
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 06:00:00',
                     '2017-01-01 12:00:00', '2017-01-01 18:00:00',
                     '2017-01-02 00:00:00', '2017-01-02 06:00:00',
                     '2017-01-02 12:00:00', '2017-01-02 18:00:00',
                     '2017-01-03 00:00:00', '2017-01-03 06:00:00',
                     ...
                     '2021-12-29 12:00:00', '2021-12-29 18:00:00',
                     '2021-12-30 00:00:00', '2021-12-30 06:00:00',
                     '2021-12-30 12:00:00', '2021-12-30 18:00:00',
                     '2021-12-31 00:00:00', '2021-12-31 06:00:00',
                     '2021-12-31 12:00:00', '2021-12-31 18:00:00'],
                    dtype='datetime64[ns]', name='time', length=7304, freq=None))
    • geometry
      GeometryIndex (crs=EPSG:4326)
      <xvec.index.GeometryIndex object at 0x7f859b6fd880>

And fastest summer wind speed estimate:

1
max_u_summer_ob = era5_europe_cities_seasons.where(era5_europe_cities_seasons['u_component_of_wind'] == era5_europe_cities_seasons['u_component_of_wind'].max(), drop=True) 
1
2
3
4
max_summer_wind_buffer = era5_europe_cities.xvec.query(
    'geometry', 
    max_u_summer_ob.geometry.data[0].buffer(.5)
)

3. Convert Xarray objects to geopandas GeoDataFrame¶

This is a case where converting the xr.Dataset to a gpd.GeoDataFrame using Xvec's to_geodataframe() method is very helpful for visualizing and interpreting results; it allows us to view the results of computation in Xarray with spatial context:

1
era5_europe_cities_seasons['u_component_of_wind'].sel(season='DJF').xvec.to_geodataframe(geometry='geometry').explore('u_component_of_wind')
Make this Notebook Trusted to load map: File -> Trust Notebook

Wrap up¶

Summary¶

Vector data cubes offer a streamlined way to manipulate and analyze n-dimensional data indexed along geometric dimensions. New features associated with vector data cubes include: spatial indexing, CRS-aware geometry dimensions, interoperability with GeoPandas, and serialization. We look forward to seeing how vector data cubes are leveraged by the scientific community and the novel analytical workflows that arise from this data model.

As shown above, there are a few rough edges remaining around interoperability between Xarray vector data cubes and existing Xarray methods. Stay tuned for more updates to vector data cubes in Xarray and check out [insert repo / issues page?] if you're interested in getting involved.

Up next¶

Implementing vector data cubes is the result of hard work and development across the open-source community by a number of individuals and groups working on a range of packages. It's exciting to see collaborative efforts in the open-source community that lead to significant breakthroughs and improvements to user experience. In our next blog post, we will highlight all of the pieces and groups that came together to make vector data cubes possible.

Acknowledgments¶

References¶

note: add refs from old draft

1