Introducing vector data cubes !¶
{note}
note to self
- 2 inline images not working, need to fix
- example section is a little disjointed
- might still be too long
- maybe don't use dTdt in example
- too many .explore() outputs at end + fix diff output sizes
The Xarray ecosystem now supports vector data cubes :tada. Just like raster data such as satellite imagery and modeled climate data, geospatial datasets representing real-world features as points, lines, and polygons are increasingly large, complex, and multidimensional, pushing the limitations of existing tools for working with these data types. We've seen growing interest in multidimensional representations of vector data across the geospatial scientific community and are excited to announce the extension of the Xarray data model to support vector geometries. Vector data cubes enable new ways of representing and interacting with geospatial data, extending powerful Xarray features like label-based indexing, selection, and computation to geometries.
This blog post is geared toward Xarray users and analysts working with geospatial datasets. We'll first introduce the concept of vector data cubes, discuss how they differ from existing tooling, and the analytical opportunities they create. Next, we will demonstrate how vector data cubes can be used to streamline and augment scientific workflows by showing how to construct a vector data cube from ERA5 reanalysis data sampled at a set of geometries representing points of interest.
[this will come later, but maybe add a short para here thanking/shouting out all of the people/groups who made this possible?/introduce pt2 here?]
{note}
The R programming language also has strong support for vector data cubes, which inspired much of this work. For information about vector data cube software in R, check out the [`stars`](https://r-spatial.github.io/stars/) package, and for a very detailed background on vector data cubes, check out this [post](https://r-spatial.org/r/2022/09/12/vdc.html) by Edzer Pebesma, a `stars` developer.
Background¶
Raster data cubes, vector data frames¶
The Python ecosystem has robust tools built to work with both raster and vector data. Geospatial raster data are typically stored as cubes, with dimensions such as [x, y] or [longitude, latitude] representing the spatial domain of the dataset. In addition to spatial dimensions, raster data cubes can have additional dimensions such as time and level (this commonly refers to the vertical domain of a model above earth's surface). These data are typically stored on disk in array-based formats such as NetCDF or Zarr, and software like Xarray offer powerful tools for processing and analyzing multi-dimensional array data in Python. In contrast, geospatial vector data, which model real-world features like rivers, counties, and buildings as geometry objects, are typically stored as tables and formatted on disk in a vector-specific container like GeoJSON or Shapefile. The GeoPandas library in Python extends the Pandas framework, representing vector data as data frames that have additional functionality related to spatial operations and handling coordinate reference system information.
This brief overview highlights a key difference between the traditional models of raster and vector data: raster data is viewed as a cube, while vector data is discussed as a data frame. Both of these are powerful and functional formats, but they are optimized for different types of operations. For this reason, framing raster data as cubes and vector data as data frames can influence both how we, as analysts, think about our data and the analytical questions we ask with it.
As an example, let's consider what a data frame means in the context of spatial data. A data frame is a 'structure that organizes data into a 2-dimensional table of rows and columns' (cite). In many cases, this is an ideal format for vector data, and tools like GeoPandas enable a range of spatial operations that are built on top of this model (note:redundant with sentence in paragraph above?). geopandas.GeoDataFrames are similar to pandas.DataFrames: they have an index and a number of data columns, but they also have a geometry column which store shapely geometry objects. Users can efficiently store vector data with easily accessible attribute data and perform operations that generate knowledge about the relationships between spatial features and associated attribute data.

Pushing the limits of data frames¶
What about situations where the data frame format can be limiting (note: reword)? Imagine we have a collection of weather stations located throughout a river basin. Each weather station records data such as temperature, windspeed, and relative humidity. These are stored in the data columns of a geopandas.GeoDataFrame, while the coordinates of each weather station are stored as Point geometries in the geometry column. From this table, we can quickly access a lot of information and ask questions such as how do temperatures vary across the elevation range covered by the weather stations, and where are windspeeds highest? But, each time the weather station records a measurement, we get a new set of data for each variable. Within the structure of the GeoDataFrame, how should that new data be incorporated?
There are a number of strategies for organizing information in data frames such as wide form (the columns of this data frame are the cross-join of the implicit dimensions of the dataset) and long form (the number of columns is minimized while a row is added for every combination of dimensions) data frames [Pebesma, 2022]. These let the user store multidimensional data in a way that makes it easy to query and subset. However, the data frame column structure is still fundamentally one-dimensional and these strategies involve duplicating data along either the column or row dimension.
Returning to the weather station example, when we stack multiple temporal observations from weather stations into a data frame, we've added a dimension to the data. By adding columns or adding rows to incorporate the new data, we are essentially looking for ways to represent additional dimensions within the structure of a data frame. Vector datasets are frequently treated as 'flat' or where the spatial dimension is the only required functional dimension; but what happens when vector datasets contain additional dimensions like time? Contrast this to raster data cubes, where data is explicitly represented as multi-dimensional, and popular tools and data models reflect this fundamental shape. What would it look like, and how would our workflows change if vector data were represented by cube structures similarly to raster data? [reword]
Figure 1: Caption for both images
What are vector data cubes?¶
Vector data cubes are n-dimensional arrays with at least one dimension that is an array of geometries (Pebesma, 2022). Where the spatial domain of a raster object is a gridded extent (eg. rdc = {'x': [0,1,2,3,4,5,6,7,8,9], 'y':[0,1,2,3,4,5,6,7,8,9]}), the spatial domain of a vector data cube is an array of points, lines or polygons representing real-world features:
vdc = {'counties':
[<POLYGON ((-95.343 48.547, -95.341 48.715, -95.095 48.912, ...>,
<POLYGON ((-118.851 47.95, -118.847 48.479, -118.87 48.647, ...>,
<POLYGON ((-117.438 48.044,-117.541 47.79, -117.607 47.798,...>,
With vector data cubes, we can now work with n-dimensional geometric datasets! :start-struck
What can we do with vector data cubes?¶
Multi-dimensional representations of vector data open new directions of scientific inquiry and analysis. As discussed above, this data model streamlines analyses such as examining attribute data (such as temperature and relative humidity) variability over time (or another dimension). Similarly, one could observe how geospatial features change over time: tracking areal extents such as forest boundaries and agricultural parcels, or non-stationary monitoring systems such as ocean data buoys would be simplified by vector data cube representations.
A particularly exciting use-case is sampling gridded raster data at a number of points of interest, going from a raster data cube covering a large spatial extent to a vector data cube containing only the spatial areas of interest. The following example will demonstrate how to construct a vector data cube by sampling ERA5 climate model reanalysis data at a number of point locations representing cities across Europe. We will show both how to assemble this data into a cube and how the cube structure abridges analytical workflows aimed at answering a set of general scientific questions about the data.
Understanding climate variability in population centers across Europe¶
This example will demonstrate how the process of wrangling and querying data to answer a scientific question becomes more efficient and intuitive when we structure the data as a vector data cube. For this example, we are interested in understanding climatic conditions in cities across Europe using ERA5, a cliamte model reanalysis dataset. Climate models produced massive datasets with a number of physical variables that expand our understanding of the climate system beyond what is possible through direct monitoring and observation. ERA5 is a dataset that provides estimates for 32 climate variables for the entire globe for every 6-hour window from 1959 to 2022.
These are big files! Thankfully, we can access it from Google Cloud Storage without having to download the data to our local computer.
Assemble vector data cube¶
1 2 3 4 5 6 7 8 9 10 11 | #imports import fsspec import xarray as xr from datasets import load_dataset import pandas as pd import geopandas as gpd import cf_xarray import xvec import numpy as np import shapely import matplotlib.pyplot as plt |
1 | fs = fsspec.filesystem('gs') |
Read raster data¶
We will use an 'analysis-ready' dataset, which means that a lot of work preparing the dataset has been done for us. Read more about analysis-ready data and ERA5 processing here.
The ERA5 dataset is stored in Google Cloud Storage as a Zarr datacube. Use Xarray to read it into python:
1 2 3 4 5 | era5_ds = xr.open_zarr( 'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2', chunks={'time': 48}, consolidated=True, ).isel(level=0) |
This is a ton of data! :dizzy_face
1 | print(f'size: {era5_ds.nbytes / (1024 ** 4)} TiB') |
size: 6.257914529454865 TiB
1 | era5_ds.isel(time=-2)['2m_temperature'].plot(robust=True, figsize=(10,6)); |
Let's extract a shorter time series that will be easier to work with and specify only the data variables we'd like to use in our analysis.
1 | era5_ds_sub = era5_ds.sel(time=slice('2017-01','2022-01'))[['2m_temperature','u_component_of_wind']] |
Read vector data¶
ERA5 outputs are gridded datasets covering the entire globe. We only need data about cities in Europe, so continuing to store the global dataset is a pain and unnecessary. The following cell reads a vector dataset from Hugging Face and formats it as a gpd.GeoDataFrame containing data covering the European continent.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | cities = load_dataset('jamescalam/world-cities-geo',split='train') cities_df = pd.DataFrame({'city':cities['city'], 'country':cities['country'], 'region':cities['region'], 'latitude':cities['latitude'], 'longitude':cities['longitude'], 'continent':cities['continent'], 'x':cities['x'], 'y':cities['y'], 'z':cities['z']}) def lon_360(row): new_lon = ((360 + (row['longitude'] % 360)) % 360) return new_lon cities_df['lon_360'] = cities_df.apply(lon_360, axis=1) cities_eur = cities_df.loc[cities_df['continent'] == 'Europe'] cities_eur = gpd.GeoDataFrame( cities_eur, geometry = gpd.points_from_xy(cities_eur.longitude, cities_eur.latitude), crs = 'EPSG:4326' ) |
Repo card metadata block was not found. Setting CardData to empty.
Now we can visualize and query the spatial components of the vector dataset.
1 | cities_eur.explore() |
We can also take a look at the information stored in the attributes of the GeoDataFrame; it is all scalar, meaning that a dataframe is an ideal representation of this data.
1 | cities_eur.head() |
| city | country | region | latitude | longitude | continent | x | y | z | lon_360 | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | Tirana | Albania | Southern Europe | 41.327500 | 19.818890 | Europe | 4500.907258 | 1622.102741 | 4207.167403 | 19.818890 | POINT (19.81889 41.32750) |
| 85 | Durres | Albania | Southern Europe | 41.323056 | 19.441389 | Europe | 4511.804644 | 1592.521568 | 4206.796276 | 19.441389 | POINT (19.44139 41.32306) |
| 86 | Elbasan | Albania | Southern Europe | 41.112500 | 20.082222 | Europe | 4508.200233 | 1648.181036 | 4189.184997 | 20.082222 | POINT (20.08222 41.11250) |
| 87 | Vlore | Albania | Southern Europe | 40.466667 | 19.489721 | Europe | 4569.228960 | 1617.126398 | 4134.814376 | 19.489721 | POINT (19.48972 40.46667) |
| 88 | Shkoder | Albania | Southern Europe | 42.068283 | 19.512577 | Europe | 4457.868549 | 1579.715398 | 4268.670549 | 19.512577 | POINT (19.51258 42.06828) |
Create a vector data cube with 'cities' data¶
xr.DataArray from 'cities' gpd.GeoDataFrame
To answer our question, we need ERA5 data at the locations in the GeoDataFrame above. We also want to know how the variables change over time, so the new object should have a time dimension. As a first step, construct a xr.DataArray from the cities gpd.GeoDataFrame; this will create a data cube with a geometry dimension.
1 2 3 4 5 6 7 8 | europe_ds = xr.Dataset( { 'city': ('geometry', cities_eur['city']), 'country': ('geometry', cities_eur['country']), 'lat': ('geometry', cities_eur['latitude']), 'lon': ('geometry', cities_eur['lon_360']), }, coords= {'geometry': cities_eur['geometry']}) |
1 | europe_ds
|
<xarray.Dataset> Size: 114kB
Dimensions: (geometry: 2844)
Coordinates:
* geometry (geometry) object 23kB POINT (19.8188896 41.3275) ... POINT (30...
Data variables:
city (geometry) object 23kB 'Tirana' 'Durres' ... 'Shepetivka' 'Bucha'
country (geometry) object 23kB 'Albania' 'Albania' ... 'Ukraine' 'Ukraine'
lat (geometry) float64 23kB 41.33 41.32 41.11 ... 50.75 50.18 50.57
lon (geometry) float64 23kB 19.82 19.44 20.08 ... 33.47 27.07 30.22However, taking a look at this object, you'll see that its index is a PandasIndex, meaning that Xarray doesn't understand that the geom dimension represents spatial data. This is where Xvec comes in - Xvec is a library for working with vector data cubes in Xarray. It adds extensions to Xarray objects that facilitate the features (such as spatially-aware indexes, handling projections, and interoperability with GeoPandas) that make vector data cubes in Xarray possible.
Convert the PandasIndex above to a GeometryIndex using Xvec's set_geom_indexes() method. GeometryIndexes maintain the functionality of PandasIndexes but they also can perform indexing and filtering based on geometries and have attributes specifying a coordinate reference system.
1 | europe_ds = europe_ds.xvec.set_geom_indexes('geometry', crs=4326) |
1 | europe_ds.xindexes |
Indexes:
geometry GeometryIndex (crs=EPSG:4326)
We now have a vector datacube :confetti-ball ! Remember that a vector data cube is any n-dimensional array with a spatial dimension mapping to a set of vector geometries. The europe_ds object is currently a 1-dimensional vector data cube, meaning it is the simplest form of a vector data cube, and is functionally the same as a data frame. Let's incorporate additional data into this object to see the utility of representing it as a data cube.
Sample raster data cube with geometries from vector data cube¶
We've created an 'empty' vector data cube (europe_ds) storing data about European cities. This object has the geographic information needed to communicate with the xr.Dataset storing ERA5 data (era5_ds_sub). We now need to interpolate the raster data cube onto the coordinates of European cities. The ERA5 dataset is indexed along 'latitude' and 'longitude' dimensions, while the vector data cube is indexed along a 'geometry' dimension. To interpolate ERA5 data onto the coordinate grid of cities, we need to pass era5_ds_sub information it understands about the locations of the cities (ie. the latitude and longitude variables of europe_ds not the geom coordinate dimension). (note: last sentence redundant?)
1 | era5_ds_sub
|
<xarray.Dataset> Size: 61GB
Dimensions: (time: 7304, latitude: 721, longitude: 1440)
Coordinates:
* latitude (latitude) float32 3kB 90.0 89.75 89.5 ... -89.75 -90.0
level int64 8B 94256640801872
* longitude (longitude) float32 6kB 0.0 0.25 0.5 ... 359.5 359.8
* time (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
Data variables:
2m_temperature (time, latitude, longitude) float32 30GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>
u_component_of_wind (time, latitude, longitude) float32 30GB dask.array<chunksize=(28, 721, 1440), meta=np.ndarray>1 | europe_ds
|
<xarray.Dataset> Size: 114kB
Dimensions: (geometry: 2844)
Coordinates:
* geometry (geometry) object 23kB POINT (19.8188896 41.3275) ... POINT (30...
Data variables:
city (geometry) object 23kB 'Tirana' 'Durres' ... 'Shepetivka' 'Bucha'
country (geometry) object 23kB 'Albania' 'Albania' ... 'Ukraine' 'Ukraine'
lat (geometry) float64 23kB 41.33 41.32 41.11 ... 50.75 50.18 50.57
lon (geometry) float64 23kB 19.82 19.44 20.08 ... 33.47 27.07 30.22
Indexes:
geometry GeometryIndex (crs=EPSG:4326)1 2 | era5_europe_cities = era5_ds_sub.xvec.extract_points(europe_ds.geometry.data, x_coords = 'longitude', y_coords='latitude') |
Cool, now we have a 2-dimensional vector data cube! We went from a 3-d raster datacube with latitude and longitude dimensions to a 2-d datacube where the only spatial dimension is 'geometry' (each element of the geometry dimension is a different city from our original dataset). Our dataset is also much smaller now :thumbsup
1 | print(f'size: {era5_europe_cities.nbytes / (1024 ** 3)} GB') |
size: 0.15484336763620377 GB
1 | era5_europe_cities
|
<xarray.Dataset> Size: 166MB
Dimensions: (time: 7304, geometry: 2844)
Coordinates:
level int64 8B 94256640801872
* time (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
* geometry (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
2m_temperature (time, geometry) float32 83MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
u_component_of_wind (time, geometry) float32 83MB dask.array<chunksize=(28, 2844), meta=np.ndarray>
Indexes:
geometry GeometryIndex (crs=None)The above operation interpolated the ERA5 data onto the coordinates from europe_ds but in the process we lost the data variables describing the name and country of each city. Add those onto the interpolated vector data cube and drop the level coordinate variable, which we don't need.
1 2 3 | era5_europe_cities['city'] = europe_ds['city'] era5_europe_cities['country'] = europe_ds['country'] era5_europe_cities = era5_europe_cities.drop_vars('level') |
Up until now, we have been performing lazy operations, meaning that we haven't actually loaded the ERA5 data we accessed from Google Cloud Storage into memory on our local machine. We'll do this now so that we can perform computations that require the data in memory.
1 | from dask.diagnostics import ProgressBar |
1 2 | with ProgressBar(): era5_europe_cities = era5_europe_cities.load() |
[########################################] | 100% Completed | 15m 58s
1 | era5_europe_cities
|
<xarray.Dataset> Size: 166MB
Dimensions: (time: 7304, geometry: 2844)
Coordinates:
* time (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
* geometry (geometry) object 23kB POINT (19.8188896 41.3275) .....
Data variables:
2m_temperature (time, geometry) float32 83MB 271.9 278.4 ... nan nan
u_component_of_wind (time, geometry) float32 83MB 112.1 112.7 ... 41.82
city (geometry) object 23kB 'Tirana' 'Durres' ... 'Bucha'
country (geometry) object 23kB 'Albania' ... 'Ukraine'
Indexes:
geometry GeometryIndex (crs=EPSG:4326)1 2 3 | era5_europe_cities = era5_europe_cities.assign_coords({'city':era5_europe_cities['city'], 'country':era5_europe_cities['country'] }) |
1 | era5_europe_cities
|
<xarray.Dataset> Size: 166MB
Dimensions: (time: 7304, geometry: 2844)
Coordinates:
* time (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
* geometry (geometry) object 23kB POINT (19.8188896 41.3275) .....
city (geometry) object 23kB 'Tirana' 'Durres' ... 'Bucha'
country (geometry) object 23kB 'Albania' ... 'Ukraine'
Data variables:
2m_temperature (time, geometry) float32 83MB 271.9 278.4 ... nan nan
u_component_of_wind (time, geometry) float32 83MB 112.1 112.7 ... 41.82
Indexes:
geometry GeometryIndex (crs=EPSG:4326)Using a vector data cube¶
We've shown how to combine raster and vector data to create a vector data cube with two indexes (geometry and time). Now, we will demonstrate how to use this object in a simple example of a scientific analysis. This example focuses on operations that are likely found across a range of workflows such as:
- Using Xarray functionality to compute and reduce along geometric and non-geometric dimensions,
- Indexing and selection using geometric dimensions,
- Adjusting vector data cubes to use Xarray plotting methods, (take out)
- Converting vector data cubes to Geopandas
GeoDataFramesfor geographic visualizations.
1. Computation and grouping along a time dimension¶
We can use Xarray to group observations by season and compute seasonal mean estimates. This allows us to smoothly change the resolution of the time dimension while preserving the geometry dimension.
1 | era5_europe_cities_seasons = era5_europe_cities.groupby(era5_europe_cities['time'].dt.season).mean() |
1 | era5_europe_cities_seasons
|
<xarray.Dataset> Size: 159kB
Dimensions: (season: 4, geometry: 2844)
Coordinates:
* geometry (geometry) object 23kB POINT (19.8188896 41.3275) .....
city (geometry) object 23kB 'Tirana' 'Durres' ... 'Bucha'
country (geometry) object 23kB 'Albania' ... 'Ukraine'
* season (season) object 32B 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
2m_temperature (season, geometry) float32 46kB 280.6 283.1 ... 282.7
u_component_of_wind (season, geometry) float32 46kB 55.58 55.63 ... 47.79
Indexes:
geometry GeometryIndex (crs=EPSG:4326)2. Spatial indexing¶
We can use the vector data cube's spatially-aware indexes to select points based on geometry and proximity to other points. As an example, find all cities within 0.5 degree of the city with the highest single temperature estimate:
1 | max_temp_ob = era5_europe_cities.where(era5_europe_cities['2m_temperature'] == era5_europe_cities['2m_temperature'].max(), drop=True) |
1 2 3 4 | era5_europe_cities.xvec.query( 'geometry', max_temp_ob.geometry.data[0].buffer(0.5) ) |
<xarray.Dataset> Size: 175kB
Dimensions: (time: 7304, geometry: 2)
Coordinates:
* time (time) datetime64[ns] 58kB 2017-01-01 ... 2021-12-31...
* geometry (geometry) object 16B POINT (-16.2546158 28.4682386)...
city (geometry) object 16B 'Santa Cruz de Tenerife' 'La L...
country (geometry) object 16B 'Spain' 'Spain'
Data variables:
2m_temperature (time, geometry) float32 58kB 280.3 280.3 ... nan nan
u_component_of_wind (time, geometry) float32 58kB 20.11 20.11 ... -31.75
Indexes:
geometry GeometryIndex (crs=EPSG:4326)And fastest summer wind speed estimate:
1 | max_u_summer_ob = era5_europe_cities_seasons.where(era5_europe_cities_seasons['u_component_of_wind'] == era5_europe_cities_seasons['u_component_of_wind'].max(), drop=True) |
1 2 3 4 | max_summer_wind_buffer = era5_europe_cities.xvec.query( 'geometry', max_u_summer_ob.geometry.data[0].buffer(.5) ) |
3. Convert Xarray objects to geopandas GeoDataFrame¶
This is a case where converting the xr.Dataset to a gpd.GeoDataFrame using Xvec's to_geodataframe() method is very helpful for visualizing and interpreting results; it allows us to view the results of computation in Xarray with spatial context:
1 | era5_europe_cities_seasons['u_component_of_wind'].sel(season='DJF').xvec.to_geodataframe(geometry='geometry').explore('u_component_of_wind') |
Wrap up¶
Summary¶
Vector data cubes offer a streamlined way to manipulate and analyze n-dimensional data indexed along geometric dimensions. New features associated with vector data cubes include: spatial indexing, CRS-aware geometry dimensions, interoperability with GeoPandas, and serialization. We look forward to seeing how vector data cubes are leveraged by the scientific community and the novel analytical workflows that arise from this data model.
As shown above, there are a few rough edges remaining around interoperability between Xarray vector data cubes and existing Xarray methods. Stay tuned for more updates to vector data cubes in Xarray and check out [insert repo / issues page?] if you're interested in getting involved.
Up next¶
Implementing vector data cubes is the result of hard work and development across the open-source community by a number of individuals and groups working on a range of packages. It's exciting to see collaborative efforts in the open-source community that lead to significant breakthroughs and improvements to user experience. In our next blog post, we will highlight all of the pieces and groups that came together to make vector data cubes possible.
Acknowledgments¶
References¶
note: add refs from old draft
1 |