STAC objects -> data containers
Hi all,
I'm tentatively making a pitch to add convenience methods for converting pystac objects (Asset, Item, ItemCollection, ...) to commonly used data containers (xarray.Dataset, geopandas.GeoDataFrame, pandas.DataFrame, etc.).
I'm opening this in pystac since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and pystac owns the namespaces I care about. You can already do everything I'm showing today without any changes to pystac, but it feels less nice. I really think that pd.read_csv is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy as using pd.read_csv.
Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to to_dict().
A couple of hypothetical examples, to give an idea:
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
ds
Or building a datacube from a pystac-client search (which subclasses pystac).
ds = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()  # ItemCollection
    .to_xarray()
)
ds
Implementation details
This would be optional. pystac would not add required dependencies on pandas, xarray, etc. It would merely provide the methods Item.to_xarray, Asset.to_xarray, ... Internally, those methods would try to import the implementation and raise an ImportError if the optional dependencies aren't met at runtime.
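As a rough sketch of that optional-import idea: the helper `import_optional`, the class `SketchAsset`, and the wording of the error message are all made up for illustration and are not pystac API. The sketch only assumes that the asset has an `href` and that `xarray.open_dataset` can read it.

```python
import importlib
from types import ModuleType


def import_optional(name: str, purpose: str) -> ModuleType:
    """Import an optional dependency, or raise an ImportError that says why it is needed."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"'{name}' is required for {purpose}. "
            f"Install it (for example with `pip install {name}`) and try again."
        ) from err


class SketchAsset:
    """Hypothetical stand-in for pystac.Asset; only `href` is modelled."""

    def __init__(self, href: str) -> None:
        self.href = href

    def to_xarray(self, **kwargs):
        # xarray is imported only when the method is called, so importing
        # this module never requires xarray to be installed.
        xr = import_optional("xarray", "SketchAsset.to_xarray()")
        return xr.open_dataset(self.href, **kwargs)
```

With this shape, users without xarray can still import and use everything else; they only see the ImportError when they actually call `to_xarray()`.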
Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). pystac certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an engine argument).
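One way the engine dispatch could be wired up is sketched below. The `ENGINES` registry, the function `items_to_xarray`, and the engine names are hypothetical; the only real library calls are `stackstac.stack` and `odc.stac.load`, each imported lazily so that neither library is required unless its engine is selected.

```python
from typing import Any, Callable, Dict, Iterable


def _load_with_stackstac(items: Iterable[Any], **kwargs: Any) -> Any:
    import stackstac  # optional dependency, imported only when this engine is used

    # stackstac.stack returns an xarray.DataArray
    return stackstac.stack(items, **kwargs)


def _load_with_odc_stac(items: Iterable[Any], **kwargs: Any) -> Any:
    import odc.stac  # optional dependency, imported only when this engine is used

    # odc.stac.load returns an xarray.Dataset
    return odc.stac.load(items, **kwargs)


# Hypothetical registry mapping an engine name to a loader function.
ENGINES: Dict[str, Callable[..., Any]] = {
    "stackstac": _load_with_stackstac,
    "odc-stac": _load_with_odc_stac,
}


def items_to_xarray(items: Iterable[Any], engine: str = "stackstac", **kwargs: Any) -> Any:
    """Dispatch an ItemCollection -> xarray conversion to the chosen engine."""
    try:
        loader = ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"Unknown engine {engine!r}; expected one of {sorted(ENGINES)}"
        ) from None
    return loader(items, **kwargs)
```

Note that the two libraries return different container types (a DataArray versus a Dataset) and accept different keyword arguments, so a real implementation would still have to decide how much of that difference to paper over.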
Other conversions, like Asset -> Zarr, are so straightforward they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63). Maybe those could live in pystac; I'd be happy to maintain them.
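To give a feel for how small the Asset -> Zarr case can be, here is a minimal sketch. It assumes the asset carries the xarray-assets extension fields `xarray:open_kwargs` and `xarray:storage_options`, and it is not necessarily what the linked prototype does; the function names are hypothetical.

```python
from typing import Any, Dict


def zarr_open_kwargs(extra_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Collect keyword arguments for xarray.open_zarr from an asset's extra fields."""
    kwargs = dict(extra_fields.get("xarray:open_kwargs", {}))
    storage_options = extra_fields.get("xarray:storage_options")
    if storage_options is not None:
        # Do not overwrite storage options already given in open_kwargs.
        kwargs.setdefault("storage_options", storage_options)
    return kwargs


def zarr_asset_to_xarray(asset: Any, **overrides: Any):
    """Open a Zarr asset lazily; `asset` needs `href` and `extra_fields` attributes."""
    import xarray as xr  # optional dependency, imported only when called

    kwargs = {**zarr_open_kwargs(asset.extra_fields), **overrides}
    return xr.open_zarr(asset.href, **kwargs)
```

Keeping the keyword-argument handling in its own function means it can be checked without xarray installed or any network access.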
Problems
A non-exhaustive list of reasons not to do this:
- It's not strictly necessary: You can do all this today, with some effort.
- It's a can of worms: Why to_xarray and not to_numpy(), to_PIL, ...? Why to_pandas() and not to_spark(), to_modin, ...?
Alternatives
Alternatively, we could recommend using intake, along with intake-stac, which would wrap pystac-client and pystac. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.
Examples
A whole bunch of examples, to give some ideas of the various conversions. You'll notice a pattern.
catalog -> collection -> item -> asset -> xarray (raster)
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
catalog -> collection -> item -> asset -> xarray (zarr)
ds = (
    catalog
    .get_collection("cil-gdpcir-cc0")
    .get_item("cil-gdpcir-INM-INM-CM5-0-ssp585-r1i1p1f1-day")
    .assets["pr"]
    .to_xarray()
)
catalog -> collection -> item -> asset -> xarray (references)
ds = (
    catalog
    .get_collection("deltares-floods")
    .get_item("NASADEM-90m-2050-0010")
    .assets["index"]
    .to_xarray()
)
catalog -> collection -> item -> asset -> geodataframe
df = (
    catalog
    .get_collection("us-census")
    .get_item("2020-cb_2020_us_tbg_500k")
    .assets["data"]
    .to_geopandas()
)
df.head()
!pip uninstall -y pystac-client staccontainers planetary-computer
!pip install -q git+https://github.com/TomAugspurger/pystac-client@feature/sign
!pip install -q git+https://github.com/microsoft/planetary-computer-sdk-for-python
!pip install -q git+https://github.com/TomAugspurger/staccontainers
Found existing installation: pystac-client 0.4.0
Uninstalling pystac-client-0.4.0:
  Successfully uninstalled pystac-client-0.4.0
Found existing installation: staccontainers 0.1.0
Uninstalling staccontainers-0.1.0:
  Successfully uninstalled staccontainers-0.1.0
Found existing installation: planetary-computer 0.4.7
Uninstalling planetary-computer-0.4.7:
  Successfully uninstalled planetary-computer-0.4.7
import pystac_client
import planetary_computer
from staccontainers import *

bbox = [9.4, 0, 9.5, 1]
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    sign_function=planetary_computer.sign
)
Asset -> xarray (raster)
# catalog -> item -> asset -> xarray (raster)
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
ds
<xarray.Dataset>
Dimensions:      (band: 1, x: 10980, y: 10980)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05 6.098e+05
  * y            (y) float64 9.1e+06 9.1e+06 9.1e+06 ... 8.99e+06 8.99e+06
    spatial_ref  int64 0
Data variables:
    band_data    (band, y, x) float32 ...
Asset -> xarray (zarr)
# catalog -> item -> asset -> xarray (zarr)
ds = (
    catalog
    .get_collection("cil-gdpcir-cc0")
    .get_item("cil-gdpcir-INM-INM-CM5-0-ssp585-r1i1p1f1-day")
    .assets["pr"]
    .to_xarray()
)
ds
<xarray.Dataset>
Dimensions:  (lat: 720, lon: 1440, time: 31390)
Coordinates:
  * lat      (lat) float64 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
  * time     (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
Data variables:
    pr       (time, lat, lon) float64 dask.array<chunksize=(365, 360, 360), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                 CF-1.7 CMIP-6.2
    activity_id:                 ScenarioMIP
    contact:                     climatesci@rhg.com
    creation_date:               2019-07-23T13:02:14Z
    data_specs_version:          01.00.29
    dc6_bias_correction_method:  Quantile Delta Method (QDM)
    ...                          ...
    sub_experiment_id:           none
    table_id:                    day
    tracking_id:                 hdl:21.14100/ba34d30b-fca8-4737-887f-344ec5...
    variable_id:                 pr
    variant_label:               r1i1p1f1
    version_id:                  v20190724
Asset -> xarray (references)
# catalog -> item -> asset -> xarray (references)
ds = (
    catalog
    .get_collection("deltares-floods")
    .get_item("NASADEM-90m-2050-0010")
    .assets["index"]
    .to_xarray()
)
ds
<xarray.Dataset>
Dimensions:     (time: 1, lat: 216000, lon: 432000)
Coordinates:
  * lat         (lat) float64 -90.0 -90.0 -90.0 -90.0 ... 90.0 90.0 90.0 90.0
  * lon         (lon) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time        (time) datetime64[ns] 2010-01-01
Data variables:
    inun        (time, lat, lon) float32 dask.array<chunksize=(1, 600, 600), meta=np.ndarray>
    projection  object ...
Attributes:
    Conventions:  CF-1.6
    config_file:  /mnt/globalRuns/watermask_post_NASA90m_rest/run_rp0010_slr2...
    institution:  Deltares
    project:      Microsoft Planetary Computer - Global Flood Maps
    references:   https://www.deltares.nl/en/
    source:       Global Tide and Surge Model v3.0 - ERA5
    title:        GFM - NASA DEM 90m - 2050 slr - 0010-year return level
Asset -> geopandas
df = (
    catalog
    .get_collection("us-census")
    .get_item("2020-cb_2020_us_tbg_500k")
    .assets["data"]
    .to_geopandas()
)
df.head()
 | AIANNHCE | TTRACTCE | TBLKGPCE | AFFGEOID | GEOID | NAMELSAD | LSAD | ALAND | AWATER | geometry |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2430 | T03700 | C | 2580000US2430T03700C | 2430T03700C | Tribal Block Group C | IB | 3945195 | 0 | POLYGON ((-111.26008 36.10715, -111.25910 36.1... |
1 | 20 | T00400 | B | 2580000US0020T00400B | 0020T00400B | Tribal Block Group B | IB | 1200584 | 100165 | POLYGON ((-116.47052 33.78691, -116.46940 33.7... |
2 | 1150 | T00100 | C | 2580000US1150T00100C | 1150T00100C | Tribal Block Group C | IB | 654354613 | 2911122 | MULTIPOLYGON (((-108.90981 47.91399, -108.8883... |
3 | 2555 | T01000 | A | 2580000US2555T01000A | 2555T01000A | Tribal Block Group A | IB | 39634390 | 4216784 | POLYGON ((-75.91155 43.00678, -75.90228 43.006... |
4 | 275 | T00100 | A | 2580000US0275T00100A | 0275T00100A | Tribal Block Group A | IB | 482651 | 0 | POLYGON ((-122.88954 39.02367, -122.88639 39.0... |
Asset -> dask_geopandas
ddf = (
    catalog
    .get_collection("ms-buildings")
    .get_item("Germany_2022-06-14")
    .assets["data"]
    .to_dask_geopandas()
)
ddf
 | geometry | RegionName |
---|---|---|
npartitions=13 | | |
 | geometry | category[known] |
 | ... | ... |
... | ... | ... |
 | ... | ... |
 | ... | ... |
ItemCollection -> xarray
ic = (
    catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=[9.4, 0, 9.5, 1]
    )
)

ds = ic.to_xarray(assets=["B02", "B03"], epsg=32732)
ds
<xarray.Dataset>
Dimensions:                      (time: 100, y: 30984, x: 10981, band: 2)
Coordinates: (12/43)
  * time                         (time) datetime64[ns] 2022-01-26...
    id                           (time) <U54 'S2A_MSIL2A_20220126...
  * x                            (x) float64 5e+05 ... 6.098e+05
  * y                            (y) float64 1.02e+07 ... 9.89e+06
    s2:unclassified_percentage   (time) float64 4.19 4.19 ... 0.3259
    s2:not_vegetated_percentage  (time) float64 2.283 ... 0.321
    ...                          ...
    gsd                          float64 10.0
    proj:shape                   object {10980}
    common_name                  (band) <U5 'blue' 'green'
    center_wavelength            (band) float64 0.49 0.56
    full_width_half_max          (band) float64 0.098 0.045
    epsg                         int64 32732
Dimensions without coordinates: band
Data variables:
    B02                          (time, y, x) float64 dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    B03                          (time, y, x) float64 dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
Attributes:
    spec:           RasterSpec(epsg=32732, bounds=(499979.99999708973, 989020...
    crs:            epsg:32732
    transform:      | 10.00, 0.00, 499980.00|\n| 0.00,-10.00, 10200040.00|\n|...
    resolution_xy:  (9.999999999941792, 10.0)
Search -> xarray
ds = (
    catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=[9.4, 0, 9.5, 1]
    ).to_xarray(assets=["B02", "B03"], epsg=32732)
)
ds
<xarray.Dataset>
Dimensions:                      (time: 100, y: 30984, x: 10981, band: 2)
Coordinates: (12/43)
  * time                         (time) datetime64[ns] 2022-01-26...
    id                           (time) <U54 'S2A_MSIL2A_20220126...
  * x                            (x) float64 5e+05 ... 6.098e+05
  * y                            (y) float64 1.02e+07 ... 9.89e+06
    s2:unclassified_percentage   (time) float64 4.19 4.19 ... 0.3259
    s2:not_vegetated_percentage  (time) float64 2.283 ... 0.321
    ...                          ...
    gsd                          float64 10.0
    proj:shape                   object {10980}
    common_name                  (band) <U5 'blue' 'green'
    center_wavelength            (band) float64 0.49 0.56
    full_width_half_max          (band) float64 0.098 0.045
    epsg                         int64 32732
Dimensions without coordinates: band
Data variables:
    B02                          (time, y, x) float64 dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    B03                          (time, y, x) float64 dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
Attributes:
    spec:           RasterSpec(epsg=32732, bounds=(499979.99999708973, 989020...
    crs:            epsg:32732
    transform:      | 10.00, 0.00, 499980.00|\n| 0.00,-10.00, 10200040.00|\n|...
    resolution_xy:  (9.999999999941792, 10.0)
Search / ItemCollection -> geopandas
df = catalog.search(collections=["sentinel-2-l2a"], bbox=[9.4, 0, 9.5, 1]).to_geopandas()
df
 | type | stac_version | id | geometry | links | bbox | stac_extensions | collection | datetime | platform | ... | assets.datastrip-metadata.roles | assets.tilejson.href | assets.tilejson.type | assets.tilejson.title | assets.tilejson.roles | assets.rendered_preview.href | assets.rendered_preview.type | assets.rendered_preview.title | assets.rendered_preview.rel | assets.rendered_preview.roles |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Feature | 1.0.0 | S2B_MSIL2A_20220710T093039_R136_T32NNG_2022071... | POLYGON ((8.99982 1.80982, 9.98700 1.80955, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982018, 0.81630719, 9.98700503, 1.80981859] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-07-10T09:30:39.024000Z | Sentinel-2B | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
1 | Feature | 1.0.0 | S2B_MSIL2A_20220710T093039_R136_T32NNF_2022071... | POLYGON ((8.99982 0.90491, 9.98664 0.90478, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982024, -0.08848273, 9.98663827, 0.90491156] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-07-10T09:30:39.024000Z | Sentinel-2B | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
2 | Feature | 1.0.0 | S2B_MSIL2A_20220710T093039_R136_T32MNE_2022071... | POLYGON ((8.99982 0.00000, 9.98652 0.00000, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982024, -0.99339404, 9.98666334, 0.0] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-07-10T09:30:39.024000Z | Sentinel-2B | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
3 | Feature | 1.0.0 | S2A_MSIL2A_20220705T093051_R136_T32NNG_2022070... | POLYGON ((8.99982 1.80982, 9.98700 1.80955, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982018, 0.81630719, 9.98700503, 1.80981859] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-07-05T09:30:51.025000Z | Sentinel-2A | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
4 | Feature | 1.0.0 | S2A_MSIL2A_20220705T093051_R136_T32NNF_2022070... | POLYGON ((8.99982 0.90491, 9.98664 0.90478, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982024, -0.08848273, 9.98663827, 0.90491156] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-07-05T09:30:51.025000Z | Sentinel-2A | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | Feature | 1.0.0 | S2B_MSIL2A_20220131T093119_R136_T32NNG_2022021... | POLYGON ((8.99982 1.80982, 9.98700 1.80955, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982018, 0.81630719, 9.98700503, 1.80981859] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-01-31T09:31:19.024000Z | Sentinel-2B | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
96 | Feature | 1.0.0 | S2B_MSIL2A_20220131T093119_R136_T32NNF_2022021... | POLYGON ((8.99982 0.90491, 9.98664 0.90478, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982024, -0.08848273, 9.98663827, 0.90491156] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-01-31T09:31:19.024000Z | Sentinel-2B | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
97 | Feature | 1.0.0 | S2B_MSIL2A_20220131T093119_R136_T32MNE_2022021... | POLYGON ((8.99982 0.00000, 9.98652 0.00000, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982024, -0.99339404, 9.98666334, 0.0] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-01-31T09:31:19.024000Z | Sentinel-2B | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
98 | Feature | 1.0.0 | S2A_MSIL2A_20220126T093251_R136_T32NNG_2022022... | POLYGON ((8.99982 1.80982, 9.98700 1.80955, 9.... | [{'rel': 'collection', 'href': 'https://planet... | [8.99982018, 0.81630719, 9.98700503, 1.80981859] | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-01-26T09:32:51.024000Z | Sentinel-2A | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
99 | Feature | 1.0.0 | S2A_MSIL2A_20220126T093251_R136_T32NNG_2022021... | POLYGON ((8.99982 1.80982, 9.98700 1.80955, 9.... | [{'rel': 'collection', 'href': 'https://planet... | NaN | [https://stac-extensions.github.io/eo/v1.0.0/s... | sentinel-2-l2a | 2022-01-26T09:32:51.024000Z | Sentinel-2A | ... | [metadata] | https://planetarycomputer.microsoft.com/api/da... | application/json | TileJSON with default rendering | [tiles] | https://planetarycomputer.microsoft.com/api/da... | image/png | Rendered preview | preview | [overview] |
100 rows × 215 columns