Installation and set up¶
To run the code locally, install the following tools and libraries at the command line.
Jupyter Notebook, if not already installed
- pip install notebook
Supporting libraries
- pip install pandas
- pip install geopandas
- pip install shapely
DuckDB and associated tools to improve its performance in the notebook
- pip install duckdb
- pip install jupysql
- pip install duckdb-engine
1 2 3 4 5 | # import our toolkit import pandas as pd import geopandas as gpd from shapely import wkt import duckdb |
1 2 3 4 5 6 | # install and load DuckDB extensions to work with spatial data and AWS %sql INSTALL spatial; %sql INSTALL httpfs; %sql LOAD spatial; %sql LOAD httpfs; %sql SET s3_region='us-west-2' |
The DuckDB documentation offers tips and examples for running DuckDB queries in Jupyter notebooks. In this example, we're using duckdb-engine and JupySQL. You can also connect to DuckDB natively.
1 2 3 4 | # load (or reload) jupysql to create SQL cells # no need to import duckdb_engine, JupySQL will auto-detect driver # load (or reload) jupysql Jupyter extension to create SQL cells %reload_ext sql |
1 2 3 4 | # configure cell output -> query to Pandas %config SqlMagic.autopandas = True %config SqlMagic.feedback = False %config SqlMagic.displaycon = False |
1 2 | # connection string %sql duckdb:///:memory: |
Getting Overture data¶
Now we're going to extract data from Overture's base theme for an area along the Gulf Coast. The magic %%sql command turns the notebook cell into a SQL cell and allows us to dump our query results in a Pandas DataFrame. Note: this query take a couple minutes to run.
1 2 3 4 5 6 7 8 9 10 11 12 | %%sql gulf_water << SELECT id, names.primary AS primary_name, ST_AsText(ST_GeomFromWKB(geometry)) as geometry FROM read_parquet('s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=water/*', filename=true, hive_partitioning=1) WHERE bbox.xmin >= -91.3994 and bbox.xmax <= -89.3864 and bbox.ymin >= 29.152 and bbox.ymax <= 30.5161 |
1 | gulf_water.head() |
Before we move on, let's deal with the geometry we pulled out of Overture's GeoParquet file. Geometries in GeoParquet are stored as well-known binary (WKB). In our query, we transformed that geometry into text. The next step is to convert it to a shapely geometry before we create the GeoDataFrame. Here's how we do that.
1 | gulf_water['geometry'] = gulf_water['geometry'].apply(wkt.loads) |
1 2 3 4 5 | # dataframe to geodataframe, set crs gulf_water_gdf = gpd.GeoDataFrame( gulf_water , geometry='geometry', crs="EPSG:4326" ) |
Let's work with just the water polygons and lines, not the points.
1 2 | #Apply a lambda to remove point geometries gulf_water_gdf = gulf_water_gdf[gulf_water_gdf['geometry'].apply(lambda x : x.geom_type!='Point' )] |
1 | gulf_water_gdf.plot(facecolor="#628290", edgecolor="#006064", lw=0.05) |
1 |