SIParCS presentation

15 mins

Title: Developing educational resources for cloud-computing of remote sensing data with xarray in support of open science

Motivation - why

  • recent increases in the availability of remote sensing data and computational resources
    • exciting because this could democratize participation in earth science (and spatial science more broadly) by reducing barriers to entry such as storage and computational resources
    • data, computing resources and access aren't enough on their own: domain knowledge about these datasets, as well as general data management, access and computing knowledge, is often specialized and not widely disseminated. For example, specialized knowledge may be passed down from advisor to advisee, or taught in small university classes. This is not enough to make use of the vast amount of remote sensing data being generated
    • to realize the scientific potential and public benefit of remote sensing data, we must scale the educational resources that accompany datasets, computational resources and analytical tools, in order to broaden the user base and promote open and reproducible scientific work
  • this brings us to xarray
    • most remote sensing data is in raster format, often large and complex datasets with variable metadata
      • xarray objects support multi-dimensional gridded data and incorporate various types of metadata, permitting label-based indexing, selection and grouping, as well as a suite of higher-level computational and visualization tools
    • xarray is a powerful tool for managing and analyzing this data
      • this work focuses on developing more educational resources and documentation to increase and ease the use of xarray for working with remote sensing data
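
As a minimal sketch of the label-based indexing described above (the array, its coordinates and the `backscatter` name are synthetic stand-ins invented for illustration, not one of the datasets used in this work):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a small raster time series (not real remote sensing data)
times = pd.date_range("2022-01-01", periods=4, freq="D")
y = np.array([30.0, 20.0, 10.0])
x = np.array([100.0, 110.0, 120.0, 130.0])
values = np.arange(4 * 3 * 4, dtype=float).reshape(4, 3, 4)

backscatter = xr.DataArray(
    values,
    dims=("time", "y", "x"),
    coords={"time": times, "y": y, "x": x},
    name="backscatter",
    attrs={"units": "dB", "note": "synthetic values for illustration"},
)

# Label-based selection: pick a date, then the pixel nearest a coordinate pair
one_day = backscatter.sel(time=pd.Timestamp("2022-01-02"))
pixel = one_day.sel(x=112.0, y=19.0, method="nearest")

# Label-based slicing along time, then a reduction over a named dimension
first_two_days_mean = backscatter.sel(
    time=slice("2022-01-01", "2022-01-02")
).mean(dim="time")
```

Selecting by coordinate label rather than by integer position keeps the code readable and tied to the metadata that travels with the array.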

Content - what/how

The goals for this summer were two-fold, focusing on developing my own skills, and generating educational resources to distribute more broadly:

  1. develop cloud-based and parallel computation skills
  • accessing cloud-hosted data stored in Amazon S3 buckets, Microsoft Azure ...
  • using cloud-computing resources like Microsoft Planetary Computer, ASF HyP3
  • becoming more proficient working with xarray, transitioning to more 'xarray-native'-like programming
    • improving skills and comfort using xarray functionality like .groupby(), .map() and .reduce() as well as ufuncs, aligning and broadcasting, weighted reductions and resampling along specific dimensions
    • incorporating parallelization in my workflows
      • transitioning to cloud-hosted data greatly increases the size of datasets that we can work with, which in turn introduces the need for parallelization in data processing and analysis workflows
      • using dask in local and cloud-based workflows, dask-gateway for distributed workflows in cloud-computing environments (Microsoft Planetary Computer)
    • becoming more comfortable using version control (git) in my day-to-day workflows and collaborating with others via GitHub
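
A short sketch of a few of the xarray operations named above (grouping, `.map()`, resampling and weighted reductions). The `velocity` array is synthetic, and the cosine-of-latitude weights are only an assumed, simple proxy for grid-cell area:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily data on a tiny lat/lon grid (illustrative only)
time = pd.date_range("2022-01-01", "2022-03-31", freq="D")  # 90 days
lat = np.array([0.0, 60.0])
lon = np.array([10.0, 20.0])
data = np.ones((time.size, lat.size, lon.size)) * np.arange(time.size)[:, None, None]

da = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="velocity",
)

# Split-apply-combine: mean for each calendar month
monthly_mean = da.groupby("time.month").mean(dim="time")

# .map() applies a function to each group; here, anomalies from the monthly mean
anomalies = da.groupby("time.month").map(lambda g: g - g.mean(dim="time"))

# Resampling along the time dimension: 10-day means
ten_day_mean = da.resample(time="10D").mean()

# Weighted reduction: weight each cell by cos(latitude)
weights = np.cos(np.deg2rad(da.lat))
weighted_mean = da.weighted(weights).mean(dim=("lat", "lon"))
```

Each of these replaces a loop that would otherwise be written by hand, and the same calls can also be applied to dask-backed arrays.
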
  2. increase / improve documentation and educational resources
  • these are the resources I've produced or contributed to this summer, incorporating the xarray and cloud computing skills that I gained (reverse the order of these?)

a) Jupyter book tutorials demonstrating workflows with specific cloud-hosted remote sensing datasets

  1) ITS_LIVE book: a book containing several Jupyter notebooks that demonstrate working with a global land ice velocity dataset in Zarr datacube format, stored in AWS S3 buckets. Notebooks include:
    • dataset inspection and re-organizing
    • exploratory data analysis at multiple spatial and temporal scales
  2) Sentinel1 RTC book
    • data: Sentinel1 Radiometrically Terrain Corrected (RTC) Synthetic Aperture Radar (SAR) imagery processed and hosted on Microsoft Planetary Computer
    • accessing, organizing and preliminary analysis of this dataset, with a focus on analyzing temporal variability in backscatter over proglacial lakes in the Himalaya

  • In developing these tutorials, I tried to imagine what kind of resources would have helped me when I was first starting to work with remote sensing data and xarray. I also found it helpful to imagine I was explaining the workflow to a friend who works in a different field. For this reason, I emphasized the use of narrative text to explain concepts, and included many links to documentation and other useful online resources. Another aspect of most (of my) workflows that I wanted to emphasize in the tutorials was errors, and the best ways to troubleshoot and understand them. Many of the tutorials I see online show code that works, which is very helpful but is often less helpful for messy data or when things don't go quite right. Many of the online resources and examples that I've found most helpful are ones that show errors and how to move past them. I tried to keep many of the errors that I encountered while developing the notebooks and include them in the final version, with an explanation of each error's meaning and its solution.
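
As an illustration of the "show the error, then the fix" approach described above, here is a small sketch with a synthetic 1-D array. It is not taken from the notebooks themselves. Selecting a coordinate value that is not exactly on the grid raises a `KeyError`, and one way past it is to ask for the nearest grid point:

```python
import numpy as np
import xarray as xr

# Synthetic 1-D transect with a coordinate every 10 units (illustrative only)
da = xr.DataArray(
    np.array([1.0, 2.0, 3.0, 4.0]),
    dims="x",
    coords={"x": [0.0, 10.0, 20.0, 30.0]},
)

# Asking for a label that is not in the index raises a KeyError
try:
    da.sel(x=12.0)
    error_message = None
except KeyError as err:
    error_message = f"KeyError: {err}"

# One way past the error: explicitly request the nearest grid point
nearest = da.sel(x=12.0, method="nearest")
```

Keeping the failing call in the notebook, next to the explanation and the working call, lets a reader recognize the same message when it appears in their own workflow.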

b) contribution to xarray-dev tutorial

  • data cleaning example of taking a time series of gridded data with multiple variables and organizing it into a 'data cube' format of 3-dimensional objects with x, y, and time coordinates, with variables existing along each dimension
  • this work included submission of a new example dataset to pydata/xarray-data
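
A rough sketch of this kind of re-organization, assuming the input is one 2-D scene per acquisition date. The `make_scene` helper, the `vv`/`vh` variable names and the dates are hypothetical and are not the dataset submitted to pydata/xarray-data:

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_scene(seed):
    """Hypothetical helper: build one 2-D scene with two variables."""
    rng = np.random.default_rng(seed)
    return xr.Dataset(
        {
            "vv": (("y", "x"), rng.random((3, 4))),
            "vh": (("y", "x"), rng.random((3, 4))),
        },
        coords={"y": [30.0, 20.0, 10.0], "x": [0.0, 10.0, 20.0, 30.0]},
    )


# Made-up acquisition dates, one per scene
dates = pd.to_datetime(["2022-06-01", "2022-06-13", "2022-06-25"])
scenes = [make_scene(i) for i in range(len(dates))]

# Stack the 2-D scenes along a new, labeled time dimension
cube = xr.concat(scenes, dim=pd.Index(dates, name="time"))

# Put the dimensions in a conventional (time, y, x) order
cube = cube.transpose("time", "y", "x")
```

Once the scenes share a labeled time dimension, every variable in `cube` can be selected, grouped or reduced along time, y or x with the same calls.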

c) Pull requests to pydata/xarray repository

  • submitted PRs to the main xarray repository that improve documentation for existing xarray methods by adding examples of different syntax and use cases

d) assisted with the xarray tutorial at the SciPy 2022 conference in Austin, TX (July 2022)

What did I learn?

  • collaborating via git/GitHub - this was huge for me! It's something I'd kind of avoided doing in my work in the past out of hesitation and a bit of intimidation. Being able to get accustomed to this work style through contributing to projects with my mentors was invaluable, and helped me to figure out the nuances of different GitHub situations etc.
  • 'xarray-native' programming - this was really fun! My work in the past had used xarray but not taken full advantage of it. I ended up writing a lot of things by hand that xarray already provides, either because I didn't know those methods existed or because I didn't understand how to incorporate them into my work
  • this summer I had many opportunities to take code I had written for a specific purpose and break it down to see how I could accomplish the same task within the xarray framework. It was really nice to have the time to prioritize this, rather than just moving on to the next task, and to receive feedback and advice from my mentors along the way. There's a very useful video of an xarray tutorial on YouTube called 'Thinking like xarray', and this summer I feel like I finally started to think like xarray (still improving here). It has helped me to reframe my approach to these workflows, and it has been fun!
  • cloud computing
    • this was all new to me, and I felt slightly intimidated about diving into this realm and understanding the ecosystem of different cloud hosting platforms, the various ways to query and access catalogs of cloud-hosted data and then make use of cloud-based computational resources. Diving into this world this summer has been exciting and fun
  • open source and open science community

Conclusions - what's next

  • continue to share my work and receive feedback from others
  • incorporate a few more 'chapters' in the Sentinel1 book (these are in progress)
  • share work in an Open Science session at AGU
  • present one of the books (Sentinel1 RTC imagery) at the NISAR science conference, August 2022
  • publish Jupyter books in the Journal of Open Source Education?
  • Pangeo / xarray blog posts
  • continue to contribute to open source packages!