earthaccess and the cloud, the force awakens
Luis A. López
NSIDC, NASA Openscapes
NASA ESDS Tech Spotlight, Feb 2024
Art: Allison Horst
Notes:
Disclaimer: Opinions expressed here are solely my own and do not express the views or opinions of my employer.
This notebook runs in a local, experimental environment; the changes it relies on have not been pushed to the main earthaccess repository, so it is not expected to run as-is until they land there.
This notebook focuses on the Pangeo ecosystem; since Python is the dominant language, it is safe to assume that plenty of researchers use these technologies.
This will be a pop-culture-driven presentation; expect memes and a millennial sense of humor.
1. In the cloud era, what can earthaccess do for us?
- Access remote files, automatically handling authentication and serialization
  - NASA Openscapes cookbook examples
- Generate an on-the-fly Zarr-compatible cache with Kerchunk! (a rough sketch follows this list)
  - Presented at the Dask Demo Day, August 2023
- Smart Access ◀
  - Sneak peek today, more details at SciPy 2024!
- Scale out workflows with Dask ◀
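For reference, the Kerchunk item above follows a now-common pattern: index the byte ranges of an archival HDF5 file into a Zarr-compatible set of references and open it lazily with xarray. The sketch below is a hedged, generic example of that pattern (kerchunk's SingleHdf5ToZarr plus fsspec's reference filesystem), not the exact code from the Dask Demo Day, and the auth options the reference store needs at read time are glossed over.

```python
# Hedged sketch of the Kerchunk pattern, not the exact Dask Demo Day code.
# Assumes kerchunk and fsspec are installed and Earthdata credentials exist.
import earthaccess
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

earthaccess.login()
results = earthaccess.search_data(short_name="ATL03", cloud_hosted=True, count=1)

url = results[0].data_links()[0]      # HTTPS link to the granule
f = earthaccess.open(results)[0]      # authenticated, file-like object

# Index the HDF5 chunks into a Zarr-compatible set of byte-range references
refs = SingleHdf5ToZarr(f, url).translate()

# The references act as a read-only Zarr store; in practice the storage
# options also need to carry Earthdata auth for the remote reads (omitted).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    group="/gt1l/heights",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "https"},
    },
)
```

From there the granule behaves like a Zarr store: lazy, chunk-aware reads without converting or re-hosting the original HDF5.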
```python
import earthaccess
import hvplot.xarray
import xarray as xr
import time

auth = earthaccess.login()
```
```python
datasets = ["ATL03"]

results = earthaccess.search_data(
    short_name=datasets,
    cloud_hosted=True,
    temporal=("2022-01", "2023-10"),
    bounding_box=(34.44, 31.52, 34.48, 31.54),
    count=10
)
```
Granules found: 20
```python
results[3]
```
```python
ordered = sorted(results, key=lambda d: d.size())
[(f"{round(g.size(), 2)} MB", g["meta"]["native-id"]) for g in ordered]
```
```
[('349.38 MB', 'ATL03_20220208031618_07311402_006_01.h5'),
 ('663.88 MB', 'ATL03_20220808183613_07311602_006_01.h5'),
 ('665.11 MB', 'ATL03_20221107141555_07311702_006_01.h5'),
 ('779.02 MB', 'ATL03_20220509225606_07311502_006_01.h5'),
 ('800.61 MB', 'ATL03_20230807011436_07312002_006_01.h5'),
 ('1115.44 MB', 'ATL03_20220926042045_00831706_006_01.h5'),
 ('1584.02 MB', 'ATL03_20230625151938_00832006_006_01.h5'),
 ('2042.96 MB', 'ATL03_20220828054448_10281606_006_01.h5'),
 ('2634.99 MB', 'ATL03_20230508053526_07311902_006_01.h5'),
 ('3586.49 MB', 'ATL03_20220529100448_10281506_006_01.h5')]
```
2. What "cloud native workflows" means?¶
When we run cloud-native workflows, we don't replicate the archives (e.g. COGs or Zarr stores) locally. Instead, this is what usually happens:
```python
import xarray as xr

# We open our data and we live happily ever after
ds = xr.open_dataset(some_zarr_store, engine="zarr")
```
Most mission data is not in COGs or Zarr, but fear not: earthaccess FTW!
Closed Platforms vs. Open Architectures for Cloud-Native Earth System Analytics
Credit: Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science [figure]. https://doi.org/10.6084/m9.figshare.11987466.v1
```python
# we open the smallest file by size
fo_normal = earthaccess.open(ordered[0:1])[0]
```
Opening 1 granules, approx size: 0.34 GB
```python
%%time
start_time = time.time()

ds_normal = xr.open_dataset(
    fo_normal,
    group="/gt1l/heights",
    engine="h5netcdf")

t_normal = round(time.time() - start_time, 2)
print(f"Opening the data (out-of-region) with xarray took {t_normal} seconds")
ds_normal
```
```
Opening the data (out-of-region) with xarray took 953.88 seconds
CPU times: user 19.8 s, sys: 3.6 s, total: 23.4 s
Wall time: 15min 53s

<xarray.Dataset>
Dimensions:         (delta_time: 3376706, ds_surf_type: 5)
Coordinates:
  * delta_time      (delta_time) datetime64[ns] 2022-02-08T03:16:18.010370560...
    lat_ph          (delta_time) float64 ...
    lon_ph          (delta_time) float64 ...
Dimensions without coordinates: ds_surf_type
Data variables:
    dist_ph_across  (delta_time) float32 ...
    dist_ph_along   (delta_time) float32 ...
    h_ph            (delta_time) float32 ...
    pce_mframe_cnt  (delta_time) uint32 ...
    ph_id_channel   (delta_time) uint8 ...
    ph_id_count     (delta_time) uint8 ...
    ph_id_pulse     (delta_time) uint8 ...
    quality_ph      (delta_time) int8 ...
    signal_conf_ph  (delta_time, ds_surf_type) int8 ...
    weight_ph       (delta_time) uint8 ...
Attributes:
    Description:  Contains arrays of the parameters for each received photon.
    data_rate:    Data are stored at the photon detection rate.
```
3. Let's use the force.
```python
earthaccess.login()
fo_smart = earthaccess.open(ordered[0:1], smart=True)[0]
```
Opening 1 granules, approx size: 0.34 GB
```python
%%time
start_time = time.time()

ds_smart = xr.open_dataset(
    fo_smart,
    group="/gt1l/heights",
    engine="h5netcdf")

t_smart = round(time.time() - start_time, 2)
print(f"Opening the data (out-of-region) with xarray took {t_smart} seconds with earthaccess smart open")
ds_smart
```
```
Opening the data (out-of-region) with xarray took 31.42 seconds with earthaccess smart open
CPU times: user 1.12 s, sys: 182 ms, total: 1.3 s
Wall time: 31.4 s

<xarray.Dataset>
Dimensions:         (delta_time: 3376706, ds_surf_type: 5)
Coordinates:
  * delta_time      (delta_time) datetime64[ns] 2022-02-08T03:16:18.010370560...
    lat_ph          (delta_time) float64 ...
    lon_ph          (delta_time) float64 ...
Dimensions without coordinates: ds_surf_type
Data variables:
    dist_ph_across  (delta_time) float32 ...
    dist_ph_along   (delta_time) float32 ...
    h_ph            (delta_time) float32 ...
    pce_mframe_cnt  (delta_time) uint32 ...
    ph_id_channel   (delta_time) uint8 ...
    ph_id_count     (delta_time) uint8 ...
    ph_id_pulse     (delta_time) uint8 ...
    quality_ph      (delta_time) int8 ...
    signal_conf_ph  (delta_time, ds_surf_type) int8 ...
    weight_ph       (delta_time) uint8 ...
Attributes:
    Description:  Contains arrays of the parameters for each received photon.
    data_rate:    Data are stored at the photon detection rate.
```
```python
mean_normal = ds_normal.h_ph.mean()
mean_smart = ds_smart.h_ph.mean()

assert mean_normal == mean_smart
```
```python
print(f"""
1 data granule
  Size: {round(ordered[0].size(), 2)} MB
Egress:
  without earthaccess: {round(fo_normal.cache.total_requested_bytes / 1024**2, 2)} MB
  with earthaccess   : {round(fo_smart.cache.total_requested_bytes / 1024**2, 2)} MB
Time to science:
  without earthaccess: {round(t_normal/60, 2)} minutes
  with earthaccess   : {round(t_smart/60, 2)} minutes
""")
```
```
1 data granule
  Size: 349.38 MB
Egress:
  without earthaccess: 3199.29 MB
  with earthaccess   : 112.0 MB
Time to science:
  without earthaccess: 15.9 minutes
  with earthaccess   : 0.52 minutes
```
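Where does that difference come from? HDF5 metadata is scattered throughout the file, so a naive remote open issues many small range requests and re-reads the same bytes over and over (hence 3.2 GB of egress for a 349 MB granule). The sketch below is only an illustration of the kind of I/O tuning involved, done by hand with fsspec's block cache; it is not necessarily how earthaccess implements smart open, and the URL and cache settings are made up for the example.

```python
# Illustration only: manual cache tuning for remote HDF5 reads with fsspec.
# The URL is hypothetical and Earthdata auth options are omitted for brevity.
import fsspec
import xarray as xr

url = "https://data.example.org/ATL03/ATL03_20220208031618_07311402_006_01.h5"

fs = fsspec.filesystem("https")
with fs.open(url, mode="rb",
             cache_type="blockcache",            # keep fetched blocks around
             block_size=8 * 1024 * 1024) as f:   # read 8 MB per request (assumed value)
    ds = xr.open_dataset(f, group="/gt1l/heights", engine="h5netcdf")
    print(ds)
```

Picking block sizes and cache strategies per file and per access pattern is exactly the kind of decision users should not have to make by hand, which is the point of a smarter open.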
5. Why is this important?
Source: 2023 NASA ESDS Program Highlights
As data in the cloud grows, we need to give researchers the means to run cloud-native workflows, the simple way!
Cloud-native workflows can get expensive; earthaccess can help!
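To make "expensive" concrete, here is a back-of-the-envelope sketch using the per-granule egress measured above. The price per GB is an assumed, illustrative figure, not a quoted cloud rate, and who actually pays for egress depends on where the computation runs.

```python
# Back-of-the-envelope only: egress volumes come from the single-granule
# comparison above; the USD/GB rate is an assumption for illustration.
PRICE_PER_GB = 0.09                      # assumed egress price, USD/GB

naive_gb = 3199.29 / 1024                # egress without earthaccess, per granule
smart_gb = 112.0 / 1024                  # egress with earthaccess smart open

for n_granules in (1, 1_000, 100_000):
    naive_cost = n_granules * naive_gb * PRICE_PER_GB
    smart_cost = n_granules * smart_gb * PRICE_PER_GB
    print(f"{n_granules:>7} granules: ~${naive_cost:>12,.2f}  vs  ~${smart_cost:>10,.2f}")
```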
6. Cloud compute with earthaccess: from terabyte- to petabyte-scale analysis
We are going to answer a scientific question: "How cold do the Great Lakes get in December?"
- Local execution: 10 years, 2 days around Santa's visit, ~10 GB of HDF
- Cloud execution: 20 years, all of December, ~0.25 TB of HDF
We'll use NASA's GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1).
Now you're probably asking: what AWS integration is he going to use? Lambda? Step Functions? Perhaps a novel SQS-powered pipeline...
A little background story
Me and others talking to Nobel Prize winner Paul Romer, a very approachable guy who happens to be a Pythonista. Paris, 2023.
- After talking to Paul Romer, I met some Dask core maintainers and we started chatting about our projects. They had left NVIDIA to start a company called Coiled.
- They (Coiled) were looking for a simple package to handle NASA APIs for a climate startup; I was looking for ways to unlock the real potential of earthaccess.
- A collaboration between academia and industry led to improvements in earthaccess.
```python
%%capture
import coiled
import dask.bag as db
import numpy as np

december_ts = []
total_size = 0
total_granules = 0

for year in range(2013, 2023):
    print(f"Searching for year {year}: ")
    results = earthaccess.search_data(
        short_name="MUR-JPL-L4-GLOB-v4.1",
        # temporal=(f"{year}-12-01", f"{year}-12-31"),
        temporal=(f"{year}-12-24", f"{year}-12-25"),
    )
    total_size += round(sum([g.size() for g in results]), 2)
    total_granules += len(results)
    december_ts.append({f"{year}": results})
```
1 | print(f"granules: {total_granules}, total size: {round(total_size/1024, 2)}GB") |
granules: 20, total size: 9.45GB
“Unfortunately, no one can be told what cloud compute at scale is.
You have to see it for yourself.”
— Luis López
```python
@coiled.function(
    name="clusty",
    region="us-west-2",                   # Run in the same region as data
    environ=earthaccess.auth_environ(),   # Forward Earthdata auth to cloud VMs
    arm=True,                             # Use ARM-based instances
    vm_type="x2gd.medium",
    local=True
)
def process(data):
    granules, params = data
    results = []
    io_stats = {
        "total": 0,
        "hits": 0,
        "miss": 0
    }
    bbox = params["bbox"]
    variables = params["variables"]
    agg = params["agg"]

    def aggr_func(dataset, op):
        """Apply different aggregations to an xarray Dataset."""
        supported_ops = {
            'mean': lambda ds: ds.mean("time"),
            'median': lambda ds: ds.median("time"),
            'min': lambda ds: ds.min("time"),
            'max': lambda ds: ds.max("time"),
            'std': lambda ds: ds.std("time")}
        try:
            func = supported_ops[op]
            return func(dataset)
        except KeyError:
            raise ValueError(f'"{op}" aggregation operation not supported')

    earthaccess.login()
    fileset = earthaccess.open(granules, smart=True)
    min_lon, min_lat, max_lon, max_lat = bbox

    for file in fileset:
        ds = xr.open_dataset(file)
        ds = ds.sel(lon=slice(min_lon, max_lon), lat=slice(min_lat, max_lat))
        # this is particular to this dataset but can be parametrized
        cond = (ds.sea_ice_fraction < 0.15) | np.isnan(ds.sea_ice_fraction)
        result = ds.where(cond)
        results.append(result[variables])
        # This is only to keep track of IO
        io_stats["total"] += file.cache.total_requested_bytes
        io_stats["hits"] += file.cache.hit_count
        io_stats["miss"] += file.cache.miss_count

    cube = xr.concat(results, dim="time")
    if agg:
        cube_redux = aggr_func(cube, agg)
        cube_redux["time"] = cube.time.min().values
        return (cube_redux, io_stats)
    return (cube[variables], io_stats)
```
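Since a Coiled function can be called like a regular Python callable, one way to sanity-check `process` before scaling out is to run it on a single year's granules. The snippet below is a hypothetical check, reusing `december_ts` from the search cell; the params dict mirrors the one built in the next cell.

```python
# Hypothetical sanity check on one year of granules before mapping over all years.
params = {
    "bbox": (-94.93, 40.26, -73.83, 48.95),
    "variables": ["analysed_sst"],
    "agg": "mean",
}

sample = (list(december_ts[0].values())[0], params)   # (granules for 2013, params)
cube, io = process(sample)
print(cube.analysed_sst.shape, io)
```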
Remote code execution, or local code execution... our choice.
```python
# Step 3: Run function on each file in parallel
params = {
    "bbox": (-94.93, 40.26, -73.83, 48.95),
    "variables": ["analysed_sst"],
    "agg": "mean"
}

data = list(map(lambda e: (list(e.values())[0], params), december_ts))
dask_bag = db.from_sequence(data)

c = dask_bag.map(process)
c.visualize()
```
```python
%%time
computation = process.map(data)

dds, iostats = zip(*computation)

# This is only to consolidate the I/O stats
agg = {key: sum(i[key] for i in iostats if key in i)
       for key in set(a for l in iostats for a in l.keys())}
agg["total"] = agg["total"] / (1024*1024)  # MB
agg
```
```
CPU times: user 28.9 s, sys: 5.62 s, total: 34.6 s
Wall time: 2min 37s
```
{'miss': 252, 'hits': 1968, 'total': 1008.0}
```python
ds = xr.concat(dds, dim="time")

# transform to Celsius
ds["analysed_sst"] = ds["analysed_sst"] - 273.15
ds.analysed_sst.attrs['units'] = 'degC'

ds
```
```
<xarray.Dataset>
Dimensions:       (time: 10, lat: 870, lon: 2111)
Coordinates:
  * lat           (lat) float32 40.26 40.27 40.28 40.29 ... 48.93 48.94 48.95
  * lon           (lon) float32 -94.93 -94.92 -94.91 ... -73.85 -73.84 -73.83
  * time          (time) datetime64[ns] 2013-12-24T09:00:00 ... 2022-12-24T09...
Data variables:
    analysed_sst  (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
```
```python
hvplot.output(widget_location="bottom")

ds.analysed_sst.hvplot.contourf(
    groupby="time",
    cmap="viridis",
    x="lon", y="lat",
    levels=10,
    coastline=True)
```
```python
ds.isel(time=0).sel(lat=[42, 44, 46, 48]).hvplot.kde("analysed_sst", by='lat', alpha=0.5)
```
```python
ds.isel(time=3).sel(lat=[42, 44, 46, 48]).hvplot.kde("analysed_sst", by='lat', alpha=0.5)
```
1 | print(f"total egress: {round(agg['total']/1024, 2)}gb") |
total egress: 0.98 GB
Cloud computing with earthaccess
Thanks to the earthaccess community and the projects that have supported this work!