{ "cells": [ { "cell_type": "markdown", "id": "65a32b4a-9359-47f0-a05a-db58b3d1ef56", "metadata": {}, "source": [ "## Using Kerchunk\n", "\n", "In this notebook, we'll use a Kerchunk index file to speed up the *metadata reading* for a large collection of NetCDF files. The actual data will still be in the original NetCDF files.\n", "\n", "## Kerchunk Background\n", "\n", "In the last notebook, we saw that accessing data from the NetCDF file over the network was slow, in part because it was making a bunch of HTTP requests just to read some metadata that's scattered around the NetCDF file. With a Kerchunk index file, you get to bypass all that seeking around for metadata: it's already been extracted into the index file. While that's maybe not a huge deal for a *single* NetCDF file, it matters a bunch when you're dealing with thousands of NetCDF files (1,000 files * 1.5 seconds per file = ~25 minutes *just to read metadata*)." ] }, { "cell_type": "code", "execution_count": 1, "id": "a1218108-1a56-4dd7-8e2f-799393f3ee7b", "metadata": { "tags": [] }, "outputs": [], "source": [ "import adlfs\n", "import xarray as xr\n", "import fsspec\n", "import json\n", "import odc.geo\n", "import rioxarray\n", "import requests\n", "import pyproj\n", "import planetary_computer\n", "import pystac_client\n", "import geopandas\n", "\n", "# force xarray to import everything\n", "xr.tutorial.open_dataset(\"air_temperature\");" ] }, { "cell_type": "code", "execution_count": 2, "id": "15b91d65-7f7c-43a4-949d-e19150d6802d", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 580 ms, sys: 59.9 ms, total: 640 ms\n", "Wall time: 945 ms\n" ] }, { "data": { "text/html": [ "
<xarray.Dataset>\n", "Dimensions: (time: 6866, feature_id: 2776738, reference_time: 1)\n", "Coordinates:\n", " * feature_id (feature_id) float64 101.0 179.0 181.0 ... 1.18e+09 1.18e+09\n", " * reference_time (reference_time) datetime64[ns] 2022-06-29\n", " * time (time) datetime64[ns] 2022-06-29T01:00:00 ... 2023-04-21T...\n", "Data variables:\n", " crs (time) object dask.array<chunksize=(1,), meta=np.ndarray>\n", " nudge (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>\n", " qBtmVertRunoff (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>\n", " qBucket (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>\n", " qSfcLatRunoff (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>\n", " streamflow (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>\n", " velocity (time, feature_id) float64 dask.array<chunksize=(1, 2776738), meta=np.ndarray>\n", "Attributes: (12/19)\n", " Conventions: CF-1.6\n", " NWM_version_number: v2.2\n", " TITLE: OUTPUT FROM NWM v2.2\n", " cdm_datatype: Station\n", " code_version: v5.2.0-beta2\n", " dev: dev_ prefix indicates development/internal me...\n", " ... ...\n", " model_output_type: channel_rt\n", " model_output_valid_time: 2022-06-29_01:00:00\n", " model_total_valid_times: 18\n", " proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...\n", " station_dimension: feature_id\n", " stream_order_output: 1
<xarray.Dataset>\n", "Dimensions: (time: 6992, y: 3840, x: 4608, reference_time: 1)\n", "Coordinates:\n", " * reference_time (reference_time) datetime64[ns] 2022-06-29\n", " * time (time) datetime64[ns] 2022-06-29T01:00:00 ... 2023-04-26T...\n", " * x (x) float64 -2.303e+06 -2.302e+06 ... 2.303e+06 2.304e+06\n", " * y (y) float64 -1.92e+06 -1.919e+06 ... 1.918e+06 1.919e+06\n", "Data variables:\n", " ACCET (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " FSNO (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " SNEQV (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " SNOWH (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " SNOWT_AVG (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " SOILSAT_TOP (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " crs (time) object dask.array<chunksize=(1,), meta=np.ndarray>\n", "Attributes:\n", " Conventions: CF-1.6\n", " GDAL_DataType: Generic\n", " NWM_version_number: v2.2\n", " TITLE: OUTPUT FROM NWM v2.2\n", " code_version: v5.2.0-beta2\n", " model_configuration: short_range\n", " model_initialization_time: 2022-06-29_00:00:00\n", " model_output_type: land\n", " model_output_valid_time: 2022-06-29_01:00:00\n", " model_total_valid_times: 18\n", " proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
<xarray.Dataset>\n", "Dimensions: (time: 7306, y: 3840, x: 4608, reference_time: 1)\n", "Coordinates:\n", " * reference_time (reference_time) datetime64[ns] 2022-06-29\n", " * time (time) datetime64[ns] 2022-06-29T01:00:00 ... 2023-04-29T...\n", " * x (x) float64 -2.303e+06 -2.302e+06 ... 2.303e+06 2.304e+06\n", " * y (y) float64 -1.92e+06 -1.919e+06 ... 1.918e+06 1.919e+06\n", "Data variables:\n", " LWDOWN (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " PSFC (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " Q2D (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " RAINRATE (time, y, x) float32 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " SWDOWN (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " T2D (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " U2D (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " V2D (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>\n", " crs object ...\n", "Attributes:\n", " NWM_version_number: v2.2\n", " model_configuration: short_range\n", " model_initialization_time: 2022-06-29_00:00:00\n", " model_output_type: forcing\n", " model_output_valid_time: 2022-06-29_01:00:00\n", " model_total_valid_times: 18
<xarray.DataArray 'T2D' (time: 7306, y: 337, x: 519)>\n", "dask.array<getitem, shape=(7306, 337, 519), dtype=float64, chunksize=(1, 328, 433), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * time (time) datetime64[ns] 2022-06-29T01:00:00 ... 2023-04-29T11:00:00\n", " * x (x) float64 2.95e+04 3.05e+04 3.15e+04 ... 5.465e+05 5.475e+05\n", " * y (y) float64 5.65e+04 5.75e+04 5.85e+04 ... 3.915e+05 3.925e+05\n", " crs int64 0\n", "Attributes:\n", " cell_methods: time: point\n", " esri_pe_string: PROJCS["Lambert_Conformal_Conic",GEOGCS["GCS_Sphere",DAT...\n", " grid_mapping: crs\n", " long_name: 2-m Air Temperature\n", " proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000.0 +lat_1=30.0...\n", " remap: remapped via ESMF regrid_with_weights: Bilinear\n", " standard_name: air_temperature\n", " units: K