Serializing Dataclasses

This post is a bit of a tutorial on serializing and deserializing Python dataclasses. I’ve been hacking on zarr-python-v3 a bit, which uses dataclasses to represent some metadata objects. Those objects need to be serialized to and deserialized from JSON. This is a (surprisingly?) challenging area, and there are several excellent libraries out there that you should probably use. My personal favorite is msgspec, but cattrs, pydantic, and pyserde are also options. Hopefully this post is still helpful for understanding how those libraries work at a conceptual level (their exact implementations will look very different). In zarr-python’s case, this didn’t quite warrant bringing in a dependency, so we rolled our own. ...
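To make the idea concrete, here is a minimal sketch of a hand-rolled round trip using only the standard library. The `ChunkGrid` and `ArrayMetadata` classes are hypothetical stand-ins invented for illustration, not zarr-python's actual metadata classes, and the approach assumes plain (non-string) type annotations and only handles nested dataclasses and tuples.

```python
import dataclasses
import json
from typing import Any


# Hypothetical metadata classes, for illustration only.
@dataclasses.dataclass
class ChunkGrid:
    name: str
    chunk_shape: tuple[int, ...]


@dataclasses.dataclass
class ArrayMetadata:
    shape: tuple[int, ...]
    dtype: str
    chunk_grid: ChunkGrid


def to_json(obj: Any) -> str:
    # asdict recurses into nested dataclasses; json.dumps writes tuples as lists.
    return json.dumps(dataclasses.asdict(obj))


def from_dict(cls: type, data: dict[str, Any]) -> Any:
    kwargs = {}
    for field in dataclasses.fields(cls):
        value = data[field.name]
        if dataclasses.is_dataclass(field.type):
            # Nested dataclass: rebuild it from its own dict.
            value = from_dict(field.type, value)
        elif isinstance(value, list):
            # JSON has no tuple type, so restore tuples from lists.
            value = tuple(value)
        kwargs[field.name] = value
    return cls(**kwargs)


def from_json(cls: type, text: str) -> Any:
    return from_dict(cls, json.loads(text))
```

Calling `from_json(ArrayMetadata, to_json(meta))` should give back an object equal to `meta`. The awkward part is deserialization: JSON loses type information (tuples become lists, nested objects become dicts), so the type annotations have to drive the reconstruction. Handling unions, optionals, and generic containers is where the libraries mentioned above earn their keep.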

August 31, 2024

stac-geoparquet

I wrote up a quick introduction to stac-geoparquet on the Cloud Native Geo blog with Kyle Barron and Chris Holmes. The key takeaway: STAC GeoParquet offers a very convenient and high-performance way to distribute large STAC collections, provided the items in that collection are pretty homogeneous. Check out the project at http://github.com/stac-utils/stac-geoparquet.

August 29, 2024

What's Next? (2024 edition)

I have, as they say, some personal news to share. On Monday I (along with some very talented teammates, see below if you’re hiring) was laid off from Microsoft as part of a reorganization. Like my Moving to Microsoft post, I wanted to jot down some of the things I got to work on. For those of you wondering, the Planetary Computer project does continue, just without me.

Reflections

It should go without saying that all of this was a team effort. I’ve been incredibly fortunate to have great teammates over the years, but the team building out the Planetary Computer was especially fantastic. Just like before, this will be very self-centered and project-focused, overlooking all the other people and work that went into this. ...

August 12, 2024

My Real-World Match / Case

Ned Batchelder recently shared Real-world match/case, showing a real example of Python’s Structural Pattern Matching. These real-world examples are a great complement to the tutorial, so I’ll share mine. While working on some STAC + Kerchunk stuff, in this pull request I used the match statement to parse some nested objects:

    for k, v in refs.items():
        match k.split("/"):
            case [".zgroup"]:
                # k = ".zgroup"
                item.properties["kerchunk:zgroup"] = json.loads(v)
            case [".zattrs"]:
                # k = ".zattrs"
                item.properties["kerchunk:zattrs"] = json.loads(v)
            case [variable, ".zarray"]:
                # k = "prcp/.zarray"
                if u := item.properties["cube:dimensions"].get(variable):
                    u["kerchunk:zarray"] = json.loads(refs[k])
                elif u := item.properties["cube:variables"].get(variable):
                    u["kerchunk:zarray"] = json.loads(refs[k])
            case [variable, ".zattrs"]:
                # k = "prcp/.zattrs"
                if u := item.properties["cube:dimensions"].get(variable):
                    u["kerchunk:zattrs"] = json.loads(refs[k])
                elif u := item.properties["cube:variables"].get(variable):
                    u["kerchunk:zattrs"] = json.loads(refs[k])
            case [variable, index]:
                # k = "prcp/0.0.0"
                if u := item.properties["cube:dimensions"].get(variable):
                    u.setdefault("kerchunk:value", collections.defaultdict(dict))
                    u["kerchunk:value"][index] = refs[k]
                elif u := item.properties["cube:variables"].get(variable):
                    u.setdefault("kerchunk:value", collections.defaultdict(dict))
                    u["kerchunk:value"][index] = refs[k]

The for loop is iterating over a set of Kerchunk references, which are essentially the keys for a Zarr group. The keys vary a bit. They could be: ...

December 13, 2023

STAC Updates I'm Excited About

I wanted to share an update on a couple of developments in the STAC ecosystem that I’m excited about. It’s a great sign that, even 2 years after its initial release, the STAC ecosystem is still growing and improving how we can catalog, serve, and access geospatial data.

STAC and Geoparquet

A STAC API is a great way to query for data. But, like any API serving JSON, its throughput is limited. So in May 2022, the Planetary Computer team decided to export snapshots of our STAC database as geoparquet. Each STAC collection is exported as a Parquet dataset, where each record in the dataset is a STAC item. We pitched this as a way to do bulk queries over the data, where returning many, many pages of JSON would be slow (and expensive for our servers and database). ...

October 15, 2023

Gone Rafting

Last week, I was fortunate to attend Dave Beazley’s Rafting Trip course. The pretext of the course is to implement the Raft Consensus Algorithm. I’ll post more about Raft, and the journey of implementing it, later. But in brief, Raft is an algorithm that lets a cluster of machines work together to reliably do something. If you had a service that needed to stay up (and stay consistent), even if some of the machines in the cluster went down, then you might want to use Raft. ...

August 13, 2023

National Water Model on Azure

A few colleagues and I recently presented at the CIROH Training and Developers Conference. In preparation for that I created a Jupyter Book. You can view it at https://tomaugspurger.net/noaa-nwm/intro.html. I created a few cloud-optimized versions for subsets of the data, but those will be going away since we don’t have operational pipelines to keep them up to date. But hopefully the static notebooks are still helpful.

Lessons learned

Aside from running out of time (I always prepare too much material for the amount of time), I think things went well. JupyterHub (perhaps + Dask) and Kubernetes continue to be a great way to run a workshop. ...

May 25, 2023

Jupyter, STAC, and Tool Building

Over in Planetary Computer land, we’re working on bringing Sentinel-5P into our STAC catalog. STAC items require a geometry property, a GeoJSON object that describes the footprint of the assets. Thanks to the satellites’ orbit and the (spatial) size of the assets, we started with some…interesting… footprints: That initial footprint, shown in orange, would render the STAC collection essentially useless for spatial searches. The assets don’t actually cover (most of) the southern hemisphere. ...

April 15, 2023

py-spy in Azure Batch

Today, I was debugging a hanging task in Azure Batch. This short post records how I used py-spy to investigate the problem.

Background

Azure Batch is a compute service that we use to run container workloads. In this case, we start up a container that processes a bunch of GOES-GLM data to create STAC items for the Planetary Computer. The workflow is essentially a big

    for url in urls:
        local_file = download_url(url)
        stac.create_item(local_file)

We noticed that some Azure Batch tasks were hanging. Based on our logs, we knew it was somewhere in that for loop, but couldn’t determine exactly where things were hanging. The goes-glm stactools package we used does read a NetCDF file, and my experience with Dask biased me towards thinking the netcdf library (or the HDF5 reader it uses) was hanging. But I wanted to confirm that before trying to implement a fix. ...

February 22, 2023

Dask-GeoPandas Spatial Partitioning Performance

A colleague reached out yesterday about a performance issue they were hitting when working with the Microsoft Building Footprints dataset we host on the Planetary Computer. They wanted to get the building footprints for a small section of Turkey, but noticed that the performance was relatively slow and it seemed like a lot of data was being read. This post details how we debugged what was going on, and the steps we took to fix it. ...

February 9, 2023