py-spy in Azure Batch

Today, I was debugging a hanging task in Azure Batch. This short post records how I used py-spy to investigate the problem. Background: Azure Batch is a compute service that we use to run container workloads. In this case, we start up a container that processes a bunch of GOES-GLM data to create STAC items for the Planetary Computer. The workflow is essentially a big `for url in urls: local_file = download_url(url); stac.create_item(local_file)` loop. We noticed that some Azure Batch tasks were hanging. Based on our logs, we knew it was somewhere in that for loop, but we couldn’t determine exactly where things were hanging. The goes-glm stactools package we used does read a NetCDF file, and my experience with Dask biased me toward thinking the netCDF library (or the HDF5 reader it uses) was hanging. But I wanted to confirm that before trying to implement a fix. ...
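To show what that investigation can look like: a minimal sketch of shelling out to py-spy to dump the stack of a stuck process. This assumes py-spy is installed and on PATH, and that you already have the hung worker’s PID; the function names here are illustrative, not from the post.

```python
import subprocess

def pyspy_dump_cmd(pid: int) -> list[str]:
    # `py-spy dump` prints the current Python stack of a running
    # process without stopping it -- ideal for inspecting a hung task.
    return ["py-spy", "dump", "--pid", str(pid)]

def dump_stacks(pid: int) -> str:
    # Requires py-spy on PATH and permission to trace the target
    # (inside a container this may need the SYS_PTRACE capability).
    result = subprocess.run(
        pyspy_dump_cmd(pid), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Reading the dumped stack shows exactly which frame the loop is blocked in, which is how you confirm (or rule out) a suspect like the NetCDF reader.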

February 22, 2023

Planetary Computer Release: January 2023

The Planetary Computer made its January 2023 release a couple of weeks back. The flagship new feature is a really cool new ability to visualize the Microsoft AI-detected Building Footprints dataset. Here’s a little demo made by my teammate, Rob. Currently, enabling this feature requires converting the data from its native geoparquet to a lot of protobuf files with Tippecanoe. I’m very excited about projects to visualize the geoparquet data directly (see Kyle Barron’s demo), but for now we needed to do the conversion. ...

February 9, 2023

Cloud Optimized Vibes

Over on the Planetary Computer team, we get to have a lot of fun discussions about doing geospatial data analysis on the cloud. This post summarizes some work we did and the (I think) interesting conversations that came out of it. Background: GOES-GLM. The instigator in this case was onboarding a new dataset to the Planetary Computer, GOES-GLM. GOES is a set of geostationary weather satellites operated by NOAA, and GLM is the Geostationary Lightning Mapper, an instrument on the satellites that’s used to monitor lightning. It produces some really neat (and valuable) data. ...

January 14, 2023

Queues in the News

I came across a couple of new (to me) uses of queues recently. When I came up with the title of this article, I knew I had to write them up together. Queues in Dask: Over at the Coiled blog, Gabe Joseph has a nice post summarizing a huge amount of effort addressing a problem that’s been vexing demanding Dask users for years. The main symptom of the problem was unexpectedly high memory usage on workers, leading to crashing workers (which in turn caused even more network communication, and so more memory usage and more crashing workers). This is actually a problem I worked on a bit back in 2019, and I made very little progress. ...

December 26, 2022

My 2022 Year in Books

It’s “Year in X” time, and here’s my 2022 Year in Books on GoodReads. I’ll cover some highlights here. Many of these recommendations came from the Incomparable’s Book Club, part of the main Incomparable podcast. In particular, episode 600, The Machine was a Vampire, is a roundup of their favorites from the 2010s. Bookended by the Murderbot Diaries: I started and ended this year (so far) with a couple of installments in the Murderbot Diaries. These follow a robotic / organic “Security Unit” that’s responsible for taking care of humans in dangerous situations. We pick up after an unfortunate incident where it seems to have gone rogue and murdered its clients (hence, the murderbot) and hacked its governor module to essentially become “free”. ...

December 21, 2022

Podcast: Revolutions

Mike Duncan is wrapping up his excellent Revolutions podcast. If you’re at all interested in history, now is a great time to pick it up. He takes the concept of “a revolution” and looks at it through the lens of a bunch of revolutions throughout history. The appendix episodes from the last few weeks have really tied things together, looking at what’s common (and not) across all the revolutions covered in the series. ...

December 20, 2022

Rebooting

Like some others, I’m getting back into blogging. I’ll be “straying from my lane” and won’t just be writing about Python data libraries (though there will still be some of that). If you too would like to blog more, I’d encourage you to read Simon Willison’s What to blog about and Matt Rocklin’s Write Short Blogposts. Because I’m me, I couldn’t just make a new post: I also had to switch static site generators, just because. All the old links, including my RSS feed, should continue to work. If you spot any issues, let me know (I think I’ve fixed at least one bug in the RSS feed; apologies for any spurious updates). Just in case, you might want to update your RSS links to http://tomaugspurger.net/index.xml. ...

December 18, 2022

What's Next?

Some personal news: Last Friday was my last day at Anaconda. Next week, I’m joining Microsoft’s AI for Earth team. This is a very bittersweet transition: while I loved working at Anaconda and all the great people there, I’m extremely excited about what I’ll be working on at Microsoft. Reflections: I was inspired to write this section by Jim Crist’s post on a similar topic: https://jcristharif.com/farewell-to-anaconda.html. I’ll highlight some of the projects I worked on while at Anaconda. If you want to skip the navel-gazing, skip down to what’s next. ...

November 11, 2020

Maintaining Performance

As pandas’ documentation claims, pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes pandas’ current setup for monitoring performance, and my personal debugging strategy for understanding and fixing performance regressions when they occur. I hope the first topic is useful for library maintainers and the second is generally useful for people writing performance-sensitive code. ...
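For context, pandas monitors performance with asv (airspeed velocity) benchmarks. A minimal sketch of what an asv-style benchmark class looks like — this particular class and operation are illustrative, not copied from pandas’ actual suite:

```python
import numpy as np
import pandas as pd

class GroupByAggregation:
    # asv convention: setup() runs before timing, and every
    # method named time_* is timed repeatedly.
    def setup(self):
        rng = np.random.default_rng(42)
        self.df = pd.DataFrame({
            "key": rng.integers(0, 100, size=100_000),
            "value": rng.standard_normal(100_000),
        })

    def time_groupby_mean(self):
        self.df.groupby("key")["value"].mean()
```

asv tracks these timings across commits, so a regression shows up as a step change in the timing history rather than an anecdote.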

April 1, 2020

Compatibility Code

Most libraries with dependencies will want to support multiple versions of those dependencies. But supporting old versions is a pain: it requires compatibility code, code that exists solely to get the same output from different versions of a library. This post gives some advice on writing compatibility code: 1. don’t write your own version parser, 2. centralize all version parsing, 3. use consistent version comparisons, 4. use Python’s argument unpacking, and 5. clean up unused compatibility code. 1. Don’t write your own version parser: it can be tempting to just do something like ...
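To illustrate the first two points, a sketch using the `packaging` library instead of a hand-rolled parser. The `PANDAS_GE_150` flag is a hypothetical example of centralizing one comparison in a single compatibility module:

```python
from packaging.version import Version
import pandas

# A hand-rolled parser based on string comparison gets multi-digit
# components wrong: lexically, "1.10.0" sorts *before* "1.9.0".
assert "1.10.0" < "1.9.0"

# packaging's Version compares release components numerically.
assert Version("1.10.0") > Version("1.9.0")

# Centralize: parse the dependency's version once, in one module,
# and import boolean flags like this everywhere else.
PANDAS_GE_150 = Version(pandas.__version__) >= Version("1.5.0")
```

Call sites then branch on the flag (`if PANDAS_GE_150: ...`) instead of re-parsing version strings, which keeps the comparisons consistent and makes dead compatibility code easy to find and delete later.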

December 12, 2019