I am working on a data analysis task using climate variables provided in netCDF files and want to use xarray to load and process the data. I have monthly files of hourly gridded data for each variable with a naming convention somewhat like:
{prefix}_{variablename}_YYYY_MM.nc
Each file is ~1.5 GB indexed by three coordinates: time (720 values for a 30-day month), latitude (721), longitude (1440). I extract the variable values at a particular grid cell (specified by lon/lat of a point of interest), calculate the average over the month, and finally want to calculate a long-term average (30 years for example) of monthly averages.
I tried a few things. With xarray.open_mfdataset() the grouped averaging just sat there for an hour. So I decided to extract the average of the variable from each file, stuff them in a list, and convert the list to a dataframe to group and average again later.
This works better, but I have some trouble with the performance of the data extraction. The main loop, with the code to time each loop looks like this, with the variable being snow depth sd:
output = []
for yr in range(startyr, endyr+1):
    for mth in range(1, 13):
        start = time.time()
        fp = datadirpath / f"{prefix}_sd_{yr}_{str(mth).zfill(2)}.nc"
        with xr.open_dataset(fp) as src:
            one_month_snow = src['sd'].sel(
                latitude=lat, longitude=lon, 
                method='nearest')
            output.append([yr, mth, one_month_snow.mean().values.item()])
        stop = time.time()
        print(stop - start)
Each iteration takes about 22 seconds. That seems long. But very weirdly, if I re-run the exact same code, each iteration only takes <10 ms.
What makes no difference:
- Whether I run this in a Jupyter Notebook or from a stand-alone script
- Whether I restart the Python 3 kernel
- Whether I use .sel(..., method='nearest')or provide the exact lon/lat coordinates
I'm a bit stumped here. Something is being cached, on the OS level? How can I optimize this code, since I'm gonna do a lot of this exact thing? Should I use chunking?
