In my experimentation so far, I've tried the following (sketched roughly in the code after this list):
- `xr.open_dataset` with the `chunks` arg, and it loads the data into memory.
- Set up a `NetCDF4DataStore` and call `ds['field'].values`, and it loads the data into memory.
- Set up a `ScipyDataStore` with `mmap='r'`, and `ds['field'].values` loads the data into memory.
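For reference, those attempts looked roughly like the snippet below. Treat it as a sketch rather than exact code: the data-store constructors have changed signatures across xarray versions, `'file.nc'`, `'field'`, and the chunk sizes are placeholders, and I'm writing `mmap=True` here since the scipy store appears to forward that flag to `scipy.io.netcdf_file`.

```python
import xarray as xr

# Attempt 1: open_dataset with chunks -- .values still pulls the data into memory.
ds = xr.open_dataset('file.nc', chunks={'dim1': 1000})
arr = ds['field'].values

# Attempt 2: explicit NetCDF4 data store -- same outcome.
store = xr.backends.NetCDF4DataStore.open('file.nc')
ds = xr.open_dataset(store)
arr = ds['field'].values

# Attempt 3: scipy data store with memory-mapping enabled -- .values still loads into memory.
store = xr.backends.ScipyDataStore('file.nc', mmap=True)
ds = xr.open_dataset(store)
arr = ds['field'].values
```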
From what I have seen, the design seems to center not around actually applying numpy functions to memory-mapped arrays, but rather around loading small chunks into memory (sometimes using memory-mapping to do so). For example, this comment. And a somewhat related comment here about xarray not being able to determine whether a numpy array is memory-mapped or not.
I'd like to be able to represent and slice data as an `xarray.Dataset`, and be able to call `.values` (or `.data`) to get an `ndarray`, but have it remain memory-mapped (for shared-memory purposes and so on).
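To illustrate what I mean by "remain mmapped": a plain NumPy memmap already behaves this way, since slices stay as views onto the file-backed buffer and several processes can map the same file without each holding a private copy. A minimal sketch (the filename, dtype, and shape are made up, and the file must already exist with that size):

```python
import numpy as np

# Read-only, file-backed array; pages are only read when touched.
mm = np.memmap('data.bin', dtype='float64', mode='r', shape=(10000, 1000))

view = mm[::2, :]        # still a view onto the mapped file
print(type(view))        # <class 'numpy.memmap'>
print(view.base is mm)   # True -- no copy was made
```

That is the behavior I'd like `.values` to preserve.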
It would also be nice if chunked dask operations could at least operate on the memory-mapped array until they actually need to mutate something, which seems possible since dask appears to be designed around immutable arrays.
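For the read-only case this already seems workable by handing the memmap directly to dask, since each chunk is just a slice of it; a sketch with placeholder filename, shape, and chunk sizes:

```python
import numpy as np
import dask.array as da

mm = np.memmap('data.bin', dtype='float64', mode='r', shape=(10000, 1000))

# Each chunk is a view of the memmap, so pages are read lazily as chunks
# are processed; an explicit name skips content-based tokenization.
x = da.from_array(mm, chunks=(1000, 1000), name='memmapped-data')
total = x.sum().compute()
```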
I did find a trick with xarray, though, which is to do the following:
```python
import numpy as np
import xarray as xr

data = np.load('file.npy', mmap_mode='r')
ds = xr.Dataset({'foo': (['dim1', 'dim2'], data)})
```
At this point, things like the following work without loading anything into memory:
```python
np.sum(ds['foo'].values)
np.sum(ds['foo'][::2, :].values)
```
...xarray apparently doesn't know that the array is memory-mapped, and can't afford to impose an `np.copy` for cases like these.
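One rough way I check whether a copy has sneaked in is to walk the result's `.base` chain and see whether it bottoms out at an `np.memmap` (this is just a heuristic of mine, not an xarray API):

```python
import numpy as np

def backed_by_memmap(arr):
    """Walk the .base chain to see whether arr ultimately views a
    np.memmap, i.e. whether slicing avoided a copy."""
    while arr is not None:
        if isinstance(arr, np.memmap):
            return True
        arr = getattr(arr, 'base', None)
    return False

# Using the `ds` built from the np.load trick above:
backed_by_memmap(ds['foo'].values)
backed_by_memmap(ds['foo'][::2, :].values)
```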
Is there a "supported" way to do read-only memmapping (or copy-on write for that matter) in xarray or dask?