Fetching a specific cell value from a NetCDF file: the first fetch is slow
I am accessing a NetCDF file using the xarray Python library. The specific file that I am using is publicly available.
So, the file has several variables, and for most of them the dimensions are time: 4314, x: 700, y: 562. I am using the ET_500m variable, but the behaviour is similar for the other variables as well. The chunking is 288, 36, 44.
I am retrieving a single cell and printing the value using the following code:
import xarray as xr
ds = xr.open_dataset('./dataset_greece.nc')
print(ds.ET_500m.values[0][0][0])
According to my understanding, xarray should directly locate the position on disk of the chunk that contains the corresponding value and read it. Since a chunk should be no bigger than a couple of MBs, I would expect this to take a few seconds or even less. Instead, it takes more than 2 minutes.
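The couple-of-MBs expectation can be checked with quick arithmetic from the chunk sizes given above (the 4-byte float32 dtype is an assumption, not stated in the question):

```python
# One chunk of ET_500m holds 288 x 36 x 44 values (chunk sizes from the question).
# Assuming 4-byte float32 values, a single chunk is under 2 MB:
chunk_bytes = 288 * 36 * 44 * 4
print(chunk_bytes)  # 1824768 bytes, i.e. ~1.8 MB
```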
If, in the same script, I retrieve the value of another cell, even one located in a different chunk (e.g. print(ds.ET_500m.values[1000][500][500])), then this second retrieval takes only some milliseconds.
So my question is: what exactly causes this overhead in the first retrieval?
EDIT: I just saw that xarray's open_dataset has an optional cache parameter, which according to the manual:
If True, cache data loaded from the underlying datastore in memory as NumPy arrays when accessed to avoid reading from the underlying datastore multiple times. Defaults to True [...]
So, when I set this to False, subsequent fetches are also as slow as the first one. But my question remains: why is it so slow, given that I am only accessing a single cell? I was expecting xarray to locate the chunk on disk directly and read only a couple of MBs.
1 Answer
Rather than selecting from the .values property, subset the array first. The problem is that .values coerces the data to a NumPy array, so you're loading all of the data and then subsetting the resulting array. There's no way around this for xarray - NumPy doesn't have any concept of lazy loading, so as soon as you call .values, xarray has no option but to load (or compute) all of your data.

If the data is a dask-backed array, you could use .data rather than .values to access the dask array and use positional indexing on it, e.g. ds.ET_500m.data[0, 0, 0]. But if the data is just a lazily loaded NetCDF variable, .data will have the same load-everything pitfall described above.
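A minimal sketch of the subset-first fix. The tiny in-memory array below is a stand-in for the real file (the variable name is taken from the question, but the dim order and dtype are assumptions); with a real NetCDF-backed lazy array, the second pattern reads only the data it needs instead of the whole variable:

```python
import numpy as np
import xarray as xr

# Tiny in-memory stand-in for the real dataset (variable name from the
# question; dim order and float32 dtype are assumptions).
da = xr.DataArray(
    np.arange(24, dtype="float32").reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="ET_500m",
)

# Slow pattern from the question: .values converts the WHOLE array to
# NumPy first, then subscripts the result.
v_slow = da.values[0][0][0]

# Fix: index the DataArray first, then pull the value. On a lazily
# loaded NetCDF variable this avoids materialising everything.
v_fast = da[0, 0, 0].values

print(v_slow, v_fast)  # both 0.0
```

The same idea works with label-aware indexing, e.g. da.isel(time=0, y=0, x=0).values (dim names here are the assumed ones).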