Significant increase in file size (x10) after applying a function to NetCDF data with xarray

Posted on 2025-01-25 23:52:17

I am working on ERA5 reanalysis data in NetCDF format, and I need to compute wind direction from the U and V components.

I already have a working piece of Python code that does this using xarray and pandas and saves the output as NetCDF.

Although I do not add any data (I only apply a transformation that computes a new variable in place of U and V), the output file is far larger than the input.
The input files are about 30 MB each and the output file is about 300 MB.

Can someone explain what is going on? It must have something to do with the way my output file is structured, but I don't see how. Is it due to the encoding of the file, the data type used (float32 in both input and output here), or some other NetCDF format issue?

Also, do you have any idea how I could reduce the size of the output file?

To help you understand the differences, here is a summary of the input file:

<xarray.Dataset>
Dimensions:     (latitude: 7, longitude: 8, time: 113952)
Coordinates:
    number      int64 ...
  * time        (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
    step        timedelta64[ns] ...
    surface     float64 ...
  * latitude    (latitude) float64 47.5 47.25 47.0 46.75 46.5 46.25 46.0
  * longitude   (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
    valid_time  (time) datetime64[ns] ...
Data variables:
    u10         (time, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2022-05-04T13:10 GRIB to CDM+CF via cfgrib-0.9.9...

And of the output file:

<xarray.Dataset>
Dimensions:     (latitude: 7, longitude: 8, time: 113952)
Coordinates:
  * time        (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
  * latitude    (latitude) float64 46.0 46.25 46.5 46.75 47.0 47.25 47.5
  * longitude   (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
Data variables:
    number      (time, latitude, longitude) int64 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
    step        (time, latitude, longitude) timedelta64[ns] 00:00:00 ... 00:0...
    surface     (time, latitude, longitude) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
    valid_time  (time, latitude, longitude) datetime64[ns] 1979-01-01 ... 199...
    direction   (time, latitude, longitude) float32 17.76 26.89 ... 180.1 178.5

The only difference I can see is that some of the original coordinates are now data variables.


Comments (1)

一梦浮鱼 2025-02-01 23:52:17

Most importantly, all your variables are now indexed by (time, latitude, longitude), so each one has the full array size of (7 x 8 x 113952). Previously, number, step, and surface were scalars (a single value each, not arrays), and valid_time was indexed only by time. Since all of these are 64-bit, your new dataset effectively has 4 new variables, each of which is twice the size of u10. Together with the float32 direction variable itself, that alone accounts for a 9x increase.
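As a rough check of that 9x figure, here is a back-of-the-envelope calculation using the dimension sizes from the two summaries in the question. It counts uncompressed bytes only and ignores the small 1-D coordinates and the file header, so treat the numbers as estimates rather than exact file sizes.

```python
# Dimension sizes taken from the dataset summaries in the question.
n_time, n_lat, n_lon = 113952, 7, 8
n_cells = n_time * n_lat * n_lon

FLOAT32_BYTES = 4
BYTES_64BIT = 8  # int64, float64, datetime64[ns] and timedelta64[ns] are all 8 bytes

# Input: one float32 field (u10); the scalar coordinates are negligible.
input_bytes = n_cells * FLOAT32_BYTES

# Output: direction (float32) plus four broadcast 64-bit variables
# (number, step, surface, valid_time).
output_bytes = n_cells * FLOAT32_BYTES + 4 * n_cells * BYTES_64BIT

ratio = output_bytes / input_bytes
print(f"input  ~ {input_bytes / 1e6:.1f} MB")
print(f"output ~ {output_bytes / 1e6:.1f} MB")
print(f"ratio  = {ratio:.0f}x")
```

The estimate lands in the same range as the roughly 30 MB and 300 MB reported in the question.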

To ensure this doesn't happen, try being really careful to do math and reshape operations only with DataArrays, not Datasets. Xarray works perfectly well with Datasets, but the behavior isn't always intuitive when you're just getting the hang of it, and automatic broadcasting like this is one of the things that can catch you off guard. For this reason, I always recommend that people do their work with DataArrays and only create a Dataset just before writing. See the docs on Broadcasting by dimension name and Automatic Alignment for more on this topic.
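A minimal sketch of that DataArray-first workflow follows. The question does not show its own code, so everything here is an assumption: the data is a small synthetic stand-in, `v10` is a guessed name for the V component, and the direction formula is the usual meteorological convention (the direction the wind blows from), which may differ from what the asker used. The relevant step is `reset_coords(drop=True)`, which discards the non-index coordinates before the Dataset is built.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in for the ERA5 files (sizes and values are made up).
time = np.arange("1979-01-01T00", "1979-01-02T00", dtype="datetime64[h]")
time = time.astype("datetime64[ns]")
lat = np.array([47.5, 47.25, 47.0])
lon = np.array([-3.0, -2.75])
dims = ("time", "latitude", "longitude")
shape = (time.size, lat.size, lon.size)

# Same coordinate layout as the input summary: three scalar coordinates
# plus valid_time, which depends on time only.
coords = {
    "time": time,
    "latitude": lat,
    "longitude": lon,
    "number": 0,
    "step": np.timedelta64(0, "ns"),
    "surface": 0.0,
    "valid_time": ("time", time),
}

rng = np.random.default_rng(0)
u10 = xr.DataArray(rng.normal(size=shape).astype("float32"),
                   dims=dims, coords=coords, name="u10")
# "v10" is an assumed name for the V component.
v10 = xr.DataArray(rng.normal(size=shape).astype("float32"),
                   dims=dims, coords=coords, name="v10")

# Assumed formula: meteorological wind direction in degrees.
direction = (180.0 + np.degrees(np.arctan2(u10, v10))) % 360.0
direction = direction.astype("float32")

# Drop number, step, surface and valid_time so they cannot end up
# as full-size data variables in the output.
direction = direction.reset_coords(drop=True)

# Only now wrap the result in a Dataset, right before writing.
ds_out = direction.to_dataset(name="direction")
print(ds_out)
```

If you need to keep the scalar coordinates, leave out the `reset_coords` call and check the printed summary to confirm they still appear under Coordinates with no dimensions.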

I'd also expect that, if you got this data from ECMWF, the source data may be compressed on disk, which is not the default for ds.to_netcdf. See the docs on writing netCDF files for a discussion of compression options.
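For the compression part, a hedged sketch: `to_netcdf` accepts a per-variable `encoding` dictionary, and `zlib` and `complevel` are among the documented keys. The helper names, the compression level of 4, and the output path below are illustrative choices, not anything from the question. Writing with zlib needs a NetCDF4-capable backend (netCDF4 or h5netcdf), so the actual write is left as a commented-out call.

```python
import numpy as np
import xarray as xr


def compression_encoding(ds, complevel=4):
    """Build a per-variable encoding dict enabling zlib (hypothetical helper)."""
    return {name: {"zlib": True, "complevel": complevel} for name in ds.data_vars}


def write_compressed(ds, path, complevel=4):
    """Write ds to path with zlib compression on every data variable."""
    ds.to_netcdf(path, encoding=compression_encoding(ds, complevel))


# Tiny stand-in for the real output dataset.
ds_out = xr.Dataset(
    {"direction": (("time", "latitude", "longitude"),
                   np.zeros((4, 3, 2), dtype="float32"))}
)

encoding = compression_encoding(ds_out)
print(encoding)

# Hypothetical output path; uncomment once netCDF4 or h5netcdf is installed.
# write_compressed(ds_out, "direction.nc")
```

Higher `complevel` values trade write time for smaller files, so it is worth comparing a couple of levels on your own data.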
