Significant increase in file size (x10) after applying a function to NetCDF data with xarray

Posted on 2025-01-25 23:52:17

I am working on ERA5 reanalysis data in NetCDF format, and I need to compute wind direction based on U and V components.

I already have a working piece of Python code that does this using xarray and pandas and saves the output as NetCDF.

Although I do not add any data (I only apply a transformation that computes a new variable in place of U and V), the output file is much larger than the input.
The input files are about 30 MB each and the output file is about 300 MB.

Can someone explain what is going on? It must have something to do with the way my output file is being shaped, but I don't see how. Is it due to the encoding of the file, the data type used (float32 in both input and output here), or some other NetCDF format issue?

Also, do you have any idea how I could reduce the size of the output file?

To help you understand the differences, here is a summary of the input file:

<xarray.Dataset>
Dimensions:     (latitude: 7, longitude: 8, time: 113952)
Coordinates:
    number      int64 ...
  * time        (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
    step        timedelta64[ns] ...
    surface     float64 ...
  * latitude    (latitude) float64 47.5 47.25 47.0 46.75 46.5 46.25 46.0
  * longitude   (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
    valid_time  (time) datetime64[ns] ...
Data variables:
    u10         (time, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2022-05-04T13:10 GRIB to CDM+CF via cfgrib-0.9.9...

And the output file:

<xarray.Dataset>
Dimensions:     (latitude: 7, longitude: 8, time: 113952)
Coordinates:
  * time        (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
  * latitude    (latitude) float64 46.0 46.25 46.5 46.75 47.0 47.25 47.5
  * longitude   (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
Data variables:
    number      (time, latitude, longitude) int64 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
    step        (time, latitude, longitude) timedelta64[ns] 00:00:00 ... 00:0...
    surface     (time, latitude, longitude) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
    valid_time  (time, latitude, longitude) datetime64[ns] 1979-01-01 ... 199...
    direction   (time, latitude, longitude) float32 17.76 26.89 ... 180.1 178.5

The only difference I can see is that some of the original coordinates are now data variables.

一梦浮鱼 2025-02-01 23:52:17

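The most important thing is that all of your variables are now indexed by (time, latitude, longitude), so each one has the full array size (7 x 8 x 113952). Previously, number, step and surface were scalars (not arrays, just a single value), and valid_time was indexed only by time. Since all of these are 64-bit, your output now effectively has 4 new full-size variables, each twice the size of u10. Together with the float32 direction variable, that accounts for a 9x increase.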
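The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it counts uncompressed array bytes for the shapes and dtypes shown in the question and ignores the small coordinate arrays and file metadata.

```python
# Number of values in one full (time, latitude, longitude) array.
n_values = 113952 * 7 * 8

BYTES_32 = 4  # float32 (u10, direction)
BYTES_64 = 8  # int64, float64, datetime64[ns], timedelta64[ns]

# Input: a single float32 variable (u10).
input_bytes = n_values * BYTES_32

# Output: four broadcast 64-bit variables (number, step, surface, valid_time)
# plus the float32 direction variable.
output_bytes = n_values * (4 * BYTES_64 + BYTES_32)

ratio = output_bytes / input_bytes

print(f"input : {input_bytes / 1e6:.1f} MB")
print(f"output: {output_bytes / 1e6:.1f} MB")
print(f"ratio : {ratio:.1f}x")
```

The uncompressed estimate comes out at roughly 25 MB versus 230 MB, which is in the same range as the 30 MB and 300 MB reported in the question.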

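To make sure this doesn't happen, be careful to do math and reshape operations on DataArrays rather than on Datasets. xarray works well with Datasets, but its behavior is not always intuitive while you are still getting the hang of it, and automatic broadcasting like this is one of the things that can catch you off guard. For this reason I always recommend doing the work with DataArrays and creating the Dataset only at the end, just before saving. See the xarray documentation on automatic alignment for more on this topic.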
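A minimal sketch of that workflow is shown below. The asker's code is not shown, so several things here are assumptions: the data is synthetic, the `v10` variable name is a guess at the V component, and the direction formula is the common meteorological convention (direction the wind blows from, in degrees clockwise from north), which may differ from what the asker used.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the ERA5 input (hypothetical values, tiny grid).
time = np.arange("1979-01-01", "1979-01-03", dtype="datetime64[h]").astype("datetime64[ns]")
dims = ("time", "latitude", "longitude")
coords = {
    "time": time,
    "latitude": [47.5, 47.25, 47.0],
    "longitude": [-3.0, -2.75],
    "number": 0,     # scalar coordinates, as in the input summary
    "surface": 0.0,
}
rng = np.random.default_rng(0)
shape = (time.size, 3, 2)
u10 = xr.DataArray(rng.normal(size=shape).astype("float32"), dims=dims, coords=coords, name="u10")
v10 = xr.DataArray(rng.normal(size=shape).astype("float32"), dims=dims, coords=coords, name="v10")

# Do the math on DataArrays only.
# Assumed convention: direction the wind blows FROM, degrees clockwise from north.
direction = (180.0 + np.degrees(np.arctan2(u10, v10))) % 360.0
direction = direction.astype("float32").rename("direction")

# Drop the scalar coordinates so they cannot end up as full-size variables.
direction = direction.drop_vars(["number", "surface"])

# Build the Dataset only at the end, just before saving.
ds_out = direction.to_dataset()
print(ds_out)
```

Printing `ds_out` before writing is a cheap way to confirm that `direction` is the only data variable and that nothing else has picked up the full (time, latitude, longitude) shape.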

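I would also expect that, if you got this data from ECMWF, the source data is compressed on disk, which is not the default for ds.to_netcdf. See the compression options in the xarray documentation on writing netCDF files.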
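As a rough sketch of what enabling compression can look like, the helpers below build a per-variable `encoding` dictionary and pass it to `to_netcdf`. The names `compression_encoding` and `save_compressed` are hypothetical, the compression level of 4 is an arbitrary choice, and the `zlib` and `complevel` keys assume a netCDF4-based backend is installed.

```python
def compression_encoding(var_names, complevel=4):
    """Return an encoding dict that turns on zlib compression per variable."""
    return {name: {"zlib": True, "complevel": complevel} for name in var_names}


def save_compressed(ds, path, complevel=4):
    """Write an xarray Dataset to `path` with every data variable compressed."""
    encoding = compression_encoding(list(ds.data_vars), complevel)
    ds.to_netcdf(path, encoding=encoding)


# Example of the encoding produced for the output variable in the question.
print(compression_encoding(["direction"]))
```

Calling `save_compressed(ds_out, "direction.nc")` on the final Dataset (the file name is a placeholder) would then write a compressed file. How much it shrinks depends on the data, so it is worth comparing file sizes with and without the encoding.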