What are the disadvantages of using .Rdata files compared to HDF5 or netCDF?

I have been asked to change software that currently exports .Rdata files so that it exports in a 'platform-independent binary format' such as HDF5 or netCDF. Two reasons were given:

  1. .Rdata files can only be read by R
  2. binary information is stored differently depending on the operating system or architecture

I also found that the "R Data Import/Export" manual does not discuss .Rdata files, although it does discuss HDF5 and netCDF.

A discussion on R-help suggests that .Rdata files are platform independent.

Questions:

  1. To what extent are these concerns valid?
    • e.g. can Matlab read .Rdata without invoking R?
  2. Are other formats more useful in this respect than .Rdata files?
  3. Would it be possible to write a script that would create .hdf5 analogues of all .Rdata files, minimizing changes to the program itself?

4 Answers

深陷 2024-12-17 18:14:22

Here are a variety of answers:

  1. Abundance of options First, the concern is valid, but your list of choices is a little narrower than it should be. HDF5/netCDF4 is an excellent option and works well with Python, Matlab, and many other systems. HDF5 is superior to Python's pickle storage in many ways - check out PyTables and you'll very likely see good speedups. Matlab used to have (and may still have) some issues with how large cell (or maybe struct) arrays are stored in HDF5. It's not that it can't do it, but that it was god-awful slow. That's Matlab's problem, not HDF5's. While these are great choices, you may also consider whether HDF5 alone is adequate: if you have some very large files, you could benefit from a proprietary encoding, either for speed of access or for compression. It's not too hard to do raw binary storage in any language, and you could easily design something like the file storage of bigmemory (i.e. optimized for speed of access). In fact, you could even use bigmemory files in other languages - it's really a very simple format. HDF5 is certainly a good starting point, but there is no one universal solution for data storage and access, especially when one gets to very large data sets. (For smaller data sets, you might also take a look at Protocol Buffers or other serialization formats; Dirk wrote RProtoBuf for accessing these in R.) For compression, see the next suggestion.

  2. Size As Dirk mentioned, file formats can be described as application-neutral or application-dependent. Another axis is domain-independent (or domain-ignorant) versus domain-dependent (domain-smart ;-)) storage. If you have some knowledge of how your data will arise, especially any information that can be used for compression, you may be able to build a better format than anything a standard compressor can manage. This takes a bit of work. Compressors other than gzip and bzip2 also allow you to analyze large volumes of data and develop appropriate compression "dictionaries", so that you can get much better compression than you would with .Rdat files. For many kinds of datasets, storing the delta between different rows in a table is a better option - it can lead to much greater compressibility (e.g. lots of 0s may appear), but only you know whether that will work for your data (a tiny sketch of the delta idea follows this list).

  3. Speed and access .Rdat does not support random access. It has no built-in support for parallel I/O (though you can serialize to a parallel I/O store, if you wish). There are many things one could do here to improve matters, but patching features onto .Rdat over and over again is death by a thousand cuts, compared with simply switching to a different storage mechanism and blowing the speed and access issues away. (This isn't just an advantage of HDF5: I have frequently used multicore functions to parallelize other I/O methods, such as bigmemory.)

  4. Update capabilities R does not have a very nice way to add objects to a .Rdat file. It does not, to my knowledge, offer any "viewers" that allow users to visually inspect or search through a collection of .Rdat files. It does not, to my knowledge, offer any built-in versioning or record-keeping of the objects in the file. (I do this via a separate object in the file, which records the versions of the scripts that generated the objects, but I will outsource that to SQLite in a future iteration.) HDF5 has all of these. (Also, random access affects updating of the data: with .Rdat files, you have to save the whole object again.)

  5. Communal support Although I've advocated rolling your own format, that is only for extreme data sizes. Having libraries built for many languages is very helpful in reducing the friction of exchanging data. For most simple datasets (and simple still means "fairly complex" in most cases) or moderate to fairly large datasets, HDF5 is a good format. There are certainly ways to beat it on specialized systems. Still, it is a nice standard, and it means less organizational effort will be spent supporting a proprietary or application-specific format. I have seen organizations stick to a format for many years past the life of the application that generated the data, just because so much code was written to load and save in that application's format, and GBs or TBs of data were already stored that way (this could be you & R someday, but in my case it arose from a different statistical suite, one that begins with the letter "S" and ends with the letter "S" ;-)). That's a very serious source of friction for future work. If you use a widespread standard format, you can port between it and other widespread standards with much greater ease: it's very likely someone else has decided to tackle the same problem, too. Give it a try - even if you write the converter now but don't actually convert your data for day-to-day use, at least you will have created a tool that others can pick up if a time comes when it's necessary to move to another data format.

  6. Memory With .Rdat files, you have to load or attach them in order to access objects. Most of the time, people load the file. Well, if the file is very big, there goes a lot of RAM. So, one either gets a bit smarter about using attach or separates objects into multiple files. This is quite a nuisance for accessing small parts of an object. To that end, I use memory mapping. HDF5 allows random access to parts of a file, so you need not load all of your data just to access a small part (see the rhdf5 sketch after this list). It's just part of the way things work. So, even within R, there are better options than just .Rdat files.

  7. Scripts for conversion As for your question about writing a script - yes, you can write a script that loads objects and saves them into HDF5. However, it is not necessarily wise to do this on a huge set of heterogeneous files unless you have a good understanding of what's going to be created. I couldn't begin to design this for my own datasets: there are too many one-off objects in there, and creating a massive HDF5 file library would be ridiculous. It's better to think of it like starting a database: what will you want to store, how will you store it, and how will it be represented and accessed?
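
A tiny illustration of the delta idea from point 2: differences between consecutive values in a slowly-varying column are mostly small or zero, which generic compressors handle far better than the raw values, and the transformation is lossless.

x <- c(100, 101, 101, 103, 103, 103)      # a slowly-varying column
d <- diff(x)                              # deltas: 1 0 2 0 0 - mostly zeros, compresses well
stopifnot(all(cumsum(c(x[1], d)) == x))   # reconstruction is exact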
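
To make points 1 and 6 concrete, here is a minimal sketch using the Bioconductor rhdf5 package (an assumption on my part - it is one of several R bindings, and hdf5r would work similarly). It writes a matrix, then reads back a small slice without loading the whole dataset into RAM:

library(rhdf5)                 # assumption: Bioconductor rhdf5 is installed

m <- matrix(rnorm(1e6), nrow = 1000)

h5createFile("example.h5")
h5write(m, "example.h5", "m")  # store the matrix under the name "m"

# Random access: read only rows 1-10 of the first column;
# the rest of the file stays on disk.
slice <- h5read("example.h5", "m", index = list(1:10, 1))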

Once you get your data conversion plan in place, you can then use tools like Hadoop or even basic multicore functionality to unleash your conversion program and get this done as quickly as possible.
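
For instance, a minimal sketch of the multicore route, using base R's parallel package and a hypothetical convert_one() that does the per-file .Rdata-to-HDF5 work:

library(parallel)

files <- list.files("data", pattern = "\\.Rdata$", full.names = TRUE)

# convert_one() is hypothetical: load one .Rdata file and write its
# objects out in the new format (e.g. via rhdf5 as sketched above).
# Note: mclapply() relies on forking; on Windows use parLapply() instead.
results <- mclapply(files, convert_one, mc.cores = detectCores())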

In short, even if you stay in R, you are well advised to look at other possible storage formats, especially for large, growing data sets. If you have to share data with others, or at least provide read or write access, then other formats are very much advised. There's no reason to spend your time maintaining readers/writers for other languages - it's just data, not code. :) Focus your code on how to manipulate data in sensible ways, rather than spending time on storage - other people have already done a very good job on that.

蓝天 2024-12-17 18:14:22

(Binary) file formats come in two basic flavors:

  • application-neutral, supported by public libraries and APIs (both netCDF and HDF5 fall into this camp), which facilitates the exchange of data among different programs and applications, provided they are extended with add-on packages that use those APIs

  • application-specific, designed to work with only one program, albeit more efficiently: that is what .RData does

Because R is open-source, you could re-create the RData format from your Matlab files: nothing stops you from writing a proper mex file that does so. Maybe someone has even done it already. There is no technical reason not to try - but the other route may be easier if both applications that are meant to share the data support the format equally well.

For what it is worth, back in the early/mid-1990s, I did write my own C code to write simulation files in the binary format used by Octave (which I then used to slice the data). Being able to do this with open-source software is a big plus.

拥抱我好吗 2024-12-17 18:14:22

I think I can answer some, but not all of these questions.

  1. Well, anybody who puts their mind to it can probably read an .Rdata file directly, but it's hard work for not much benefit. So I doubt that Matlab has done that. As you might recall, R can read various other systems' formats precisely because someone put in a lot of effort to do so.

  2. For text formats, CSV seems pretty "standard", but for binary formats I don't know - and CSV is not a good standard at that: how dates and quotes (especially) are handled varies wildly (and of course it only works for data tables).

  3. Of course!

Example:

for (f in list.files(".", pattern = "\\.Rdata$")) {
    e <- new.env()
    load(f, envir = e)   # load all objects from the file into environment e
    x <- as.list(e)      # named list: one entry per stored object

    # saveInOtherFormat(x, file = sub("\\.Rdata$", ".Other", f))
}
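
As one hypothetical saveInOtherFormat(), here is a sketch that writes every object in the list to a single HDF5 file, assuming the Bioconductor rhdf5 package; it only maps cleanly for straightforward objects such as vectors, matrices, and data frames:

library(rhdf5)

# Hypothetical: write each object from the list into one HDF5 file,
# under its original name. Arbitrary R objects will not round-trip.
saveInOtherFormat <- function(x, file) {
    h5createFile(file)
    for (nm in names(x)) {
        h5write(x[[nm]], file, nm)
    }
}
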
时常饿 2024-12-17 18:14:22

Point 2 is wrong: binary .RData files are portable across hardware & OS platforms. To quote from the help page for ?save:

All R platforms use the XDR (bigendian) representation of C ints and doubles in binary save-d files, and these are portable across all R platforms.

Point 1 is a function of what the data are and what other programs might usefully be applied to the data. If your code base uses save() to write specified objects that are data frames or matrices, you could easily write a small function save2hdf() to write them out as HDF or netCDF binary files, then use sed to change all occurrences of save( to save2hdf( in your codebase. At least netCDF will take a performance hit on reads, but not too bad a hit. If your code saves objects such as lists of heterogeneous objects, you probably can't use netCDF or HDF without a great deal of recoding to write out the separate component objects.
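
A sketch of such a save2hdf(), hypothetical and again assuming the Bioconductor rhdf5 package (an ncdf4-based version would look similar, but needs explicit dimension definitions):

library(rhdf5)

# Hypothetical drop-in for simple save() calls on a single data frame
# or matrix; the object is stored under its own name.
save2hdf <- function(obj, file, name = deparse(substitute(obj))) {
    if (!file.exists(file)) h5createFile(file)
    h5write(obj, file, name)
}

# usage, mirroring save(): save2hdf(mydata, file = "mydata.h5")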

Also note that netCDF 4 is still problematic in R.
