Data storage for large astrophysical simulation data

Posted on 2024-11-17 17:52:13

I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods).

The outputs from these simulations are huge. The data differs a bit depending on the code, but it's always big data. You usually need billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.

Currently, it seems that the best way to read and write this kind of data is to use HDF5 http://www.hdfgroup.org/HDF5/, which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (those still give me nightmares), but I can't help but think there could be a better way to do this.

I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?

If it helps, we typically store data columnwise. That is, you have a block of all particle IDs, a block of all particle positions, a block of all particle velocities, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.
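To make the columnwise layout concrete, here is a minimal sketch using only the Python standard library. The field names, the sample values, and the helper `particles_in_box` are all hypothetical and chosen for illustration; real codes would hold these blocks in much larger arrays read from disk.

```python
from array import array

# Hypothetical columnwise ("struct of arrays") layout: one flat block per field.
ids = array("q", [10, 11, 12, 13])
x = array("d", [0.1, 0.5, 0.9, 0.4])
y = array("d", [0.2, 0.5, 0.1, 0.6])
z = array("d", [0.3, 0.5, 0.8, 0.45])


def particles_in_box(ids, x, y, z, lo, hi):
    """Return IDs of particles with lo[k] <= coordinate < hi[k] on every axis."""
    hits = []
    for i in range(len(ids)):
        if (lo[0] <= x[i] < hi[0]
                and lo[1] <= y[i] < hi[1]
                and lo[2] <= z[i] < hi[2]):
            hits.append(ids[i])
    return hits


# Look up the particles inside a cube in the middle of the unit box.
box_hits = particles_in_box(ids, x, y, z, lo=(0.3, 0.3, 0.3), hi=(0.7, 0.7, 0.7))
print(box_hits)
```

The lookup only touches the ID and position blocks, so a velocity block (or any other field) never has to be read for this kind of query, which is the main appeal of storing by column.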

edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.

edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was the headache of reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that, and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).
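As an illustration of the block-size and block-order problem, here is a minimal sketch that assumes Fortran-style record markers: a 4-byte little-endian length written before and after each block. That is one common convention for unformatted output, not necessarily what any particular code writes; the marker width, the endianness, and the helper names `write_block` and `read_block` are assumptions made for this example.

```python
import io
import struct


def write_block(f, payload):
    """Write one record: 4-byte length, payload bytes, then the same length again."""
    marker = struct.pack("<i", len(payload))
    f.write(marker + payload + marker)


def read_block(f):
    """Read one record and check that the leading and trailing markers agree."""
    head = f.read(4)
    if len(head) != 4:
        raise EOFError("no more blocks")
    (size,) = struct.unpack("<i", head)
    payload = f.read(size)
    tail = f.read(4)
    if len(payload) != size or len(tail) != 4 or struct.unpack("<i", tail)[0] != size:
        raise ValueError("block markers disagree: wrong block size, order, or endianness")
    return payload


# Round trip in memory: an ID block followed by a position block (order matters).
buf = io.BytesIO()
write_block(buf, struct.pack("<3q", 10, 11, 12))
write_block(buf, struct.pack("<9f", *[0.5] * 9))
buf.seek(0)
ids = struct.unpack("<3q", read_block(buf))
pos = struct.unpack("<9f", read_block(buf))
```

Nothing in such a file says which block holds which field or what its element type is; the reader has to know that in advance, which is the bookkeeping that a self-describing format like HDF5 takes over.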

The new issues probably come down to data structure. HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is and depending on what you are doing with it, it might be a good thing to work at a low level for performance's sake.
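For illustration, the "velocity over x at any time" query might look like this when done by hand at the lower level. This is only a sketch: the in-memory `snapshots` list, its field names, and the function `ever_faster_than` are hypothetical stand-ins for columnwise blocks that would really be read from one snapshot file at a time.

```python
import math

# Hypothetical snapshots: each one is a dict of columnwise blocks.
snapshots = [
    {"id": [1, 2, 3], "vx": [1.0, 0.0, 3.0], "vy": [0.0, 0.0, 4.0], "vz": [0.0, 2.0, 0.0]},
    {"id": [1, 2, 3], "vx": [6.0, 0.0, 0.0], "vy": [0.0, 1.0, 0.0], "vz": [8.0, 0.0, 1.0]},
]


def ever_faster_than(snapshots, threshold):
    """Return the set of particle IDs whose speed exceeds threshold in any snapshot."""
    fast = set()
    for snap in snapshots:
        for pid, vx, vy, vz in zip(snap["id"], snap["vx"], snap["vy"], snap["vz"]):
            if math.sqrt(vx * vx + vy * vy + vz * vz) > threshold:
                fast.add(pid)
    return fast


fast_ids = ever_faster_than(snapshots, 4.5)
print(sorted(fast_ids))
```

The scan is a full pass over the ID and velocity blocks of every snapshot, so its cost is dominated by I/O; a database-style interface would express the same thing in one line, but would not remove that pass unless it also maintained an index.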


单身情人 2024-11-24 17:52:13

MongoDB: http://www.mongodb.org/
Netezza Products: http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's List of Distributed File Systems: http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems

EDIT

Rationale for my lack of explanation / etc.:

OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."

What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue, so maybe that's what he means by better. If so, he'll need something with some structure, hence the first couple of suggestions.

OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"

Yes, there are several. Both structured, and "unstructured" - does he want structure, or is he happy to leave them in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some Distributed File Systems.

OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle IDs, a block of all particle positions, a block of all particle velocities, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."

Again, does the OP want better structure, or doesn't he? It seems like he wants both: better structure AND faster access. Maybe scaling OUT will give him this. This further reinforces the first few options I listed.

OP says (in comments): "I don't know if we can take the hit on io though."

Are there IO requirements? Cost restrictions? What are they?

We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about the current solution he has other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so.... ?
