What's a good way to store raster data?

Posted 2024-07-05 16:02:58

I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:

251
 12.76 12.55 12.55 12.34 [etc., 200 more values...]
 13.02 12.95 12.70 12.40 [etc., 200 more values...]
 [etc., 250 more lines]
252
 [etc., etc.]

I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
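For illustration only, here is a minimal sketch of reading the layout above into a structure keyed by day-of-year, where inserting a day between two existing ones becomes a plain assignment. It assumes that a line holding a single integer starts a new day and that every other non-blank line is one row of values; the name `parse_grid_text` and the tiny sample grid are hypothetical.

```python
from typing import Dict, List, Optional

Grid = List[List[float]]


def parse_grid_text(text: str) -> Dict[int, Grid]:
    """Parse 'day header + rows of values' text into {day_of_year: grid}."""
    days: Dict[int, Grid] = {}
    current: Optional[Grid] = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # Assumption: a line with exactly one integer token is a day header.
        if len(tokens) == 1 and tokens[0].isdigit():
            current = []
            days[int(tokens[0])] = current
        elif current is None:
            raise ValueError("data row found before any day header")
        else:
            current.append([float(t) for t in tokens])
    return days


# Hypothetical 2x4 sample; the real grids are far larger.
sample = """251
 12.76 12.55 12.55 12.34
 13.02 12.95 12.70 12.40
253
 11.10 11.20 11.30 11.40
 11.50 11.60 11.70 11.80
"""

grids = parse_grid_text(sample)
# Inserting day 252 between two existing days is a keyed assignment.
grids[252] = [[0.0] * 4 for _ in range(2)]
ordered_days = sorted(grids)
```

Because the days live in a mapping rather than at fixed positions in a file, ordering is recovered by sorting the keys whenever it is needed.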

We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.

I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.

Any bright ideas or products?


可是我不能没有你 2024-07-12 16:02:58

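I'd definitely change from text to binary, but still keep each day in a separate file. You can name the files so that inserting one between two others doesn't upset any indexing, for example by putting the date (and possibly the time) in the filename. You might also think about the file structure if, for example, you have several fields per location. Is it common to look up a small tile across a large number of timesteps? In that case you may want to store the data as tiles that each contain several days. You didn't mention how the data is accessed, and that plays a big part in how to organize it efficiently.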
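As a rough sketch of the one-binary-file-per-day idea, the snippet below stores each grid as raw 32-bit floats in a file named after its ISO date. The names `day_path`, `write_day` and `read_day`, the `.f32` extension, the 2x4 grid shape and the dates are all hypothetical, and `array("f")` uses the machine's native byte order, so files written this way are not portable across architectures without extra care.

```python
import array
import datetime
import tempfile
from pathlib import Path
from typing import List

N_ROWS, N_COLS = 2, 4  # hypothetical grid shape; the real grid is far larger


def day_path(root: Path, day: datetime.date) -> Path:
    # ISO dates sort lexicographically, so sorted filenames are in time order.
    return root / f"grid_{day.isoformat()}.f32"


def write_day(root: Path, day: datetime.date, rows: List[List[float]]) -> None:
    flat = array.array("f", [v for row in rows for v in row])
    if len(flat) != N_ROWS * N_COLS:
        raise ValueError("grid does not match the expected shape")
    day_path(root, day).write_bytes(flat.tobytes())


def read_day(root: Path, day: datetime.date) -> List[List[float]]:
    flat = array.array("f")
    flat.frombytes(day_path(root, day).read_bytes())
    return [list(flat[r * N_COLS:(r + 1) * N_COLS]) for r in range(N_ROWS)]


root = Path(tempfile.mkdtemp())
grid = [[12.76, 12.55, 12.55, 12.34], [13.02, 12.95, 12.70, 12.40]]
write_day(root, datetime.date(2008, 9, 7), grid)
write_day(root, datetime.date(2008, 9, 9), grid)
# "Inserting" a day between two existing ones is just writing one more file.
write_day(root, datetime.date(2008, 9, 8), grid)

names = sorted(p.name for p in root.iterdir())
restored = read_day(root, datetime.date(2008, 9, 8))
```

Values come back with float32 precision, so comparisons against the original text values need a small tolerance.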


初雪 2024-07-12 16:02:58

I've assembled your comments here:

I'd like to do all this "w/o writing my own file I/O code"
I need access from "Java Ruby MATLAB" and "FORTRAN routines"

When you add these up, you definitely don't want a new file format. Stick with the one you've got.

If we can get you to relax your first requirement - i.e., if you'd be willing to write your own file I/O code - then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)

You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."

My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
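To make the converter pair concrete, here is a minimal sketch (in Python rather than C++, purely for brevity) of a text-to-binary tool and its inverse. The record layout, a little-endian header of day, row count and column count followed by 32-bit floats, is an assumption invented for this example, as are the function names and the two-decimal output format.

```python
import struct
from typing import List, Tuple

# Hypothetical record header: day, n_rows, n_cols as little-endian int32.
HEADER = struct.Struct("<iii")


def _parse(text: str) -> List[Tuple[int, List[List[float]]]]:
    records: List[Tuple[int, List[List[float]]]] = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Assumption: a single integer token on a line is a day header.
        if len(tokens) == 1 and tokens[0].isdigit():
            records.append((int(tokens[0]), []))
        elif not records:
            raise ValueError("data row found before any day header")
        else:
            records[-1][1].append([float(t) for t in tokens])
    return records


def text_to_binary(text: str) -> bytes:
    out = bytearray()
    for day, rows in _parse(text):
        n_cols = len(rows[0]) if rows else 0
        if any(len(row) != n_cols for row in rows):
            raise ValueError(f"ragged grid for day {day}")
        out += HEADER.pack(day, len(rows), n_cols)
        for row in rows:
            out += struct.pack(f"<{n_cols}f", *row)
    return bytes(out)


def binary_to_text(data: bytes) -> str:
    lines: List[str] = []
    offset = 0
    while offset < len(data):
        day, n_rows, n_cols = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        lines.append(str(day))
        row_fmt = struct.Struct(f"<{n_cols}f")
        for _ in range(n_rows):
            values = row_fmt.unpack_from(data, offset)
            offset += row_fmt.size
            # Assumption: two decimals, matching the sample in the question.
            lines.append(" " + " ".join(f"{v:.2f}" for v in values))
    return "\n".join(lines) + "\n"


text = "251\n 12.76 12.55\n 13.02 12.95\n252\n 11.10 11.20\n 11.50 11.60\n"
blob = text_to_binary(text)
round_trip = binary_to_text(blob)
```

With a pair like this, hand-editing stays possible: convert to text, edit, convert back.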

You said something that I want to address:

"leverage 40 yrs of DB optimization"

Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.

Here's the most useful thing I can tell you, based on everything you've told us. You said this:

"I am more interested in optimizing my time than the CPU's, though exec speed is good!"

This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.

And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)


等待圉鍢 2024-07-12 16:02:58

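Clarifications:

I'm surprised that you added "database" as one of the tags and are considering it as an option. Why would you do that?

Essentially, you have a 2D, single-component floating-point image at each time step. Would you agree with that way of looking at your data?

You also mentioned wanting to insert a day between two existing days - that seems like a very odd thing to do. Why would you need to do that? Is there a new day between May 4th and May 5th that I don't know about?

Is "compression" one of the things you care about, or are you just sick of flat files?

Is a float or double sufficient to store your data, or do you feel you need arbitrary precision beyond that?

Also, what programming language(s) do you want to use to access this data?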


神经暖 2024-07-12 16:02:58

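Matt, thanks very much, and likewise 长颈 and 吉夫.

This post was partly an experiment to test the quality of stackoverflow discussion. If you guys/gals/alien lifeforms are representative, I'm sold.

To the point: you've clarified my thinking considerably. Mind you, I may still not do what you advise, but I'll consider it very seriously. >;-)

I may well leave the file format as it is, add to the existing C and/or Ruby routines to supply some of the low-level functionality I'm missing (such as inserting missing timesteps), and hang an HTTP front end on the whole thing, so that the data can be consumed by whatever box needs it, in whatever language is currently in vogue. While the legacy software that builds these data is mostly unchanging, we're always coming up with new consumers for it, so the multi-language/multi-computer requirement (oops, did I forget to mention that?) applies to the reading side, not the writing side. That also eliminates a whole range of security issues.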
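A read-only HTTP front end of the kind described could look roughly like the sketch below, written in Python with only the standard library. The URL scheme `/days/<day-of-year>`, the in-memory `GRIDS` dictionary standing in for the real files, and the names `handle_path`, `GridHandler` and `make_server` are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Hypothetical in-memory data standing in for the real grid files.
GRIDS = {251: [[12.76, 12.55], [13.02, 12.95]]}


def handle_path(path: str) -> Tuple[int, str]:
    """Map a URL path such as '/days/251' to (HTTP status, JSON body)."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "days" and parts[1].isdigit():
        day = int(parts[1])
        if day in GRIDS:
            return 200, json.dumps({"day": day, "grid": GRIDS[day]})
        return 404, json.dumps({"error": f"no data for day {day}"})
    return 400, json.dumps({"error": "expected /days/<day-of-year>"})


class GridHandler(BaseHTTPRequestHandler):
    # Only GET is implemented, so clients can read but never write.
    def do_GET(self) -> None:
        status, body = handle_path(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 0) -> HTTPServer:
    # Port 0 asks the OS for any free port; call serve_forever() to run.
    return HTTPServer(("127.0.0.1", port), GridHandler)
```

Any language with an HTTP client and a JSON parser can then consume the data, while the writing side stays with the existing routines.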

Thanks again, folks.
