What's a good way to store raster data?

Posted on 2024-07-05 16:02:58

I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:

251
 12.76 12.55 12.55 12.34 [etc., 200 more values...]
 13.02 12.95 12.70 12.40 [etc., 200 more values...]
 [etc., 250 more lines]
252
 [etc., etc.]

I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).

We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
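On a regular grid the coordinates never need to be stored per value, because they follow from the row and column indices. The sketch below illustrates that idea only; the origin `LAT0`/`LON0` and the assumption that rows run northward and columns eastward are hypothetical, and only the 0.2 degree spacing comes from the description above.

```python
# Assumed grid layout (hypothetical values, not taken from the post):
LAT0 = -25.0   # latitude of row 0
LON0 = 100.0   # longitude of column 0
STEP = 0.2     # grid spacing in degrees, as stated in the question


def cell_to_latlon(row, col):
    """Return the (lat, lon) of a grid cell from its indices alone."""
    return LAT0 + row * STEP, LON0 + col * STEP


def latlon_to_cell(lat, lon):
    """Return the (row, col) of the cell nearest to a coordinate."""
    return round((lat - LAT0) / STEP), round((lon - LON0) / STEP)
```

With this mapping, a day's grid can stay a plain block of numbers, and the georeferencing lives in three constants rather than in every record.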

I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.

Any bright ideas or products?

Comments (5)

可是我不能没有你 2024-07-12 16:02:58

I'd definitely change from text to binary, but still keep each day in a separate file. You could name the files so that inserting a day in between doesn't cause any strangeness with indices, for example by including the date (and possibly the time) in the filename. You could also reconsider the file structure if, for example, you have several fields per location. Is it common to look up a small tile across a large number of timesteps? In that case you might want to store the data as tiles, each containing several days. You didn't mention how the data is accessed, which plays a big role in how to organise it efficiently.
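A minimal sketch of the one-binary-file-per-day idea, using only Python's standard library. The `.f32` extension, the tiny `ROWS` x `COLS` grid, and the function names are assumptions made for illustration; the values are written as raw 32-bit floats in the machine's native byte order, which would need to be pinned down before sharing files across platforms.

```python
import array
import datetime
import os

ROWS, COLS = 3, 4  # tiny hypothetical grid; the real one is roughly 250 x 200


def day_path(directory, day):
    """Build a filename from the ISO date, so names sort chronologically."""
    return os.path.join(directory, day.isoformat() + ".f32")


def write_day(directory, day, values):
    """Write one day's grid as raw 32-bit floats in row-major order."""
    if len(values) != ROWS * COLS:
        raise ValueError("expected %d values" % (ROWS * COLS))
    with open(day_path(directory, day), "wb") as f:
        array.array("f", values).tofile(f)


def read_day(directory, day):
    """Read one day's grid back as a flat list of floats."""
    data = array.array("f")
    with open(day_path(directory, day), "rb") as f:
        data.fromfile(f, ROWS * COLS)
    return data.tolist()
```

Because the date is the filename, adding a day that was missing is just another `write_day` call; nothing else has to be renumbered.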

初雪 2024-07-12 16:02:58

I've assembled your comments here:

  1. I'd like to do all this "w/o writing my own file I/O code"
  2. I need access from "Java Ruby MATLAB" and "FORTRAN routines"

When you add these up, you definitely don't want a new file format. Stick with the one you've got.

If we can get you to relax your first requirement - i.e., if you'd be willing to write your own file I/O code - then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)

You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."

My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
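The pair of converters mentioned here could look roughly like the sketch below. It is not the poster's code: it assumes the text layout shown in the question (a day-number line followed by a fixed number of rows), two decimal places on output, and it uses a plain dict as a stand-in for whatever the new storage format would be.

```python
def parse_days(text, rows):
    """Parse the text format: a day-number line, then `rows` lines of values."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    days = {}
    i = 0
    while i < len(lines):
        day = int(lines[i])
        block = lines[i + 1:i + 1 + rows]
        days[day] = [[float(v) for v in ln.split()] for ln in block]
        i += 1 + rows
    return days


def format_days(days):
    """Write the grids back out in the same text layout, sorted by day."""
    out = []
    for day in sorted(days):
        out.append(str(day))
        for row in days[day]:
            out.append(" " + " ".join("%.2f" % v for v in row))
    return "\n".join(out) + "\n"
```

Since the grids are keyed by day, inserting a day between two existing ones is a dict assignment followed by `format_days`, and hand-editing stays possible by round-tripping through the text form.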

You said something that I want to address:

"leverage 40 yrs of DB optimization"

Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.

Here's the most useful thing I can tell you, based on everything you've told us. You said this:

"I am more interested in optimizing my time than the CPU's, though exec speed is good!"

This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.

And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)

等待圉鍢 2024-07-12 16:02:58

Clarifications:

I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?

Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?

You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?

Is "compression" one of the things you care about, or are you just sick of flat files?

Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?

Also, what programming language(s) do you want to access this data with?

神经暖 2024-07-12 16:02:58

Matt, thanks very much, and likewise longneck and jirv.

This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.

And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)

I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that constructs these data, we're always coming up with new consumers for it, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
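As a rough illustration of a read-only HTTP front end, here is a sketch built on Python's standard `http.server`. The `/days/<n>` URL scheme, the JSON response shape, and the in-memory `GRIDS` dict are all assumptions for the example; a real version would read from the existing files through the C or Ruby routines.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store: day number -> grid (list of rows).
GRIDS = {251: [[12.76, 12.55], [13.02, 12.95]]}


def lookup(path):
    """Map a URL path such as '/days/251' to (status, JSON body)."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "days" and parts[1].isdecimal():
        grid = GRIDS.get(int(parts[1]))
        if grid is not None:
            return 200, json.dumps({"day": int(parts[1]), "grid": grid})
    return 404, json.dumps({"error": "not found"})


class GridHandler(BaseHTTPRequestHandler):
    """Read-only handler: only GET is implemented."""

    def do_GET(self):
        status, body = lookup(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("", port), GridHandler).serve_forever()
```

Calling `serve()` would expose the data to any client that can issue a GET, whatever language it is written in, and leaving out write methods keeps the front end strictly on the reading side.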

Thanks again, folks.

绝不放开 2024-07-12 16:02:58

Your answer on how to store the data depends entirely on what you're going to do with the data. For example, if you only ever need to retrieve by specifying a date or a date range, then storing each day in a database as a BLOB makes some sense. But if you need to find records that have certain values, you'll need to do something different.

Please describe how you need to be able to access the data.
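For the retrieve-by-date case, a BLOB-per-day table might be sketched as follows with Python's built-in `sqlite3`. The table name `grids`, the ISO date strings used as keys, and the 32-bit native-endian float packing are assumptions chosen for the example, not something specified in the thread.

```python
import array
import sqlite3


def open_store(path=":memory:"):
    """Create (if needed) a table with one row per day and one BLOB per grid."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grids (day TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put_day(conn, day, values):
    """Store one day's values as packed 32-bit floats, replacing any old row."""
    blob = array.array("f", values).tobytes()
    conn.execute("INSERT OR REPLACE INTO grids (day, data) VALUES (?, ?)", (day, blob))


def get_range(conn, first, last):
    """Return [(day, values)] for first <= day <= last, in date order."""
    rows = conn.execute(
        "SELECT day, data FROM grids WHERE day BETWEEN ? AND ? ORDER BY day",
        (first, last),
    )
    result = []
    for day, blob in rows:
        values = array.array("f")
        values.frombytes(blob)
        result.append((day, values.tolist()))
    return result
```

ISO dates sort correctly as text, so a date range is a simple `BETWEEN`. Searching by cell value, on the other hand, would mean unpacking every BLOB, which is why that access pattern calls for a different layout.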
