Strategy for writing an expanding, ordered file to disk

Published on 2024-07-29 11:20:31

I am a graduate student in nuclear physics currently working on a data analysis program. The data consists of billions of multidimensional points.

Anyway, I am using space-filling curves to map the multiple dimensions to a single dimension, and I am using a B+ tree to index the pages of data. Each page will have some constant maximum number of points within it.
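
The question does not say which space-filling curve is in use; purely as an illustration, here is a minimal C++ sketch of a Morton (Z-order) mapping that interleaves the bits of the 2-byte coordinates described in the edit below into a single sort key. The function name and the 64-bit key width are assumptions for the sketch and only cover low dimensionalities.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration only: build a Morton (Z-order) key by interleaving the bits of
// the 2-byte coordinates, most significant bit first.  A 64-bit key only holds
// up to four 16-bit coordinates; higher dimensionalities would need a wider
// key type or a different curve (e.g. Hilbert).
uint64_t mortonKey(const std::vector<uint16_t>& coords) {
    uint64_t key = 0;
    for (int bit = 15; bit >= 0; --bit) {
        for (std::size_t d = 0; d < coords.size(); ++d) {
            key = (key << 1) | ((coords[d] >> bit) & 1u);
        }
    }
    return key;
}
```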

As I read the raw data (several hundred gigabytes) in from the original files and preprocess and index it, I need to insert the individual points into pages. Obviously there will be far too many pages to simply store them in memory and then dump them to disk. So my question is this: what is a good strategy for writing the pages to disk so that there is a minimum of reshuffling of data when a page hits its maximum size and needs to be split?

Based on the comments let me reduce this a little.

I have a file that will contain ordered records. These records are being inserted into the file, and there are too many of them to simply do this in memory and then write to the file. What strategy should I use to minimize the amount of reshuffling needed when I insert a record?

If this is making any sense at all, I would appreciate any solutions you might have.

Edit:
The data are points in multidimensional spaces, essentially lists of integers. Each of these integers is 2 bytes, but each integer also has an additional 2 bytes of metadata associated with it. So 4 bytes per coordinate, and anywhere between 3 and 20 coordinates. So essentially the data consists of billions of chunks, each chunk somewhere between 12 and 100 bytes. (Obviously points with 4 dimensions will be located in a different file than points with 5 dimensions once they have been extracted.)
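
As a rough illustration of that layout (all the names here are made up for the sketch), one coordinate and one point might be modelled like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical in-memory layout matching the sizes above: a 2-byte value plus
// 2 bytes of metadata per coordinate, i.e. 4 bytes per coordinate.
struct Coordinate {
    uint16_t value;     // e.g. a gamma-ray energy channel
    uint16_t metadata;  // the associated 2 bytes of metadata
};
static_assert(sizeof(Coordinate) == 4, "expected 4 bytes per coordinate");

// One point (event): between 3 and 20 coordinates; on disk each point is a
// chunk of roughly 12 to 100 bytes as described above.  Points of different
// dimensionality go into different files.
using Point = std::vector<Coordinate>;
```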

I am using techniques similar to those discussed in this article:
http://www.ddj.com/184410998

Edit 2:
I kinda regret asking this question here, so consider it officially rescinded; but here is my reason for not using off-the-shelf products. My data are points that range anywhere from 3 to 22 dimensions. If you think of each point as simply a list, then the way I want to query the points amounts to: what are all the numbers that appeared in the same lists as these numbers? Below are some examples with low dimensionality and many fewer data points than normal (see the sketch after the example):
Example:
Data
237, 661, 511, 1021
1047, 661, 237
511, 237, 1021
511, 661, 1047, 1021

Queries:
511
1021
237, 661
1021, 1047
511, 237, 1047

Responses:
237, 661, 1021, 237, 1021, 661, 1047, 1021
237, 661, 511, 511, 237, 511, 661, 1047
511, 1021, 1047
511, 661
_
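
As I read the example, a query is a set of values, it matches every event containing all of them, and the response lists every other value from the matching events in order of appearance (the trailing underscore marks an empty response). A brute-force, in-memory sketch of just those semantics, not of the indexed on-disk version:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Brute-force illustration of the query semantics only; the real system would
// answer this from the space-filling-curve / B+ tree index, not a linear scan.
// A query matches every event that contains all of the queried values, and the
// response is every other value from the matching events, in order.
std::vector<uint16_t> query(const std::vector<std::vector<uint16_t>>& events,
                            const std::vector<uint16_t>& wanted) {
    std::vector<uint16_t> response;
    for (const auto& event : events) {
        bool matches = std::all_of(wanted.begin(), wanted.end(), [&](uint16_t w) {
            return std::find(event.begin(), event.end(), w) != event.end();
        });
        if (!matches) continue;
        for (uint16_t v : event) {
            if (std::find(wanted.begin(), wanted.end(), v) == wanted.end()) {
                response.push_back(v);  // report only values not in the query itself
            }
        }
    }
    return response;
}
```

For example, running this over the four data rows above with the query {237, 661} yields 511, 1021, 1047, matching the third response line.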

So that is a difficult little problem for most database programs, though I know of some that can handle it well.

But the problem gets more complex. Not all the coordinates are the same. Many times we just run with Gammasphere by itself, and so each coordinate represents a gamma-ray energy. But at other times we insert neutron detectors into Gammasphere, or a detector system called Microball, or sometimes the nuclides produced in Gammasphere are channeled into the Fragment Mass Analyzer; all of those and more detector systems can be used singly or in any combination with Gammasphere. Unfortunately, we almost always want to be able to select on this additional data in a manner similar to that described above. So now coordinates can have different meanings: if one just has Microball in addition to Gammasphere, you can make up an n-dimensional event in as many ways as there are positive solutions to the equation x + y = n. Additionally, each coordinate has metadata associated with it. So each of the numbers I showed would have at least 2 additional numbers associated with it: the first, a detector number, for the detector that picked up the event; the second, an efficiency value, to describe how many times that particular gamma ray counts (since the percentage of gamma rays entering the detector that are actually detected varies with the detector and with the energy).

I sincerely doubt that any off-the-shelf database solution can do all these things and perform well at the same time without an enormous amount of customization. I believe that the time spent on that would be better spent writing my own, much less general, solution. Because of the loss of generality, I do not need to implement a delete function for any of the database code, I do not need to build secondary indices to gate on different types of coordinates (just one set, effectively counting each point only once), etc.

Comments (3)

梦里人 2024-08-05 11:20:31

I believe you should first look at what commercial and free databases have to offer. They are designed to perform fast range searches (given the right indexes) and efficiently manage memory and reading/writing pages to disk.

Failing that, have a look at one of the variants of Binary Space Partition (BSP) trees.

情独悲 2024-08-05 11:20:31

So the first aspect is to do this in a threaded application to get through it quicker. Break your chunks of data into workable sections. Which leads me to think...

I was initially going to suggest that you use Lucene... but thinking about it, this really sounds like something you should process with Hadoop. It was made for this sort of work (assuming you have the infrastructure for it).

I most certainly wouldn't do this in a database.

When you are speaking of indexing data and filling documents with data points... and you don't have the infrastructure, know-how, or time to implement Hadoop, then you should revert back to my original thought and use Lucene. You can actually index your data that way and store your data points directly into an index (by numeric range, I would think) with whatever "document" (object) structure you think is best.

半夏半凉 2024-08-05 11:20:31

I have come up with an answer myself. As events are inserted into pages, when a page needs to split, a new page is made at the end of the file. Half of the events of the original page are moved to that page. This leaves the pages unsorted, which somewhat defeats the fast retrieval mechanisms.
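
A rough sketch of that split step, with hypothetical names (Page, MAX_EVENTS, splitPage) and the file modelled as an in-memory vector of pages for brevity:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page: events kept sorted by their curve key (payload omitted).
struct Page {
    std::vector<uint64_t> keys;
};
constexpr std::size_t MAX_EVENTS = 1024;  // assumed constant page capacity

// Called when a page exceeds MAX_EVENTS: append a brand-new page at the end of
// the file (modelled here as a vector of pages) and move the upper half of the
// events into it.  The returned page number is what the B+ tree records, so
// logical order is preserved even though the new page sits at the end on disk.
std::size_t splitPage(Page& full, std::vector<Page>& file) {
    Page upper;
    const std::size_t half = full.keys.size() / 2;
    upper.keys.assign(full.keys.begin() + half, full.keys.end());
    full.keys.resize(half);
    file.push_back(std::move(upper));
    return file.size() - 1;
}
```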

However, since I only write to the db in one big initial rush (probably lasting several days), I can justify spending a little extra time after the writing to go through the pages and sort them once they have all been built. This part is in fact quite easy because of the nature of the B+ tree used to index the pages. I simply start at the leftmost leaf node of the B+ tree, read the first page and put it first in a final file, then read the second page and put it second, and so on and so forth.

In this manner, at the end of the insert, all the pages will be sorted within their files, allowing the methods I am using to map multidimensional requests to single-dimensional indexes to work efficiently and quickly when reading the data from disk.
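
A sketch of that final pass, with the B+ tree leaf scan and the file I/O hidden behind assumed helper declarations (leafPageNumbers, readPage, and appendPage are placeholders, not real APIs):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One-off compaction pass after the bulk load.  The helpers below are
// placeholders for the real B+ tree and file code, not existing APIs.
struct Page { std::vector<uint64_t> keys; };           // simplified page, as above

std::vector<std::size_t> leafPageNumbers();            // assumed: leaf pages, leftmost leaf first
Page readPage(std::size_t pageNo);                     // assumed: read one page from the old file
void appendPage(const Page& page);                     // assumed: append one page to the new file

// Walk the leaves in key order and copy each page into the new file in that
// order, so the rewritten file is fully sorted on disk.
void compactInOrder() {
    std::size_t newPageNo = 0;
    for (std::size_t oldPageNo : leafPageNumbers()) {
        appendPage(readPage(oldPageNo));               // pages land in key order
        // the index would also be updated here: oldPageNo now lives at newPageNo
        ++newPageNo;
    }
}
```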
