Fast random access to a binary file, but also sequential access when needed. How should it be laid out?

Posted 2024-10-25 18:35:36

I have about 1 billion datasets that each have a DatasetKey, and each has between 1 and 50 000 000 child entries (some objects); the average is about 100, but there are many fat tails.

Once the data is written, there is no update to the data, only reads.

I need to read the data by DatasetKey and one of the following:
Get the number of child entries
Get the first 1000 child entries (or all of them if there are fewer than 1000)
Get the first 5000 child entries (or all of them if there are fewer than 5000)
Get the first 100 000 child entries (or all of them if there are fewer than 100 000)
Get all child entries

Each child entry has a size of about 20 bytes to 2 KB (about 450 bytes on average).

The layout I want to use would be the following:

I create a file with a size of at least 5 MB.
Each file contains at least one DatasetKey, but if the file is still smaller than 5 MB I add further DatasetKeys (with their child entries) until it exceeds 5 MB.
First I store a header that says at which file-offsets I will find what kind of data.
Further I plan to store serialized packages using protocol-buffers.
One package for the first 1000 entries,
one for the next 4000 entries,
one for the next 95000 entries,
one for the remaining entries (a rough sketch of this header and package layout follows below).
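
To make that concrete, here is a rough sketch of what such a per-file header could look like with protobuf-net. All type and member names are illustrative assumptions, not a finished design; EntryCount would also cover the "get number of child entries" case without touching any package.

using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class FileHeader
{
    // One index entry per DatasetKey stored in this file.
    [ProtoMember(1)]
    public List<DatasetIndexEntry> Datasets;
}

[ProtoContract]
public class DatasetIndexEntry
{
    [ProtoMember(1)] public ulong DatasetKey;
    [ProtoMember(2)] public long EntryCount;       // total number of child entries
    [ProtoMember(3)] public long OffsetFirst1000;  // file offset of the "first 1000" package
    [ProtoMember(4)] public long OffsetNext4000;   // file offset of the "next 4000" package
    [ProtoMember(5)] public long OffsetNext95000;  // file offset of the "next 95000" package
    [ProtoMember(6)] public long OffsetRest;       // file offset of the remaining entries
}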

I store the file sizes in RAM (storing all the headers would need too much RAM on the machine I use).
When I need to access a specific DatasetKey, I look up in RAM which file I need and get that file's size from RAM.
When the file size is about 5 MB or less, I read the whole file into memory and process it.
If it is more than 5 MB, I read only the first x KB to get the header, then load the positions I need from disk (see the lookup sketch below).
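
A minimal sketch of that lookup flow, assuming the FileHeader/DatasetIndexEntry types sketched above, a protobuf-serializable ChildDataSet, and a header written with a length prefix at the start of each file. The in-RAM dictionary and the 5 MB cutoff are the rough numbers from the description; nothing here is fixed.

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

public class DatasetReader
{
    private const long SmallFileLimit = 5L * 1024 * 1024;   // the rough 5 MB cutoff

    // Built at write time and kept in RAM: DatasetKey -> (file path, total file size).
    private readonly Dictionary<ulong, KeyValuePair<string, long>> fileByKey;

    public DatasetReader(Dictionary<ulong, KeyValuePair<string, long>> fileByKey)
    {
        this.fileByKey = fileByKey;
    }

    public List<ChildDataSet> ReadFirst1000(ulong datasetKey)
    {
        var location = fileByKey[datasetKey];
        using (var file = File.OpenRead(location.Key))
        {
            Stream source = file;
            if (location.Value <= SmallFileLimit)
            {
                // Small file: read it completely into memory and work on the copy.
                var buffered = new MemoryStream();
                file.CopyTo(buffered);
                buffered.Position = 0;
                source = buffered;
            }

            // The header is assumed to be stored length-prefixed at the start of the file.
            var header = Serializer.DeserializeWithLengthPrefix<FileHeader>(source, PrefixStyle.Base128);
            var entry = header.Datasets.Find(d => d.DatasetKey == datasetKey);

            // Seek to the "first 1000" package and deserialize only that package.
            source.Seek(entry.OffsetFirst1000, SeekOrigin.Begin);
            return Serializer.DeserializeWithLengthPrefix<List<ChildDataSet>>(source, PrefixStyle.Base128);
        }
    }
}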

How does this sound? Is this totally nonsense? Or a good way to go?

Using this design I had the following in mind:

I want to store my data in my own binary files instead of a database, to make it easier to back up and process the files in the future.
I would have used PostgreSQL, but I figured out that storing binary data would make PostgreSQL's TOAST perform more than one seek to access the data.
Storing one file per DatasetKey takes too much time when writing all the values to disk.
The data is computed in RAM (since not all of the data fits in RAM at once, it is computed block-wise).
The file size of 5 MB is only a rough estimate.

What do you say?
Thank you for your help in advance!

EDIT

Some more background information:

DatasetKey is of type ulong.

A child entry (there are different types) most of the time looks like the following:

public struct ChildDataSet
{
    public string Val1;
    public string Val2;
    public byte Val3;
    public long Val4;
}

I cannot tell exactly which data will be accessed. The plan is that users get access to the first 1000, 5000, 100 000, or all data of particular DatasetKeys, depending on their settings.

I want to keep the response time as low as possible and use as little disk space as possible.

@Regarding random access (Marc Gravell's question):

I do not need access to element no. 123456 for a specific DatasetKey.

When storing more than one DatasetKey (with its child entries) in one file (the way I designed it, so as not to create too many files),
I need random access to the first 1000 entries of a specific DatasetKey in that file, or to the first 5000 (in which case I would read the 1000 package and the 4000 package).

I only need access to the following regarding one specific DatasetKey (uint):
1000 child entries (or all child entries if there are fewer than 1000)
5000 child entries (or all child entries if there are fewer than 5000)
100 000 child entries (or all child entries if there are fewer than 100 000)
all child entries

All the other things I mentioned were just a design attempt on my part :-)

EDIT: streaming for one List in a class?

public class ChildDataSet
{
    [ProtoMember(1)]
    public List<Class1> Val1;
    [ProtoMember(2)]
    public List<Class2> Val2;
    [ProtoMember(3)]
    public List<Class3> Val3;
}

Could I stream Val1, for example to get just the first 5000 entries of Val1?

Comments (4)

合约呢 2024-11-01 18:35:36

Go with a single file. At the front of the file, store the ID-to-offset mapping. Assuming your ID space is sparse, store an array of ID+offset pairs, sorted by ID. Use binary search to find the right entry. That is roughly log(n/K) seeks, where "K" is the number of ID+offset pairs you can store in a single disk block (though the OS might need an additional seek or two to find each block).

If you want to spend some memory to reduce disk seeks, store an in-memory sorted array of every 10,000th ID. When looking up an ID, find the closest ID without going over. This will give you a 10,000-ID range in the header that you can binary search over. You can very precisely scale your memory usage up or down by increasing or decreasing the number of keys in the in-memory table.
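
A sketch of that two-level lookup, assuming 8-byte IDs (the question's ulong DatasetKey) and 8-byte offsets stored as fixed-size pairs in the header, with every 10,000th ID (plus the header position of its pair) kept in RAM. All names here are illustrative.

using System;
using System.IO;

public class HeaderIndex
{
    private const int Stride = 10000;                           // one sampled ID per 10,000 pairs
    private const int PairSize = sizeof(ulong) + sizeof(long);  // 8-byte ID + 8-byte data offset

    // Every 10,000th ID from the sorted header, kept in RAM, plus the byte position
    // in the file where that ID's pair starts. Built when the file is written.
    private readonly ulong[] sampleIds;
    private readonly long[] samplePositions;

    public HeaderIndex(ulong[] sampleIds, long[] samplePositions)
    {
        this.sampleIds = sampleIds;
        this.samplePositions = samplePositions;
    }

    // Returns the data offset for the given ID, or -1 if the ID is not present.
    public long FindOffset(FileStream file, ulong id)
    {
        // 1) In memory: find the closest sampled ID without going over.
        int block = Array.BinarySearch(sampleIds, id);
        if (block < 0) block = ~block - 1;   // index of the largest sample <= id
        if (block < 0) return -1;            // id is smaller than every stored ID

        // 2) On disk: read that 10,000-pair slice of the header and binary search it.
        //    (In a real implementation the slice length would be clamped to the header end.)
        file.Seek(samplePositions[block], SeekOrigin.Begin);
        var buffer = new byte[Stride * PairSize];
        int bytesRead = file.Read(buffer, 0, buffer.Length);

        int lo = 0, hi = bytesRead / PairSize - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            ulong midId = BitConverter.ToUInt64(buffer, mid * PairSize);
            if (midId == id) return BitConverter.ToInt64(buffer, mid * PairSize + sizeof(ulong));
            if (midId < id) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
}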

Dense ID space: But all of this is completely unnecessary if your ID space is relatively dense, which it seems it might be, since you have 1 billion IDs out of roughly 4 billion possible (assuming the ID is a 32-bit uint).

The sorted-array technique described above requires storing ID+offset for 1 billion IDs. Assuming offsets are 8 bytes, this requires 12 GB in the file header. If you went with a straight array of offsets instead, it would require 32 GB in the file header, but then only a single disk seek (plus the OS's seeks) is needed and no in-memory lookup table.
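
For the dense case, the lookup could be as simple as the following sketch, assuming the header is a plain array of 8-byte offsets indexed by ID, starting at byte 0 of the file, and that some sentinel value such as 0 means "no data for this ID". This layout is an assumption for illustration, not something the answer prescribes.

using System;
using System.IO;

public static class DenseIndex
{
    // Returns the data offset for the given ID by reading one 8-byte slot from the header.
    public static long FindOffset(FileStream file, uint id)
    {
        var slot = new byte[8];
        file.Seek((long)id * 8, SeekOrigin.Begin);   // header slot for this ID
        file.Read(slot, 0, 8);
        return BitConverter.ToInt64(slot, 0);        // 0 (or another sentinel) = ID not present
    }
}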

If 32 GB is too much, you can use a hybrid scheme where you use an array on the first 16 or 24 bits and a sorted array for the last 16 or 8 bits. If you have multiple levels of arrays, then you basically have a trie (as someone else suggested).

Note on multiple files: With multiple files you're basically trying to use the operating system's name lookup mechanism to handle one level of your ID-to-offset lookup. This is not as efficient as handling the entire lookup yourself.

There may be other reasons to store things as multiple files, though. With a single file, you need to rewrite your entire dataset if anything changes. With multiple files you only have to rewrite a single file. This is where the operating system's name lookup mechanism comes in handy.

But if you do end up using multiple files, it's probably more efficient for ID lookup to make sure they have roughly the same number of keys rather than the same file size.

白馒头 2024-11-01 18:35:36

The focus seems to be on the first n items; in which case, protobuf-net is ideal. Allow me to demonstrate:

using System;
using System.IO;
using System.Linq;
using ProtoBuf;


class Program
{
    static void Main()
    {
        // invent some data
        using (var file = File.Create("data.bin"))
        {
            var rand = new Random(12346);
            for (int i = 0; i < 100000; i++)
            {
                // nothing special about these numbers other than convenience
                var next = new MyData { Foo = i, Bar = rand.NextDouble() };

                Serializer.SerializeWithLengthPrefix(file, next, PrefixStyle.Base128, Serializer.ListItemTag);
            }
        }
        // read it back
        using (var file = File.OpenRead("data.bin"))
        {
            MyData last = null;
            double sum = 0;
            foreach (var item in Serializer.DeserializeItems<MyData>(file, PrefixStyle.Base128, Serializer.ListItemTag)
                .Take(4000))
            {
                last = item;
                sum += item.Foo; // why not?
            }
            Console.WriteLine(last.Foo);
            Console.WriteLine(sum);
        }
    }
}
[ProtoContract]
class MyData
{
    [ProtoMember(1)]
    public int Foo { get; set; }
    [ProtoMember(2)]
    public double Bar { get; set; }
}

In particular, because DeserializeItems<T> is a streaming API, it is easy to pull a capped quantity of data by using LINQ's Take (or just foreach with break).
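
For completeness, the foreach-with-break variant (reusing the usings, the MyData type and the "data.bin" file from the sample above; the 1000-item cutoff is just an illustration) might look like this:

using (var file = File.OpenRead("data.bin"))
{
    int count = 0;
    foreach (var item in Serializer.DeserializeItems<MyData>(file, PrefixStyle.Base128, Serializer.ListItemTag))
    {
        // ... process item ...
        if (++count == 1000) break;   // stop here; nothing beyond the first 1000 items is deserialized
    }
}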

Note, though, that the existing public dll won't love you for using struct; v2 does better there, but personally I would make that a class.

递刀给你 2024-11-01 18:35:36

Create a solution with as many settings as possible. Then create a few test scripts and see which settings work best (a small settings sketch follows after the list below).

Create some settings for:

  • Original file size
  • Separate file headers
  • Caching strategy (how much, and what, to keep in memory)
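
A trivial sketch of such a settings object (all names and default values here are illustrative), so that the test scripts can simply sweep the values:

public class LayoutSettings
{
    public long TargetFileSizeBytes = 5L * 1024 * 1024;     // original/target file size
    public bool SeparateHeaderFiles = false;                // headers inline or in a separate file
    public long MaxCachedHeaderBytes = 256L * 1024 * 1024;  // how much header/index data to keep in memory
}
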
婴鹅 2024-11-01 18:35:36

Why not try Memory-mapped files or SQL with FileStream?
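
For the memory-mapped option, a minimal sketch (assuming a package offset and length obtained from some header, and a protobuf-serializable ChildDataSet as in the question) could look like this:

using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using ProtoBuf;

public static class MappedReader
{
    // Reads one serialized package (e.g. the "first 1000 entries" block) without reading the whole file.
    public static List<ChildDataSet> ReadPackage(string path, long packageOffset, long packageLength)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewStream(packageOffset, packageLength, MemoryMappedFileAccess.Read))
        {
            return Serializer.Deserialize<List<ChildDataSet>>(view);
        }
    }
}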
