Sorting huge binary files with C#
I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
The file has no delimiters; to read the whole file you just repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number: a timestamp expressed in .NET Ticks. I need to sort this file by that timestamp, but can't seem to figure out the quickest way to do so.
Presently, I read through the file once, loading the Ticks and the start and end positions of each byte[n] into a struct, until I hit EOF. I then sort the list in memory by the Ticks property, open a BinaryReader, seek to each position in Ticks order, read the byte[n] value, and write it to an output file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET on a pretty beefy server, but disk I/O seems to be the bottleneck.
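A minimal sketch of the indexing pass described above (struct and method names are illustrative, not my exact code):

```csharp
using System.Collections.Generic;
using System.IO;

// Illustrative index entry: one small struct per item, so the whole index
// can stay in RAM even though the payloads cannot.
struct ItemIndex
{
    public long Ticks;     // byte[8]: the .NET Ticks timestamp
    public long Position;  // file offset where the byte[n] payload starts
    public int  Length;    // n, taken from the byte[4] int32 length prefix
}

static List<ItemIndex> BuildIndex(string path)
{
    var index = new List<ItemIndex>();
    using (var reader = new BinaryReader(File.OpenRead(path)))
    {
        long fileLength = reader.BaseStream.Length;
        while (reader.BaseStream.Position < fileLength)    // repeat until EOF
        {
            long ticks = reader.ReadInt64();               // byte[8]
            int n = reader.ReadInt32();                    // byte[4]
            long payloadStart = reader.BaseStream.Position;
            reader.BaseStream.Seek(n, SeekOrigin.Current); // skip byte[n]
            index.Add(new ItemIndex { Ticks = ticks, Position = payloadStart, Length = n });
        }
    }
    index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));      // in-memory sort by Ticks
    return index;
}
```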
Server Specs:
- 2× 2.6 GHz Intel Xeon (hex-core with HT; 24 threads)
- 32GB RAM
- 500GB RAID 1+0
- 2TB RAID 5
I've looked all over the internet and can only find examples where the "huge file" is 1 GB (which makes me chuckle).
Does anyone have any advice?
4 Answers
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs. So do the same thing you're doing right now, except read from memory instead of using a BinaryReader/seek/read. You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
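A minimal sketch of what that might look like with System.IO.MemoryMappedFiles, reusing the sorted ItemIndex list from the sketch in the question (names are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

// Same index-then-seek scheme, but payloads are read through a memory-mapped
// view instead of BinaryReader seeks; the OS pages in only the regions touched.
static void WriteSortedPayloads(string inputPath, string outputPath, List<ItemIndex> sortedIndex)
{
    using var mmf = MemoryMappedFile.CreateFromFile(
        inputPath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    using var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read); // 0 = map to EOF
    using var output = File.Create(outputPath);

    foreach (var item in sortedIndex)                             // already sorted by Ticks
    {
        var payload = new byte[item.Length];
        view.ReadArray(item.Position, payload, 0, item.Length);   // read byte[n] at its offset
        output.Write(payload, 0, payload.Length);
    }
}
```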
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
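For a 400 GB file this means an external merge sort: split the file into runs that fit in RAM, sort each run by Ticks, then merge the runs. A minimal sketch of the k-way merge phase (assumes .NET 6+ for PriorityQueue; run files use the same byte[8]byte[4]byte[n] layout):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// K-way merge of run files that are each already sorted by Ticks.
static void MergeRuns(string[] runPaths, string outputPath)
{
    var readers = runPaths.Select(p => new BinaryReader(File.OpenRead(p))).ToArray();
    var pending = new (long Ticks, byte[] Payload)[readers.Length];
    var heap = new PriorityQueue<int, long>();    // value: run index, priority: Ticks

    for (int i = 0; i < readers.Length; i++)      // prime with each run's first record
        if (TryReadRecord(readers[i], out pending[i]))
            heap.Enqueue(i, pending[i].Ticks);

    using var output = new BinaryWriter(File.Create(outputPath));
    while (heap.TryDequeue(out int i, out _))     // pop the globally smallest Ticks
    {
        output.Write(pending[i].Ticks);           // byte[8]
        output.Write(pending[i].Payload.Length);  // byte[4]
        output.Write(pending[i].Payload);         // byte[n]
        if (TryReadRecord(readers[i], out pending[i]))
            heap.Enqueue(i, pending[i].Ticks);    // refill from the same run
    }
    foreach (var r in readers) r.Dispose();
}

static bool TryReadRecord(BinaryReader r, out (long Ticks, byte[] Payload) rec)
{
    if (r.BaseStream.Position >= r.BaseStream.Length) { rec = default; return false; }
    long ticks = r.ReadInt64();     // byte[8]
    int n = r.ReadInt32();          // byte[4]
    rec = (ticks, r.ReadBytes(n));  // byte[n]
    return true;
}
```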
If you can learn Erlang or Go, they could be very powerful and scale extremely well, since you have 24 threads. Utilize async I/O and a merge sort.
And since you have 32 GB of RAM, try to load as much as you can into RAM, sort it there, then write back to disk.
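A minimal sketch of that load-sort-write pass, producing sorted run files that a final merge (as in the previous answer) can combine; the RAM budget and file names are illustrative:

```csharp
using System.Collections.Generic;
using System.IO;

// Read records until a rough RAM budget is filled, sort the batch by Ticks,
// and write it out as one sorted run file; repeat until EOF.
static List<string> WriteSortedRuns(string inputPath, long ramBudget = 24L << 30) // ~24 GB, illustrative
{
    var runPaths = new List<string>();
    using var reader = new BinaryReader(File.OpenRead(inputPath));
    long fileLength = reader.BaseStream.Length;

    while (reader.BaseStream.Position < fileLength)
    {
        var batch = new List<(long Ticks, byte[] Payload)>();
        long used = 0;
        while (used < ramBudget && reader.BaseStream.Position < fileLength)
        {
            long ticks = reader.ReadInt64();          // byte[8]
            int n = reader.ReadInt32();               // byte[4]
            batch.Add((ticks, reader.ReadBytes(n)));  // byte[n]
            used += 8 + 4 + n;
        }
        batch.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));

        string runPath = $"run_{runPaths.Count}.bin"; // illustrative naming
        using (var w = new BinaryWriter(File.Create(runPath)))
            foreach (var (ticks, payload) in batch)
            {
                w.Write(ticks);
                w.Write(payload.Length);
                w.Write(payload);
            }
        runPaths.Add(runPath);
    }
    return runPaths;
}
```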
I would do this in several passes. On the first pass, I would build a list of Ticks and distribute them evenly into many (hundreds of?) buckets; if you know ahead of time that the Ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into a few hundred separate files of about the same size (these much smaller files represent groups of Ticks in the order you want). Then I would sort each file separately in memory and concatenate the files.
It's somewhat similar to a hash sort (I think).
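A minimal sketch of the distribution pass (the bucket count, range handling, and names are illustrative; min/max Ticks come from the first pass, or from prior knowledge of the distribution):

```csharp
using System.IO;

// Route each record into one of `bucketCount` files by where its Ticks value
// falls between minTicks and maxTicks. Each bucket file can then be sorted
// in memory on its own, and the sorted buckets concatenated in bucket order.
static void DistributeIntoBuckets(string inputPath, long minTicks, long maxTicks, int bucketCount = 256)
{
    var writers = new BinaryWriter[bucketCount];
    for (int i = 0; i < bucketCount; i++)
        writers[i] = new BinaryWriter(File.Create($"bucket_{i:D3}.bin"));

    using (var reader = new BinaryReader(File.OpenRead(inputPath)))
    {
        long fileLength = reader.BaseStream.Length;
        decimal range = (decimal)maxTicks - minTicks + 1;   // decimal avoids long overflow
        while (reader.BaseStream.Position < fileLength)
        {
            long ticks = reader.ReadInt64();                // byte[8]
            int n = reader.ReadInt32();                     // byte[4]
            byte[] payload = reader.ReadBytes(n);           // byte[n]

            int bucket = (int)(((decimal)ticks - minTicks) * bucketCount / range);
            writers[bucket].Write(ticks);
            writers[bucket].Write(n);
            writers[bucket].Write(payload);
        }
    }
    foreach (var w in writers) w.Dispose();
}
```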