Sorting huge binary files with C#
I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
The file has no delimiters; to read the whole file you just repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number: a timestamp expressed in .NET Ticks. I need to sort this file by that timestamp, but can't seem to figure out the quickest way to do so.
Presently, I read through the file once, loading the Ticks and the start and end positions of each byte[n] into a struct, until I hit EOF. I then sort the list in memory by the Ticks property, open a BinaryReader, seek to each position in Ticks order, read the byte[n] value, and write it to an output file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET on a pretty beefy server, but disk I/O seems to be the bottleneck.
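A minimal sketch of the indexing pass described above (struct and method names are illustrative, not my exact code):

```csharp
using System.Collections.Generic;
using System.IO;

// Illustrative index entry: one small struct per item, so the whole index
// can stay in RAM even though the payloads cannot.
struct ItemIndex
{
    public long Ticks;     // byte[8]: the .NET Ticks timestamp
    public long Position;  // file offset where the byte[n] payload starts
    public int  Length;    // n, taken from the byte[4] int32 length prefix
}

static List<ItemIndex> BuildIndex(string path)
{
    var index = new List<ItemIndex>();
    using (var reader = new BinaryReader(File.OpenRead(path)))
    {
        long fileLength = reader.BaseStream.Length;
        while (reader.BaseStream.Position < fileLength)    // repeat until EOF
        {
            long ticks = reader.ReadInt64();               // byte[8]
            int n = reader.ReadInt32();                    // byte[4]
            long payloadStart = reader.BaseStream.Position;
            reader.BaseStream.Seek(n, SeekOrigin.Current); // skip byte[n]
            index.Add(new ItemIndex { Ticks = ticks, Position = payloadStart, Length = n });
        }
    }
    index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));      // in-memory sort by Ticks
    return index;
}
```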
Server Specs:
- 2× 2.6 GHz Intel Xeon (hex-core with HT; 24 threads)
- 32GB RAM
- 500GB RAID 1+0
- 2TB RAID 5
I've looked all over the internet and can only find examples where the "huge file" is 1 GB (which makes me chuckle).
Does anyone have any advice?
4 Answers
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs. So do the same thing you're doing right now, except read from memory instead of using a BinaryReader/seek/read. You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
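A minimal sketch of what that might look like with System.IO.MemoryMappedFiles, reusing the sorted ItemIndex list from the sketch in the question (names are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

// Same index-then-seek scheme, but payloads are read through a memory-mapped
// view instead of BinaryReader seeks; the OS pages in only the regions touched.
static void WriteSortedPayloads(string inputPath, string outputPath, List<ItemIndex> sortedIndex)
{
    using var mmf = MemoryMappedFile.CreateFromFile(
        inputPath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    using var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read); // 0 = map to EOF
    using var output = File.Create(outputPath);

    foreach (var item in sortedIndex)                             // already sorted by Ticks
    {
        var payload = new byte[item.Length];
        view.ReadArray(item.Position, payload, 0, item.Length);   // read byte[n] at its offset
        output.Write(payload, 0, payload.Length);
    }
}
```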
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
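For a 400 GB file this means an external merge sort: split the file into runs that fit in RAM, sort each run by Ticks, then merge the runs. A minimal sketch of the k-way merge phase (assumes .NET 6+ for PriorityQueue; run files use the same byte[8]byte[4]byte[n] layout):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// K-way merge of run files that are each already sorted by Ticks.
static void MergeRuns(string[] runPaths, string outputPath)
{
    var readers = runPaths.Select(p => new BinaryReader(File.OpenRead(p))).ToArray();
    var pending = new (long Ticks, byte[] Payload)[readers.Length];
    var heap = new PriorityQueue<int, long>();    // value: run index, priority: Ticks

    for (int i = 0; i < readers.Length; i++)      // prime with each run's first record
        if (TryReadRecord(readers[i], out pending[i]))
            heap.Enqueue(i, pending[i].Ticks);

    using var output = new BinaryWriter(File.Create(outputPath));
    while (heap.TryDequeue(out int i, out _))     // pop the globally smallest Ticks
    {
        output.Write(pending[i].Ticks);           // byte[8]
        output.Write(pending[i].Payload.Length);  // byte[4]
        output.Write(pending[i].Payload);         // byte[n]
        if (TryReadRecord(readers[i], out pending[i]))
            heap.Enqueue(i, pending[i].Ticks);    // refill from the same run
    }
    foreach (var r in readers) r.Dispose();
}

static bool TryReadRecord(BinaryReader r, out (long Ticks, byte[] Payload) rec)
{
    if (r.BaseStream.Position >= r.BaseStream.Length) { rec = default; return false; }
    long ticks = r.ReadInt64();     // byte[8]
    int n = r.ReadInt32();          // byte[4]
    rec = (ticks, r.ReadBytes(n));  // byte[n]
    return true;
}
```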
If you can learn Erlang or Go, they could be very powerful and scale extremely well, since you have 24 threads. Utilize async I/O and a merge sort.
And since you have 32 GB of RAM, try to load as much as you can into RAM, sort it there, then write back to disk.
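A minimal sketch of that load-sort-write pass, producing sorted run files that a final merge (as in the previous answer) can combine; the RAM budget and file names are illustrative:

```csharp
using System.Collections.Generic;
using System.IO;

// Read records until a rough RAM budget is filled, sort the batch by Ticks,
// and write it out as one sorted run file; repeat until EOF.
static List<string> WriteSortedRuns(string inputPath, long ramBudget = 24L << 30) // ~24 GB, illustrative
{
    var runPaths = new List<string>();
    using var reader = new BinaryReader(File.OpenRead(inputPath));
    long fileLength = reader.BaseStream.Length;

    while (reader.BaseStream.Position < fileLength)
    {
        var batch = new List<(long Ticks, byte[] Payload)>();
        long used = 0;
        while (used < ramBudget && reader.BaseStream.Position < fileLength)
        {
            long ticks = reader.ReadInt64();          // byte[8]
            int n = reader.ReadInt32();               // byte[4]
            batch.Add((ticks, reader.ReadBytes(n)));  // byte[n]
            used += 8 + 4 + n;
        }
        batch.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));

        string runPath = $"run_{runPaths.Count}.bin"; // illustrative naming
        using (var w = new BinaryWriter(File.Create(runPath)))
            foreach (var (ticks, payload) in batch)
            {
                w.Write(ticks);
                w.Write(payload.Length);
                w.Write(payload);
            }
        runPaths.Add(runPath);
    }
    return runPaths;
}
```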
I would do this in several passes. On the first pass, I would build a list of Ticks and distribute them evenly into many (hundreds of?) buckets; if you know ahead of time that the Ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into a few hundred separate files of about the same size (these much smaller files represent groups of Ticks in the order you want). Then I would sort each file separately in memory and concatenate the files.
It's somewhat similar to a hash sort (I think).
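A minimal sketch of the distribution pass (the bucket count, range handling, and names are illustrative; min/max Ticks come from the first pass, or from prior knowledge of the distribution):

```csharp
using System.IO;

// Route each record into one of `bucketCount` files by where its Ticks value
// falls between minTicks and maxTicks. Each bucket file can then be sorted
// in memory on its own, and the sorted buckets concatenated in bucket order.
static void DistributeIntoBuckets(string inputPath, long minTicks, long maxTicks, int bucketCount = 256)
{
    var writers = new BinaryWriter[bucketCount];
    for (int i = 0; i < bucketCount; i++)
        writers[i] = new BinaryWriter(File.Create($"bucket_{i:D3}.bin"));

    using (var reader = new BinaryReader(File.OpenRead(inputPath)))
    {
        long fileLength = reader.BaseStream.Length;
        decimal range = (decimal)maxTicks - minTicks + 1;   // decimal avoids long overflow
        while (reader.BaseStream.Position < fileLength)
        {
            long ticks = reader.ReadInt64();                // byte[8]
            int n = reader.ReadInt32();                     // byte[4]
            byte[] payload = reader.ReadBytes(n);           // byte[n]

            int bucket = (int)(((decimal)ticks - minTicks) * bucketCount / range);
            writers[bucket].Write(ticks);
            writers[bucket].Write(n);
            writers[bucket].Write(payload);
        }
    }
    foreach (var w in writers) w.Dispose();
}
```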