Binary search in a sorted (memory-mapped?) file in Java
I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a 500+ GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
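For reference, the kind of random-access line read described in the question (seek to a byte offset, back up to the nearest newline, read the line) can be sketched with java.io.RandomAccessFile. This is an illustrative example, not code from the thread, and it assumes single-byte '\n' line terminators and an ASCII-compatible encoding:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RandomAccessRead {
    /** Seek to pos, back up to the start of the containing line, and return that line. */
    static String lineAt(RandomAccessFile raf, long pos) throws IOException {
        // Walk backwards until we hit a '\n' or the start of the file.
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.read() == '\n') break;
            pos--;
        }
        raf.seek(pos);
        return raf.readLine(); // note: readLine() decodes each byte as ISO-8859-1
    }

    public static String demo() throws IOException {
        Path tmp = Files.createTempFile("sorted", ".txt");
        Files.write(tmp, "apple\nbanana\ncherry\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            return lineAt(raf, 8); // byte offset 8 falls inside "banana"
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Because readLine() decodes bytes as ISO-8859-1, UTF-8 data would need to be read as raw bytes and decoded explicitly.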
8 Answers
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going.

I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all. Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

To work with files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
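The method itself is missing from this copy of the answer. A sketch in the same spirit (binary search on byte offsets with RandomAccessFile, no memory mapping; my own reconstruction, assuming sorted '\n'-terminated ASCII lines) could look like:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PrefixScan {
    /** Offset of the start of the line containing byte position pos. */
    static long lineStart(RandomAccessFile raf, long pos) throws IOException {
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.read() == '\n') break;
            pos--;
        }
        return pos;
    }

    /** All lines in the sorted file that start with the given prefix. */
    static List<String> linesWithPrefix(Path file, String prefix) throws IOException {
        List<String> hits = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            // Binary search for the first line whose content is >= prefix.
            long lo = 0, hi = raf.length();
            while (lo < hi) {
                long mid = lineStart(raf, (lo + hi) / 2);
                raf.seek(mid);
                String line = raf.readLine();
                if (line != null && line.compareTo(prefix) < 0) {
                    lo = raf.getFilePointer(); // skip past this line
                } else {
                    hi = mid;
                }
            }
            // Scan forward collecting matches.
            raf.seek(lo);
            String line;
            while ((line = raf.readLine()) != null && line.startsWith(prefix)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static List<String> demo() throws IOException {
        Path tmp = Files.createTempFile("sorted", ".txt");
        Files.write(tmp, "ant\nape\nbat\nbee\ncat\n".getBytes(StandardCharsets.US_ASCII));
        try {
            return linesWithPrefix(tmp, "b");
        } finally {
            Files.delete(tmp);
        }
    }
}
```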
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should look similar to this:
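The code itself did not survive the page extraction; the following is my own sketch of such an external binary search, not the author's original. It reads a 64 KB window behind each probe to back up to a line start (matching the stated assumption that no line exceeds 64K) and returns the first line greater than or equal to the key:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExternalBinarySearch {
    /** First line >= key in a sorted file, or null if every line is smaller. */
    static String search(RandomAccessFile raf, String key) throws IOException {
        byte[] window = new byte[64 * 1024]; // assumes no line is longer than 64K
        long lo = 0, hi = raf.length();
        while (lo < hi) {
            long mid = (lo + hi) / 2;
            // Read a window ending at mid and back up to the last newline in it,
            // so that we land on a line start.
            long winStart = Math.max(0, mid - window.length);
            int len = (int) (mid - winStart);
            raf.seek(winStart);
            raf.readFully(window, 0, len);
            long lineStart = winStart;
            for (int i = len - 1; i >= 0; i--) {
                if (window[i] == '\n') { lineStart = winStart + i + 1; break; }
            }
            raf.seek(lineStart);
            String line = raf.readLine();
            if (line != null && line.compareTo(key) < 0) {
                lo = raf.getFilePointer(); // first candidate is after this line
            } else {
                hi = lineStart;
            }
        }
        raf.seek(lo);
        return raf.readLine(); // null when key is greater than every line
    }

    public static String demo() throws IOException {
        Path tmp = Files.createTempFile("sorted", ".txt");
        Files.write(tmp, "alpha\nbeta\ndelta\ngamma\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            return search(raf, "c"); // first line at or after "c"
        } finally {
            Files.delete(tmp);
        }
    }
}
```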
Please note: I made up this code ad-hoc: Corner cases are not tested nearly good enough, the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index because than there is no need to search for the next line in each step.
I know that was not the question, but building a prefix tree data structure like (Patrica) Tries (on disk/SSD) might be a good idea to do the prefix search.
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns).
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally, the code to read the string from a given indexed position:
If you are dealing with a 500 GB file, then you might want to use a faster lookup method than binary search, namely a radix sort, which is essentially a variant of hashing. The best method for doing this really depends on your data distribution and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.
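Applied to string prefixes, the bucket idea might look like this sketch (my own illustration, assuming '\n'-terminated ASCII lines): precompute, for each possible first byte, where its block of lines starts in the sorted file, then jump straight to that block and search only within it:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PrefixBuckets {
    /**
     * start[c] = byte offset of the first line beginning with byte value c,
     * with start[256] as an end sentinel. Lines with prefix p then live in
     * the range [start[p], start[p + 1]) of the sorted file.
     */
    static long[] bucketOffsets(List<String> sortedLines) {
        long[] start = new long[257];
        long pos = 0;
        int c = 0;
        for (String line : sortedLines) {
            int first = line.isEmpty() ? 0 : line.charAt(0);
            while (c <= first) start[c++] = pos; // empty buckets point at pos too
            pos += line.getBytes(StandardCharsets.US_ASCII).length + 1; // +1 for '\n'
        }
        while (c <= 256) start[c++] = pos;
        return start;
    }

    public static long[] demo() {
        // "bat" and "bee" occupy bytes [4, 12) of "ant\nbat\nbee\ncat\n".
        return bucketOffsets(List.of("ant", "bat", "bee", "cat"));
    }
}
```

With 500 GB of data one would build this table (or a two-byte variant with 65536 buckets) in a single sequential pass and store it alongside the file.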
I posted a gist https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c
that is a rather complete example based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.
I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:
https://github.com/avast/BigMap
It contains utilities for sorting a huge file and doing binary search in the sorted file...
If you truly want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java NIO.