What is the fastest way to scan a very large file in Java?
Imagine I have a very large text file, and performance really matters.
All I want to do is scan it for a certain string. Maybe I want to count how many occurrences there are, but that is not really the point.
The point is: what is the fastest way to do it?
I don't care about maintainability; it just needs to be fast.
For a one-off search, use a Scanner, as suggested here.
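A minimal sketch of what the Scanner approach might look like, assuming the file is UTF-8 text and the search string should be matched literally. The class and method names (ScannerSearch, count) are made up for illustration. Note that with an unbounded horizon, Scanner may buffer a lot of input when matches are rare, so this is the convenient option rather than necessarily the fast one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class ScannerSearch {

    /** Counts non-overlapping, literal occurrences of needle in the file. */
    static long count(Path file, String needle) throws IOException {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        // Pattern.quote makes the needle a literal, not a regular expression.
        Pattern literal = Pattern.compile(Pattern.quote(needle));
        long hits = 0;
        // Assumption: the file is UTF-8 encoded.
        try (Scanner scanner = new Scanner(file, StandardCharsets.UTF_8.name())) {
            // A horizon of 0 means "no bound": keep searching to the end of input.
            while (scanner.findWithinHorizon(literal, 0) != null) {
                hits++;
            }
        }
        return hits;
    }
}
```

It could be called as ScannerSearch.count(Paths.get("big.log"), "ERROR"), where big.log is a placeholder file name.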
First of all, use NIO (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm such as Boyer-Moore.
If you need to search the same file multiple times for different strings, you'll want to build some kind of index, so take a look at Lucene.
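As a rough illustration of combining the two suggestions, the sketch below reads the file through a FileChannel in fixed-size chunks and searches each chunk with Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only the bad-character table. It is not the full Boyer-Moore algorithm. The class name, method name, and buffer-size parameter are assumptions for illustration, and matching is done on raw bytes, which assumes the search string is encoded the same way as the file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class ChannelHorspool {

    /** Counts occurrences (overlapping ones included) of needle in the file. */
    static long count(Path file, byte[] needle, int bufferSize) throws IOException {
        int m = needle.length;
        if (m == 0 || bufferSize < m) {
            throw new IllegalArgumentException("need 0 < needle.length <= bufferSize");
        }
        // Horspool table: shift distance keyed by the last byte of the current window.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[needle[i] & 0xFF] = m - 1 - i;
        }

        byte[] buf = new byte[bufferSize];
        ByteBuffer view = ByteBuffer.wrap(buf); // shares storage with buf
        int filled = 0;                         // number of valid bytes in buf
        long hits = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (true) {
                view.position(filled);          // append after the carried-over tail
                int n = ch.read(view);
                if (n < 0) {
                    break;                      // end of file
                }
                filled += n;
                int pos = 0;
                while (pos + m <= filled) {
                    int j = m - 1;
                    while (j >= 0 && buf[pos + j] == needle[j]) {
                        j--;                    // compare right to left
                    }
                    if (j < 0) {
                        hits++;
                    }
                    pos += shift[buf[pos + m - 1] & 0xFF];
                }
                // Keep the unscanned tail (shorter than the needle) for the next chunk,
                // so matches that straddle a chunk boundary are not lost.
                filled -= pos;
                System.arraycopy(buf, pos, buf, 0, filled);
            }
        }
        return hits;
    }
}
```

A buffer size in the range of tens of kilobytes to a few megabytes is a common starting point; the best value depends on the machine and is worth measuring.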
Load the whole file into memory and then use a string searching algorithm such as Knuth-Morris-Pratt.
Edit:
A quick Google search turns up this string searching library, which seems to implement a few different string search algorithms. Note that I've never used it, so I can't vouch for it.
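For reference, a hand-written Knuth-Morris-Pratt search over an in-memory byte array might look roughly like this. It is a sketch, not the library mentioned above. The class and method names are illustrative, UTF-8 is an assumption, and Files.readAllBytes only works for files that fit in a single array (under about 2 GiB) and in the heap.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class KmpSearch {

    /** fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix. */
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        int k = 0;
        for (int i = 1; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) {
                k = fail[k - 1];
            }
            if (p[i] == p[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    /** Counts occurrences of pattern in text, overlapping ones included. */
    static int count(byte[] text, byte[] pattern) {
        if (pattern.length == 0) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        int[] fail = failureTable(pattern);
        int hits = 0;
        int k = 0; // number of pattern bytes currently matched
        for (byte b : text) {
            while (k > 0 && b != pattern[k]) {
                k = fail[k - 1]; // fall back without re-reading the text
            }
            if (b == pattern[k]) {
                k++;
            }
            if (k == pattern.length) {
                hits++;
                k = fail[k - 1]; // allow overlapping matches
            }
        }
        return hits;
    }

    /** Loads the whole file into memory, then searches it. Assumes UTF-8. */
    static int countInFile(Path file, String needle) throws IOException {
        return count(Files.readAllBytes(file), needle.getBytes(StandardCharsets.UTF_8));
    }
}
```

KMP never moves backwards in the text, which is why the same matching loop also works on a stream; here it is applied to a fully loaded array as the answer suggests.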
Whatever the specifics, memory-mapped I/O is usually the answer.
Edit: depending on your requirements, you could try importing the file into an SQL database and then querying it through JDBC.
Edit2: this thread at JavaRanch has some other ideas involving FileChannel. I think it might be exactly what you are looking for.
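To make the memory-mapped suggestion concrete, here is a small sketch that maps the file read-only and scans it with a plain (naive) byte comparison. The names are illustrative. A single FileChannel.map call is limited to Integer.MAX_VALUE bytes, so this version simply rejects larger files; a real implementation would map them in overlapping windows.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, not part of any library.
final class MappedSearch {

    /** Counts occurrences (overlapping ones included) of needle in a file under 2 GiB. */
    static long count(Path file, byte[] needle) throws IOException {
        if (needle.length == 0) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE) {
                // Simplification for this sketch: larger files need several mappings.
                throw new IllegalArgumentException("file too large for a single mapping");
            }
            // The OS pages the file in on demand; nothing is copied into the Java heap.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int lastStart = (int) size - needle.length;
            long hits = 0;
            for (int i = 0; i <= lastStart; i++) {
                int j = 0;
                while (j < needle.length && map.get(i + j) == needle[j]) {
                    j++; // absolute get: does not move the buffer position
                }
                if (j == needle.length) {
                    hits++;
                }
            }
            return hits;
        }
    }
}
```

The naive inner loop is used only to keep the example short; it could be replaced by any of the faster matching algorithms mentioned in the other answers.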
I'd say the fastest you can get is to use a BufferedInputStream on top of a FileInputStream, or a custom buffer if you want to avoid instantiating the BufferedInputStream.
This explains it better than I can: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
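The custom-buffer variant could be sketched as follows: read large chunks straight from a FileInputStream into a byte array and carry the last few bytes over to the next chunk so that matches spanning two reads are still found. Reading in big chunks like this makes a BufferedInputStream unnecessary, since its benefit comes mainly from turning many small reads into fewer large ones. The class name, method name, and chunk-size parameter are assumptions for illustration, and the comparison is a simple naive one.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class ChunkedStreamSearch {

    /** Counts occurrences (overlapping ones included) of needle, reading chunkSize bytes at a time. */
    static long count(Path file, byte[] needle, int chunkSize) throws IOException {
        int m = needle.length;
        if (m == 0 || chunkSize < 1) {
            throw new IllegalArgumentException("needle and chunkSize must be non-empty");
        }
        // Extra m - 1 bytes of room for the tail carried over from the previous read.
        byte[] buf = new byte[chunkSize + m - 1];
        int carried = 0;
        long hits = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            int n;
            while ((n = in.read(buf, carried, buf.length - carried)) != -1) {
                int filled = carried + n;
                int i = 0;
                for (; i + m <= filled; i++) {
                    int j = 0;
                    while (j < m && buf[i + j] == needle[j]) {
                        j++;
                    }
                    if (j == m) {
                        hits++;
                    }
                }
                // Bytes from i onward have not yet started a full window; keep them.
                carried = filled - i;
                System.arraycopy(buf, i, buf, 0, carried);
            }
        }
        return hits;
    }
}
```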
Use the right tool: a full-text search library
My suggestion is to build an in-memory index (or a file-based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.
I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file searching tool? Maybe you could call it using Runtime.getRuntime().exec(..)
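A possible way to shell out to grep is sketched below. It uses ProcessBuilder instead of Runtime.getRuntime().exec(..) because passing the arguments as a list avoids quoting problems; that substitution is a choice made for this example. It assumes a grep that supports -o and -F is available on the PATH, and the class and method names are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class GrepSearch {

    /** Counts the matches grep reports for a literal needle (non-overlapping within a line). */
    static long count(Path file, String needle) throws IOException, InterruptedException {
        // -o prints each match on its own line, -F treats the pattern as a fixed string,
        // -e protects needles that start with '-', and -- ends option parsing.
        Process process = new ProcessBuilder(
                "grep", "-o", "-F", "-e", needle, "--", file.toString())
                .redirectErrorStream(true)
                .start();
        long lines = 0;
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            while (out.readLine() != null) {
                lines++; // one output line per match
            }
        }
        int exit = process.waitFor();
        // grep exits with 0 when something matched, 1 when nothing matched, above 1 on error.
        if (exit > 1) {
            throw new IOException("grep failed with exit code " + exit);
        }
        return lines;
    }
}
```

Whether this beats an in-process search depends on the platform and on process start-up cost, so it is worth measuring against the pure Java options.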
It depends on whether you need to do more than one search per file. If you need just one search, read the file in from disk and scan it using the tools suggested by Michael Borgwardt. If you need more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, and put the tokens in the index. If the index is small enough, keep it in RAM (Lucene offers both RAM-backed and disk-backed indexes). If not, keep it on disk. And if it is too large for RAM and you are very, very concerned about speed, store the index on a solid-state/flash drive.
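The read, tokenise, and index steps can be illustrated without Lucene by a toy in-memory inverted index in plain Java. This is only a stand-in to show the idea; it is not Lucene's API and has none of its features. The class name, the lower-casing, and the split on non-word characters are assumptions made for this sketch, and it answers whole-word lookups only, not arbitrary substring searches.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index, a stand-in for a real library such as Lucene.
final class TinyIndex {

    // token -> ascending list of line numbers (1-based) that contain it
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Reads the file once and indexes every line. Assumes UTF-8 text. */
    static TinyIndex build(Path file) throws IOException {
        TinyIndex index = new TinyIndex();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                index.addLine(++lineNo, line);
            }
        }
        return index;
    }

    /** Tokenises one line and records which tokens occur on it. */
    void addLine(int lineNo, String line) {
        for (String token : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> lines = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Record each line at most once per token.
            if (lines.isEmpty() || lines.get(lines.size() - 1) != lineNo) {
                lines.add(lineNo);
            }
        }
    }

    /** Each lookup is a hash-map access, so repeated searches do not rescan the file. */
    List<Integer> linesContaining(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

The cost of building the index is paid once, which is why this only makes sense when the same file is searched many times.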