An efficient way to process large text files in Java
I'm building a frequency dictionary, for which I read 1000 files, each with about 1000 lines. The approach I'm following is:
- BufferedReader to read file by file
- read the first file, get the first sentence, split the sentence into a string array, then fill a HashMap with the values from the string array
- do this for all the sentences in that file
- do this for all 1000 files
My problem is that this is not a very efficient way to do it; it takes me about 4 minutes to do all of this. I've increased the heap size and refactored the code to make sure I'm not doing something wrong. With this approach, I'm completely sure there's nothing I can improve in the code.
My bet is that each time a sentence is read, a split is applied, which, multiplied by 1000 sentences per file and by 1000 files, is a huge amount of splits to process.
My idea is that, instead of reading and processing file by file, I could read each file into a char array and then do the split only once per file. That would reduce the processing time consumed by the split. Any implementation suggestions would be appreciated.
6 Answers
OK, I have just implemented a POC of your dictionary. Fast and dirty. My files contained 868 lines each, but I created 1024 copies of the same file. (It is the table of contents of the Spring Framework documentation.)
I ran my test and it took 14020 ms (14 seconds!). BTW, I ran it from Eclipse, which could decrease the speed a little bit.
So, I do not know where your problem is. Please try my code on your machine, and if it runs faster, try to compare it with your code to understand where the root problem is.
Anyway, my code is not the fastest I could write.
I could create the Pattern before the loop and use it instead of String.split(). String.split() calls Pattern.compile() every time, and creating a pattern is very expensive.
Here is the code:
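(The original listing did not survive in this copy of the answer; the block below is only a hedged reconstruction of the POC as described: a BufferedReader per file, String.split() on every line, and counts collected in a HashMap. The data directory and the copy0.txt … copy1023.txt file names are placeholders.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FrequencyDictionaryPoc {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> frequencies = new HashMap<>();
        long start = System.currentTimeMillis();
        // 1024 copies of the same file; the names "data/copy<i>.txt" are placeholders.
        for (int i = 0; i < 1024; i++) {
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("data", "copy" + i + ".txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Fast and dirty: split each line on whitespace and count the tokens.
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            frequencies.merge(word, 1, Integer::sum);
                        }
                    }
                }
            }
        }
        System.out.println(frequencies.size() + " distinct words in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
```

The Pattern optimization mentioned above would amount to compiling `Pattern.compile("\\s+")` once before the loops and calling its `split(line)` method instead of `line.split("\\s+")`.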
If you don't care that the contents are in different files, I would take the approach you are recommending: read all files and all lines into memory (a string, char array, whatever) and then do one split and populate the hash based on that one string/dataset.
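A minimal sketch of that "read everything, split once" idea, assuming whitespace-separated words, UTF-8 files, and a placeholder corpus directory:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SingleSplitCounter {
    public static void main(String[] args) throws IOException {
        StringBuilder all = new StringBuilder();
        // Concatenate every file into one big in-memory string ("corpus" is a placeholder).
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("corpus"))) {
            for (Path file : files) {
                all.append(new String(Files.readAllBytes(file), StandardCharsets.UTF_8)).append('\n');
            }
        }
        // One split over the whole dataset, then populate the frequency map.
        Map<String, Integer> frequencies = new HashMap<>();
        for (String word : all.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        System.out.println(frequencies.size() + " distinct words");
    }
}
```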
If I understand what you're doing, I don't think you want to use strings except when you access your map.
You want to:
- loop through the files
- read each file into a buffer of something like 1024 characters
- process the buffer looking for word-end characters
- create a String from the character array
- check your map
- if the word is found, update your count; if not, create a new entry
- when you reach the end of the buffer, get the next buffer from the file
- at the end, loop to the next file
Split is probably pretty expensive since it has to interpret the expression each time.
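A hedged sketch of those steps (not the original poster's code), assuming a 1024-character buffer, letters and digits as word characters, and placeholder file paths:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferedWordCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Placeholder file list; in practice this would be the 1000 input files.
        List<Path> files = Arrays.asList(Paths.get("data", "file1.txt"), Paths.get("data", "file2.txt"));
        char[] buffer = new char[1024];
        StringBuilder word = new StringBuilder();
        for (Path file : files) {                              // loop through the files
            try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                int read;
                while ((read = reader.read(buffer)) != -1) {   // get the next 1024-char buffer
                    for (int i = 0; i < read; i++) {           // scan for word-end characters
                        char c = buffer[i];
                        if (Character.isLetterOrDigit(c)) {
                            word.append(c);
                        } else if (word.length() > 0) {
                            // Create a String only when a whole word has been collected.
                            counts.merge(word.toString(), 1, Integer::sum);
                            word.setLength(0);
                        }
                    }
                }
                if (word.length() > 0) {                       // flush a word ending at EOF
                    counts.merge(word.toString(), 1, Integer::sum);
                    word.setLength(0);
                }
            }
        }
        System.out.println(counts.size() + " distinct words");
    }
}
```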
Reading the file as one big string and then splitting that sounds like a good idea. String splitting/modifying can be surprisingly 'heavy' when it comes to garbage collection. Multiple lines/sentences mean multiple Strings, and with all the splits that means a huge amount of Strings (Strings are immutable, so any change to them actually creates a new String or multiple Strings)... this produces a lot of garbage to be collected, and the garbage collection can become a bottleneck (with a smaller heap, the maximum amount of memory is reached all the time, kicking off a garbage collection that potentially needs to clean up hundreds of thousands or millions of separate String objects).
Of course, without knowing your code this is just a wild guess, but back in the day I got an old command-line Java program's running time (it was a graph algorithm producing a huge SVG file) to drop from about 18 seconds to less than 0.5 seconds just by modifying the string handling to use StringBuffers/Builders.
Another thing that springs to mind is using multiple threads (or a thread pool) to handle different files concurrently and then combining the results at the end. Once you get the program to run "as fast as possible", the remaining bottleneck will be disk access, and the only way (afaik) to get past that is faster disks (SSDs, etc.).
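As one possible illustration of the thread-pool idea (not this answer's code; the file names and the whitespace tokenization are assumptions), each file could be counted in its own task, with the per-file maps merged at the end:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCounter {
    public static void main(String[] args) throws Exception {
        // Placeholder file list; in practice this would be the 1000 input files.
        List<Path> files = Arrays.asList(Paths.get("data", "a.txt"), Paths.get("data", "b.txt"));
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (Path file : files) {
            Callable<Map<String, Integer>> task = () -> countFile(file);  // one task per file
            futures.add(pool.submit(task));
        }
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> future : futures) {
            // Merge each per-file result into the combined frequency map.
            future.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        pool.shutdown();
        System.out.println(total.size() + " distinct words");
    }

    private static Map<String, Integer> countFile(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Keeping a local map per task and merging at the end avoids contention on a single shared map while the files are being read.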
Since you're using a BufferedReader, why do you need to read in a whole file explicitly? I definitely wouldn't use split if you're after speed; remember, it has to evaluate a regular expression every time you run it.
Try something like this for your inner loop (note, I have not compiled this or tried to run it):
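(The snippet itself is missing from this copy of the answer; a hedged reconstruction of such an inner loop, reading the BufferedReader character by character instead of splitting, might look like the following. The helper name, and treating letters/digits as word characters, are assumptions.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Map;

class InnerLoop {
    // Hypothetical helper: counts words from an already-open reader into an existing map.
    static void countWords(BufferedReader reader, Map<String, Integer> map) throws IOException {
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {       // read character by character, no split
            if (Character.isLetterOrDigit(c)) {
                current.append((char) c);         // still inside a word
            } else if (current.length() > 0) {    // hit a delimiter: the word is complete
                map.merge(current.toString(), 1, Integer::sum);
                current.setLength(0);
            }
        }
        if (current.length() > 0) {               // flush the last word in the file
            map.merge(current.toString(), 1, Integer::sum);
        }
    }
}
```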
You could try using different sized buffers explicitly, but you probably won't get a performance improvement over this.
One very simple approach which uses minimal heap space and should be (almost) as fast as anything else would look something like this:
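(The snippet referred to here is also missing from this copy; the sketch below is only a guess at such a minimal-allocation version, reusing one small buffer and one StringBuilder and treating only spaces and line breaks as separators. The file path is a placeholder.)

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LowHeapCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        Path file = Paths.get("data", "file1.txt");   // placeholder path
        char[] buf = new char[8192];                  // one small buffer, reused for the whole file
        StringBuilder word = new StringBuilder();     // reused for every word
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buf[i];
                    if (c == ' ' || c == '\n' || c == '\r') {   // separator characters
                        if (word.length() > 0) {
                            counts.merge(word.toString(), 1, Integer::sum);
                            word.setLength(0);
                        }
                    } else {
                        word.append(c);
                    }
                }
            }
            if (word.length() > 0) {                  // flush the last word
                counts.merge(word.toString(), 1, Integer::sum);
            }
        }
        System.out.println(counts.size() + " distinct words");
    }
}
```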
Extend it for more separator characters as needed, and possibly use multi-threading to process multiple files concurrently until disk I/O becomes the bottleneck...