How can I speed up my Java text file parser?
I am reading about 600 text files, parsing each file individually, and adding all the terms to a map so I can know the frequency of each word across the 600 files (about 400 MB of text in total).
My parser function includes the following steps (in order):
- Find the text between two tags, which is the relevant text to read in each file.
- Lowercase all the text.
- Split it with String.split() using multiple delimiters.
- Create an ArrayList with words like "aaa-aa", add those to the String[] split above, and remove the separate "aaa" and "aa" entries. (I did this because I wanted "-" to be a delimiter, but I also wanted "aaa-aa" to be one word, not "aaa" and "aa".)
- Take the String[] and map it into a Map = new HashMap ... (word, frequency).
- Print everything.
It takes about 8 min 48 s on a dual-core 2.2 GHz machine with 2 GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And, if possible, how can I know (in NetBeans) which functions take the most time to execute?
Unique words found: 398752.
CODE:
File file = new File(dir);
String[] files = file.list();
for (int i = 0; i < files.length; i++) {
    BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    new BufferedInputStream(
                            new FileInputStream(dir + files[i])), encoding));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            String[] parsedString = parseString(line); // parse the line into words
            m = stringToMap(parsedString, m);          // m is the word-frequency map
        }
    } finally {
        br.close();
    }
}
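stringToMap isn't shown in the question; only its signature is implied by the loop above. Assuming it simply increments a per-word counter (the method body below is a guess), it might look like:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Hypothetical body for the question's stringToMap: increment the
    // count of each word; the map is both input and return value, as in
    // the question's loop "m = stringToMap(parsedString, m)".
    static Map<String, Integer> stringToMap(String[] words, Map<String, Integer> m) {
        for (String word : words) {
            Integer count = m.get(word);
            m.put(word, count == null ? 1 : count + 1);
        }
        return m;
    }
}
```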
EDIT: See the attached profiler memory screenshot. I don't know what to conclude from it.
EDIT: 80% of the time is spent in this function:
public String[] parseString(String sentence) {
    // separators: ,:;'"\/<>()[]*~^ºª+&%$ etc.
    String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");
    // save the hyphenated words, aaa-bbb, as Map<aaa, bbb>
    Map<String, String> o = new HashMap<String, String>();
    Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
    Matcher matcher = pattern.matcher(sentence);
    // Find all matches like "aaa-bb" or "bbb-cc" and put them in the map,
    // so these words can later be added back whole to the original list
    // and their split halves ("aaa" and "aa") removed from it
    while (matcher.find()) {
        String[] tempo = matcher.group().split("-");
        o.put(tempo[0], tempo[1]);
    }
    ArrayList<String> temp = new ArrayList<String>();
    temp.addAll(Arrays.asList(parts));
    for (Map.Entry<String, String> entry : o.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue();
        temp.add(key + "-" + value);
        if (temp.indexOf(key) != -1) {
            temp.remove(temp.indexOf(key));
        }
        if (temp.indexOf(value) != -1) {
            temp.remove(temp.indexOf(value));
        }
    }
    String[] strArray = new String[temp.size()];
    temp.toArray(strArray);
    return strArray;
}
600 files, each about 0.5 MB.
EDIT 3: The pattern is no longer compiled each time a line is read. New profiler screenshots are attached.
8 Answers
Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.
Update after memory screenshot
Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.
Update 2 - after code added to question.
Yup - two patterns compiled on every line - the explicit one, and also the "-" in the split (much cheaper, of course). I wish they hadn't added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
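In code, the suggestion amounts to hoisting both compilations out of the method. The class and field names below are illustrative, and the simplified patterns stand in for the question's full character classes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
    // Compiled once when the class loads, not once per parseString() call.
    // (Simplified stand-in for the question's hyphenated-word pattern.)
    private static final Pattern HYPHEN_WORD = Pattern.compile("[a-z]+-[a-z]+");
    // String.split("-") also compiles a pattern on every call;
    // precompiling and using Pattern.split() avoids that too.
    private static final Pattern HYPHEN = Pattern.compile("-");

    // Find the first hyphenated word and split it into its two halves.
    static String[] splitFirstHyphenWord(String sentence) {
        Matcher matcher = HYPHEN_WORD.matcher(sentence);
        if (matcher.find()) {
            return HYPHEN.split(matcher.group());
        }
        return new String[0];
    }
}
```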
Try to use a single regex that has a group matching each word within the tags - so one regex could be applied to your entire input and there would be no separate "split" stage.
Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize the creation of objects, for both construction cost and garbage collection cost.
Is it just the parsing that's taking so long, or is it the file reading as well?
For the file reading, you can probably speed that up by reading the files on multiple threads. But first step is to figure out whether it's the reading or the parsing that's taking all the time so you can address the right issue.
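A sketch of the threaded approach, assuming each file can be parsed independently and the per-file counts merged at the end. All names are illustrative, and parsing is reduced to whitespace splitting over in-memory strings to keep the sketch self-contained; the real version would submit one file-reading task per file:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCount {
    // Count words in one "file" (here just a string for the sketch).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : text.split("\\s+")) {
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);
        }
        return counts;
    }

    // Submit one counting task per input, then merge the results.
    static Map<String, Integer> countAll(List<String> texts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, Integer>>> futures =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String text : texts) {
            futures.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    return countWords(text);
                }
            }));
        }
        // Merge the per-file maps into one total map.
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> f : futures) {
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                Integer c = total.get(e.getKey());
                total.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        pool.shutdown();
        return total;
    }
}
```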
Run the code through the NetBeans profiler and find out where it spends the most time (right-click the project and select Profile; make sure you profile time, not memory).
Nothing in the code that you have shown us is an obvious source of performance problems. The problem is likely to be something to do with the way that you are parsing the lines or extracting the words and putting them into the map. If you want more advice you need to post the code for those methods, and the code that declares / initializes the map.
My general advice would be to profile the application and see where the bottlenecks are, and use that information to figure out what needs to be optimized.
@Ed Staub's advice is also sound. Running an application with a heap that is too small can result in serious performance problems.
If you aren't already doing it, use a BufferedInputStream together with a BufferedReader to read the files. Double-buffering like that is measurably better than using BufferedInputStream or BufferedReader alone.
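For example, the double-buffered setup being described is the same construct the question already uses; as a self-contained sketch (path and encoding are parameters here):

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class DoubleBuffered {
    // BufferedInputStream buffers raw bytes; BufferedReader buffers the
    // decoded characters on top of it.
    static String readFirstLine(String path, String encoding) throws IOException {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(
                                new FileInputStream(path)), encoding));
        try {
            return br.readLine();
        } finally {
            br.close();
        }
    }
}
```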
If you post relevant parts of your code, there'd be a chance we could comment on how to improve the processing.
EDIT:
Based on your edit, here are a couple of suggestions:
- Move the pattern compilation out of parseString so the pattern isn't compiled on every call.
- Store the values of temp.indexOf(key) and temp.indexOf(value) when you first call them, then use the stored values instead of calling indexOf a second time.
It looks like it's spending most of its time in regular expressions. I would first try writing the code without using regular expressions, and then use multiple threads if the process still appears to be CPU-bound.
For the counter, I would look at using TObjectIntHashMap to reduce the overhead of the counters. I would use only one map, rather than creating an array of string counts that is then used to build another map - that could be a significant waste of time.
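A sketch of both ideas combined: tokenize by scanning characters instead of using a regex, and update a single count map directly. A plain HashMap stands in for Trove's TObjectIntHashMap to keep the sketch dependency-free, and the word-character test is a simplification of the question's full separator set:

```java
import java.util.HashMap;
import java.util.Map;

public class NoRegexCounter {
    // True for characters that belong inside a word; '-' is kept so that
    // "aaa-aa" stays one token, as the question requires. (Simplified:
    // a stray leading/trailing '-' would be kept too.)
    static boolean isWordChar(char c) {
        return Character.isLetter(c) || c == '-';
    }

    // Scan the line once, adding each token straight into the counts map -
    // no String.split(), no intermediate String[] or ArrayList.
    static void countLine(String line, Map<String, Integer> counts) {
        int start = -1;
        for (int i = 0; i <= line.length(); i++) {
            boolean inWord = i < line.length() && isWordChar(line.charAt(i));
            if (inWord && start < 0) {
                start = i; // token begins
            } else if (!inWord && start >= 0) {
                String word = line.substring(start, i).toLowerCase();
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
                start = -1; // token ends
            }
        }
    }
}
```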
Precompile the pattern instead of compiling it every time through that method, and get rid of the double buffering: use new BufferedReader(new FileReader(...)).
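Concretely, that looks like the following sketch. The separator set is abbreviated, and note that FileReader uses the platform default encoding, unlike the question's explicit-encoding InputStreamReader:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class SimpleReader {
    // Compiled once, not on every call (abbreviated separator set).
    private static final Pattern SEPARATORS = Pattern.compile("[,\\s;:]+");

    static String[] firstLineTokens(String path) throws IOException {
        // A single BufferedReader over a FileReader; FileReader uses
        // the platform default charset.
        BufferedReader br = new BufferedReader(new FileReader(path));
        try {
            return SEPARATORS.split(br.readLine());
        } finally {
            br.close();
        }
    }
}
```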