How can I speed up my Java text file parser?
I am reading about 600 text files, parsing each file individually, and adding all the terms to a map so I can know the frequency of each word across the 600 files (about 400 MB of text in total).
My parser function includes the following steps (in order):
- Find the text between two tags, which is the relevant text to read in each file.
- Lowercase all the text.
- Split it with String.split() using multiple delimiters.
- Create an ArrayList with words like "aaa-aa", add those to the String[] split above, and remove the separate "aaa" and "aa" entries. (I did this because I wanted "-" to be a delimiter, but I also wanted "aaa-aa" to be one word, not "aaa" and "aa".)
- Take the String[] and map it into a Map = new HashMap ... (word, frequency).
- Print everything.
It takes about 8 min 48 s on a dual-core 2.2 GHz machine with 2 GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And, if possible, how can I know (in NetBeans) which functions take the most time to execute?
Unique words found: 398752.
CODE:
File file = new File(dir);
String[] files = file.list();
for (int i = 0; i < files.length; i++) {
    BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    new BufferedInputStream(
                            new FileInputStream(dir + files[i])), encoding));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            String[] parsedString = parseString(line); // parse the line into words
            m = stringToMap(parsedString, m);          // m is the word-frequency map
        }
    } finally {
        br.close();
    }
}
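stringToMap isn't shown in the question; only its signature is implied by the loop above. Assuming it simply increments a per-word counter (the method body below is a guess), it might look like:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Hypothetical body for the question's stringToMap: increment the
    // count of each word; the map is both input and return value, as in
    // the question's loop "m = stringToMap(parsedString, m)".
    static Map<String, Integer> stringToMap(String[] words, Map<String, Integer> m) {
        for (String word : words) {
            Integer count = m.get(word);
            m.put(word, count == null ? 1 : count + 1);
        }
        return m;
    }
}
```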
EDIT: See the attached profiler memory screenshot. I don't know what to conclude from it.
EDIT: 80% of the time is spent in this function:
public String[] parseString(String sentence) {
    // separators: ,:;'"\/<>()[]*~^ºª+&%$ etc.
    String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");
    // save the hyphenated words, aaa-bbb, as Map<aaa, bbb>
    Map<String, String> o = new HashMap<String, String>();
    Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
    Matcher matcher = pattern.matcher(sentence);
    // Find all matches like "aaa-bb" or "bbb-cc" and put them in the map,
    // so these words can later be added back whole to the original list
    // and their split halves ("aaa" and "aa") removed from it
    while (matcher.find()) {
        String[] tempo = matcher.group().split("-");
        o.put(tempo[0], tempo[1]);
    }
    ArrayList<String> temp = new ArrayList<String>();
    temp.addAll(Arrays.asList(parts));
    for (Map.Entry<String, String> entry : o.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue();
        temp.add(key + "-" + value);
        if (temp.indexOf(key) != -1) {
            temp.remove(temp.indexOf(key));
        }
        if (temp.indexOf(value) != -1) {
            temp.remove(temp.indexOf(value));
        }
    }
    String[] strArray = new String[temp.size()];
    temp.toArray(strArray);
    return strArray;
}
600 files, each about 0.5 MB.
EDIT 3: The pattern is no longer compiled each time a line is read. New profiler screenshots are attached.
8 Answers
Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.
Update after memory screenshot
Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.
Update 2 - after code added to question.
Yup - two patterns compiled on every line - the explicit one, and also the "-" in the split (much cheaper, of course). I wish they hadn't added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
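In code, the suggestion amounts to hoisting both compilations out of the method. The class and field names below are illustrative, and the simplified patterns stand in for the question's full character classes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
    // Compiled once when the class loads, not once per parseString() call.
    // (Simplified stand-in for the question's hyphenated-word pattern.)
    private static final Pattern HYPHEN_WORD = Pattern.compile("[a-z]+-[a-z]+");
    // String.split("-") also compiles a pattern on every call;
    // precompiling and using Pattern.split() avoids that too.
    private static final Pattern HYPHEN = Pattern.compile("-");

    // Find the first hyphenated word and split it into its two halves.
    static String[] splitFirstHyphenWord(String sentence) {
        Matcher matcher = HYPHEN_WORD.matcher(sentence);
        if (matcher.find()) {
            return HYPHEN.split(matcher.group());
        }
        return new String[0];
    }
}
```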
Try to use a single regex that has a group matching each word within the tags - so one regex could be applied to your entire input and there would be no separate "split" stage.
Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize the creation of objects, for both construction cost and garbage collection cost.
Is it just the parsing that's taking so long, or is it the file reading as well?
For the file reading, you can probably speed that up by reading the files on multiple threads. But first step is to figure out whether it's the reading or the parsing that's taking all the time so you can address the right issue.
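A sketch of the threaded approach, assuming each file can be parsed independently and the per-file counts merged at the end. All names are illustrative, and parsing is reduced to whitespace splitting over in-memory strings to keep the sketch self-contained; the real version would submit one file-reading task per file:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCount {
    // Count words in one "file" (here just a string for the sketch).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : text.split("\\s+")) {
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);
        }
        return counts;
    }

    // Submit one counting task per input, then merge the results.
    static Map<String, Integer> countAll(List<String> texts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, Integer>>> futures =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String text : texts) {
            futures.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    return countWords(text);
                }
            }));
        }
        // Merge the per-file maps into one total map.
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> f : futures) {
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                Integer c = total.get(e.getKey());
                total.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        pool.shutdown();
        return total;
    }
}
```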
Run the code through the NetBeans profiler and find out where it spends the most time (right-click the project and select Profile; make sure you profile time, not memory).
Nothing in the code that you have shown us is an obvious source of performance problems. The problem is likely to be something to do with the way that you are parsing the lines or extracting the words and putting them into the map. If you want more advice you need to post the code for those methods, and the code that declares / initializes the map.
My general advice would be to profile the application and see where the bottlenecks are, and use that information to figure out what needs to be optimized.
@Ed Staub's advice is also sound. Running an application with a heap that is too small can result in serious performance problems.
If you aren't already doing it, use a BufferedInputStream together with a BufferedReader to read the files. Double-buffering like that is measurably better than using BufferedInputStream or BufferedReader alone.
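For example, the double-buffered setup being described is the same construct the question already uses; as a self-contained sketch (path and encoding are parameters here):

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class DoubleBuffered {
    // BufferedInputStream buffers raw bytes; BufferedReader buffers the
    // decoded characters on top of it.
    static String readFirstLine(String path, String encoding) throws IOException {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(
                                new FileInputStream(path)), encoding));
        try {
            return br.readLine();
        } finally {
            br.close();
        }
    }
}
```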
If you post relevant parts of your code, there'd be a chance we could comment on how to improve the processing.
EDIT:
Based on your edit, here are a couple of suggestions:
- Move the pattern compilation out of parseString so the pattern isn't compiled on every call.
- Store the values of temp.indexOf(key) and temp.indexOf(value) when you first call them, then use the stored values instead of calling indexOf a second time.
It looks like it's spending most of its time in regular expressions. I would first try writing the code without using regular expressions, and then use multiple threads if the process still appears to be CPU-bound.
For the counter, I would look at using TObjectIntHashMap to reduce the overhead of the counters. I would use only one map, rather than creating an array of string counts that is then used to build another map - that could be a significant waste of time.
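A sketch of both ideas combined: tokenize by scanning characters instead of using a regex, and update a single count map directly. A plain HashMap stands in for Trove's TObjectIntHashMap to keep the sketch dependency-free, and the word-character test is a simplification of the question's full separator set:

```java
import java.util.HashMap;
import java.util.Map;

public class NoRegexCounter {
    // True for characters that belong inside a word; '-' is kept so that
    // "aaa-aa" stays one token, as the question requires. (Simplified:
    // a stray leading/trailing '-' would be kept too.)
    static boolean isWordChar(char c) {
        return Character.isLetter(c) || c == '-';
    }

    // Scan the line once, adding each token straight into the counts map -
    // no String.split(), no intermediate String[] or ArrayList.
    static void countLine(String line, Map<String, Integer> counts) {
        int start = -1;
        for (int i = 0; i <= line.length(); i++) {
            boolean inWord = i < line.length() && isWordChar(line.charAt(i));
            if (inWord && start < 0) {
                start = i; // token begins
            } else if (!inWord && start >= 0) {
                String word = line.substring(start, i).toLowerCase();
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
                start = -1; // token ends
            }
        }
    }
}
```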
Precompile the pattern instead of compiling it every time through that method, and get rid of the double buffering: use new BufferedReader(new FileReader(...)).
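Concretely, that looks like the following sketch. The separator set is abbreviated, and note that FileReader uses the platform default encoding, unlike the question's explicit-encoding InputStreamReader:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class SimpleReader {
    // Compiled once, not on every call (abbreviated separator set).
    private static final Pattern SEPARATORS = Pattern.compile("[,\\s;:]+");

    static String[] firstLineTokens(String path) throws IOException {
        // A single BufferedReader over a FileReader; FileReader uses
        // the platform default charset.
        BufferedReader br = new BufferedReader(new FileReader(path));
        try {
            return SEPARATORS.split(br.readLine());
        } finally {
            br.close();
        }
    }
}
```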