An efficient way to process large text files in Java
I'm building a frequency dictionary, for which I read 1000 files, each with about 1000 lines. The approach I'm following is:
- BufferedReader to read file by file
- read the first file, get the first sentence, split the sentence into a string array, then fill a HashMap with the values from the string array
- do this for all the sentences in that file
- do this for all 1000 files
My problem is that this is not a very efficient way to do it; it takes me about 4 minutes to do all of this. I've increased the heap size and refactored the code to make sure I'm not doing something wrong. With this approach, I'm completely sure there's nothing I can improve in the code.
My bet is that each time a sentence is read, a split is applied, which, multiplied by 1000 sentences per file and by 1000 files, is a huge amount of splits to process.
My idea is that, instead of reading and processing file by file, I could read each file into a char array and then do the split only once per file. That would reduce the processing time consumed by the split. Any implementation suggestions would be appreciated.
6 Answers
OK, I have just implemented a POC of your dictionary. Fast and dirty. My files contained 868 lines each, but I created 1024 copies of the same file. (It is the table of contents of the Spring Framework documentation.)
I ran my test and it took 14020 ms (14 seconds!). BTW, I ran it from Eclipse, which could decrease the speed a little bit.
So, I do not know where your problem is. Please try my code on your machine, and if it runs faster, try to compare it with your code to understand where the root problem is.
Anyway, my code is not the fastest I could write.
I could create the Pattern before the loop and use it instead of String.split(). String.split() calls Pattern.compile() every time, and creating a pattern is very expensive.
Here is the code:
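(The original listing did not survive in this copy of the answer; the block below is only a hedged reconstruction of the POC as described: a BufferedReader per file, String.split() on every line, and counts collected in a HashMap. The data directory and the copy0.txt … copy1023.txt file names are placeholders.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FrequencyDictionaryPoc {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> frequencies = new HashMap<>();
        long start = System.currentTimeMillis();
        // 1024 copies of the same file; the names "data/copy<i>.txt" are placeholders.
        for (int i = 0; i < 1024; i++) {
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("data", "copy" + i + ".txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Fast and dirty: split each line on whitespace and count the tokens.
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            frequencies.merge(word, 1, Integer::sum);
                        }
                    }
                }
            }
        }
        System.out.println(frequencies.size() + " distinct words in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
```

The Pattern optimization mentioned above would amount to compiling `Pattern.compile("\\s+")` once before the loops and calling its `split(line)` method instead of `line.split("\\s+")`.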
If you don't care that the contents are in different files, I would take the approach you are recommending: read all files and all lines into memory (a string, char array, whatever) and then do one split and populate the hash based on that one string/dataset.
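A minimal sketch of that "read everything, split once" idea, assuming whitespace-separated words, UTF-8 files, and a placeholder corpus directory:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SingleSplitCounter {
    public static void main(String[] args) throws IOException {
        StringBuilder all = new StringBuilder();
        // Concatenate every file into one big in-memory string ("corpus" is a placeholder).
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("corpus"))) {
            for (Path file : files) {
                all.append(new String(Files.readAllBytes(file), StandardCharsets.UTF_8)).append('\n');
            }
        }
        // One split over the whole dataset, then populate the frequency map.
        Map<String, Integer> frequencies = new HashMap<>();
        for (String word : all.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        System.out.println(frequencies.size() + " distinct words");
    }
}
```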
If I understand what you're doing, I don't think you want to use strings except when you access your map.
You want to:
- loop through the files
- read each file into a buffer of something like 1024 characters
- process the buffer looking for word-end characters
- create a String from the character array
- check your map
- if the word is found, update your count; if not, create a new entry
- when you reach the end of the buffer, get the next buffer from the file
- at the end, loop to the next file
Split is probably pretty expensive since it has to interpret the expression each time.
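A hedged sketch of those steps (not the original poster's code), assuming a 1024-character buffer, letters and digits as word characters, and placeholder file paths:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferedWordCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Placeholder file list; in practice this would be the 1000 input files.
        List<Path> files = Arrays.asList(Paths.get("data", "file1.txt"), Paths.get("data", "file2.txt"));
        char[] buffer = new char[1024];
        StringBuilder word = new StringBuilder();
        for (Path file : files) {                              // loop through the files
            try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                int read;
                while ((read = reader.read(buffer)) != -1) {   // get the next 1024-char buffer
                    for (int i = 0; i < read; i++) {           // scan for word-end characters
                        char c = buffer[i];
                        if (Character.isLetterOrDigit(c)) {
                            word.append(c);
                        } else if (word.length() > 0) {
                            // Create a String only when a whole word has been collected.
                            counts.merge(word.toString(), 1, Integer::sum);
                            word.setLength(0);
                        }
                    }
                }
                if (word.length() > 0) {                       // flush a word ending at EOF
                    counts.merge(word.toString(), 1, Integer::sum);
                    word.setLength(0);
                }
            }
        }
        System.out.println(counts.size() + " distinct words");
    }
}
```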
Reading the file as one big string and then splitting that sounds like a good idea. String splitting/modifying can be surprisingly 'heavy' when it comes to garbage collection. Multiple lines/sentences mean multiple Strings, and with all the splits that means a huge amount of Strings (Strings are immutable, so any change to them actually creates a new String or multiple Strings)... this produces a lot of garbage to be collected, and the garbage collection can become a bottleneck (with a smaller heap, the maximum amount of memory is reached all the time, kicking off a garbage collection that potentially needs to clean up hundreds of thousands or millions of separate String objects).
Of course, without knowing your code this is just a wild guess, but back in the day I got an old command-line Java program's running time (it was a graph algorithm producing a huge SVG file) to drop from about 18 seconds to less than 0.5 seconds just by modifying the string handling to use StringBuffers/Builders.
Another thing that springs to mind is using multiple threads (or a thread pool) to handle different files concurrently and then combining the results at the end. Once you get the program to run "as fast as possible", the remaining bottleneck will be disk access, and the only way (afaik) to get past that is faster disks (SSDs, etc.).
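As one possible illustration of the thread-pool idea (not this answer's code; the file names and the whitespace tokenization are assumptions), each file could be counted in its own task, with the per-file maps merged at the end:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCounter {
    public static void main(String[] args) throws Exception {
        // Placeholder file list; in practice this would be the 1000 input files.
        List<Path> files = Arrays.asList(Paths.get("data", "a.txt"), Paths.get("data", "b.txt"));
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (Path file : files) {
            Callable<Map<String, Integer>> task = () -> countFile(file);  // one task per file
            futures.add(pool.submit(task));
        }
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> future : futures) {
            // Merge each per-file result into the combined frequency map.
            future.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        pool.shutdown();
        System.out.println(total.size() + " distinct words");
    }

    private static Map<String, Integer> countFile(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Keeping a local map per task and merging at the end avoids contention on a single shared map while the files are being read.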
Since you're using a BufferedReader, why do you need to read in a whole file explicitly? I definitely wouldn't use split if you're after speed; remember, it has to evaluate a regular expression every time you run it.
Try something like this for your inner loop (note, I have not compiled this or tried to run it):
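(The snippet itself is missing from this copy of the answer; a hedged reconstruction of such an inner loop, reading the BufferedReader character by character instead of splitting, might look like the following. The helper name, and treating letters/digits as word characters, are assumptions.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Map;

class InnerLoop {
    // Hypothetical helper: counts words from an already-open reader into an existing map.
    static void countWords(BufferedReader reader, Map<String, Integer> map) throws IOException {
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {       // read character by character, no split
            if (Character.isLetterOrDigit(c)) {
                current.append((char) c);         // still inside a word
            } else if (current.length() > 0) {    // hit a delimiter: the word is complete
                map.merge(current.toString(), 1, Integer::sum);
                current.setLength(0);
            }
        }
        if (current.length() > 0) {               // flush the last word in the file
            map.merge(current.toString(), 1, Integer::sum);
        }
    }
}
```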
You could try using different sized buffers explicitly, but you probably won't get a performance improvement over this.
One very simple approach which uses minimal heap space and should be (almost) as fast as anything else would look something like this:
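(The snippet referred to here is also missing from this copy; the sketch below is only a guess at such a minimal-allocation version, reusing one small buffer and one StringBuilder and treating only spaces and line breaks as separators. The file path is a placeholder.)

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LowHeapCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        Path file = Paths.get("data", "file1.txt");   // placeholder path
        char[] buf = new char[8192];                  // one small buffer, reused for the whole file
        StringBuilder word = new StringBuilder();     // reused for every word
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buf[i];
                    if (c == ' ' || c == '\n' || c == '\r') {   // separator characters
                        if (word.length() > 0) {
                            counts.merge(word.toString(), 1, Integer::sum);
                            word.setLength(0);
                        }
                    } else {
                        word.append(c);
                    }
                }
            }
            if (word.length() > 0) {                  // flush the last word
                counts.merge(word.toString(), 1, Integer::sum);
            }
        }
        System.out.println(counts.size() + " distinct words");
    }
}
```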
Extend it for more separator characters as needed, and possibly use multi-threading to process multiple files concurrently until disk I/O becomes the bottleneck...