Which API in Java should I use to read files to get the best performance?

Asked 2024-08-12 07:16:37

Where I work, we regularly have files with more than a million rows per file. Even though the server has more than 10 GB of memory, with 8 GB allocated to the JVM, the server sometimes hangs for a few moments and chokes the other tasks.

I profiled the code and found that while reading the files, memory use frequently rises by gigabytes (1 GB to 3 GB) and then suddenly drops back to normal. It seems that this frequent swing between high and low memory use is what hangs my server. Of course, this is due to garbage collection.

Which API should I use to read the files for better performance?

Right now I am using BufferedReader(new FileReader(...)) to read these CSV files.

Process: how I read the files

  1. I read the files line by line.
  2. Every line has a few columns. I parse each according to its type (the cost column as a double, the visit column as an int, the keyword column as a String, etc.).
  3. I push the eligible rows (visit > 0) into a HashMap and finally clear that map at the end of the task (a rough sketch of this flow follows the list).
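A minimal sketch of this flow, assuming a comma-separated layout and a hypothetical column order (cost, visit, keyword), since the real CSV format isn't shown:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class CsvLoader {
        public static Map<String, Double> loadEligible(String path) throws IOException {
            Map<String, Double> eligible = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");           // assumption: comma-separated
                    double cost = Double.parseDouble(cols[0]); // assumption: column order
                    int visit = Integer.parseInt(cols[1]);
                    String keyword = cols[2];
                    if (visit > 0) {                           // keep only eligible rows
                        eligible.put(keyword, cost);
                    }
                }
            }
            return eligible;
        }
    }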

Update

I read 30 or 31 files (one month's data) and store the eligible rows in a map. Later this map is used to find some culprits in different tables, so reading the files is a must and storing that data is also a must. I have since switched the HashMap part to BerkeleyDB, but the problem while reading the files is the same or even worse.

3 Answers

恍梦境° 2024-08-19 07:16:37

BufferedReader is one of the two best APIs to use for this. If you really had trouble with file reading, an alternative might be to use the stuff in NIO to memory-map your files and then read the contents directly out of memory.
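For illustration, a minimal sketch of that memory-mapping alternative using java.nio; the file name and the UTF-8 charset are assumptions, and since a single mapping is limited to 2 GB, a real implementation would map and process large files in windows:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedRead {
        public static void main(String[] args) throws IOException {
            Path path = Paths.get("data.csv");   // hypothetical file name
            try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
                // Map the whole file read-only; one mapping can cover at most 2 GB,
                // so larger files would be processed in successive windows.
                MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                CharSequence text = StandardCharsets.UTF_8.decode(buffer);
                long lines = 0;
                for (int i = 0; i < text.length(); i++) {
                    if (text.charAt(i) == '\n') lines++;   // trivial scan for illustration
                }
                System.out.println("lines: " + lines);
            }
        }
    }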

But your problem is not with the reader. Your problem is that every read operation creates a bunch of new objects, most likely in the stuff you do just after reading.

You should consider cleaning up your input processing with an eye on reducing the number and/or size of objects you create, or simply getting rid of objects more quickly once no longer needed. Would it be possible to process your file one line or chunk at a time rather than inhaling the whole thing into memory for processing?
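As one hedged illustration of that idea: accumulate running totals per keyword while reading, so each row produces only short-lived objects instead of being retained in the map. The column layout and the Stats accumulator are assumptions, not the poster's actual schema:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class StreamingAggregator {
        // One small mutable accumulator per keyword, rather than one object per input row.
        static final class Stats {
            double totalCost;
            long totalVisits;
        }

        public static Map<String, Stats> aggregate(String path) throws IOException {
            Map<String, Stats> byKeyword = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");   // assumption: comma-separated layout
                    int visit = Integer.parseInt(cols[1]);
                    if (visit <= 0) continue;          // drop ineligible rows immediately
                    Stats s = byKeyword.computeIfAbsent(cols[2], k -> new Stats());
                    s.totalCost += Double.parseDouble(cols[0]);
                    s.totalVisits += visit;
                }
            }
            return byKeyword;
        }
    }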

Another possibility would be to fiddle with garbage collection. You have two mechanisms:

  • Explicitly call the garbage collector every once in a while, say every 10 seconds or every 1000 input lines or something. This will increase the amount of work done by the GC, but each GC will take less time, your memory won't swell as much, and so hopefully there will be less impact on the rest of the server (a minimal sketch follows this list).

  • Fiddle with the JVM's garbage collector options. These differ between JVMs, but java -X should give you some hints.
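A minimal sketch of the first option, assuming the every-1000-lines interval mentioned above; note that System.gc() is only a request the JVM is free to ignore:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class PeriodicGcRead {
        public static void main(String[] args) throws IOException {
            long lineCount = 0;
            try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // ... parse and aggregate the line here ...
                    if (++lineCount % 1000 == 0) {   // every 1000 input lines, as suggested
                        System.gc();                 // a hint only; the JVM may ignore it
                    }
                }
            }
            System.out.println("read " + lineCount + " lines");
        }
    }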

Update: Most promising approach:

Do you really need the whole dataset in memory at one time for processing?

梦旅人picnic 2024-08-19 07:16:37

"I profiled the code and found that while reading the files, memory use frequently rises by gigabytes (1 GB to 3 GB) and then suddenly drops back to normal. It seems that this frequent swing between high and low memory use is what hangs my server. Of course, this is due to garbage collection."

Using BufferedReader(new FileReader(...)) won't cause that.

I suspect that the problem is that you are reading the lines/rows into an array or list, processing them and then discarding the array/list. This will cause the memory usage to increase and then decrease again. If this is the case, you can reduce memory usage by processing each line/row as you read it.

EDIT: We agree that the problem is the space used to represent the file content in memory. An alternative to a huge in-memory hashtable is to go back to the old "sort merge" approach we used when computer memory was measured in kilobytes. (I'm assuming that the processing is dominated by a step where you look up keys K to get the associated rows R.)

  1. If necessary, preprocess each of the input files so that they can be sorted on the key K.

  2. Use an efficient file sort utility to sort all of the input files into order on K. You want a utility that uses a classical merge sort algorithm: it will split each file into smaller chunks that can be sorted in memory, sort the chunks, write them to temporary files, then merge the sorted temporary files. The UNIX / Linux sort utility is a good option.

  3. Read the sorted files in parallel, reading all rows that relate to each key value from all of the files, processing them, and then stepping on to the next key value (a rough sketch of this merge-read follows the list).
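A hedged sketch of step 3, assuming each input file has already been sorted on its key (for example with something like sort -t, -k1,1 file.csv -o file.sorted.csv) and that the key is the first comma-separated column; a PriorityQueue gives a simple k-way merge:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class SortedMergeReader {

        // One open, pre-sorted file plus the line currently at its head.
        static final class Source {
            final BufferedReader reader;
            String current;
            Source(String path) throws IOException {
                reader = new BufferedReader(new FileReader(path));
                current = reader.readLine();
            }
        }

        // Assumption: the key is everything before the first comma.
        static String keyOf(String line) {
            return line.substring(0, line.indexOf(','));
        }

        public static void mergeProcess(List<String> sortedPaths) throws IOException {
            Comparator<Source> byKey = Comparator.comparing(s -> keyOf(s.current));
            PriorityQueue<Source> queue = new PriorityQueue<>(byKey);
            for (String path : sortedPaths) {
                Source src = new Source(path);
                if (src.current != null) queue.add(src); else src.reader.close();
            }
            while (!queue.isEmpty()) {
                String key = keyOf(queue.peek().current);
                List<String> group = new ArrayList<>();
                // Pull every row for this key from every file before moving on.
                while (!queue.isEmpty() && keyOf(queue.peek().current).equals(key)) {
                    Source src = queue.poll();
                    while (src.current != null && keyOf(src.current).equals(key)) {
                        group.add(src.current);
                        src.current = src.reader.readLine();
                    }
                    if (src.current != null) queue.add(src); else src.reader.close();
                }
                processGroup(key, group);   // hypothetical per-key processing
            }
        }

        static void processGroup(String key, List<String> rows) {
            System.out.println(key + " -> " + rows.size() + " rows");
        }
    }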

Actually, I'm a bit surprised that using BerkeleyDB didn't help. However, if profiling tells you that most of the time goes into building the DB, you may be able to speed it up by sorting the input files (as above!) into ascending key order before you build the DB. (When creating a large file-based index, you get better performance if the entries are added in key order.)

月下伊人醉 2024-08-19 07:16:37

Try using the following VM options to tune the GC (and to print some GC details):

-verbose:gc -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
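For reference, a hedged example of how these flags might be passed when launching the job; the -Xmx8g heap size and the com.example.ReportJob main class are assumptions, and the CMS collector these flags select has been removed from recent JDKs, so this applies to JVMs of that era:

    java -verbose:gc -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xmx8g com.example.ReportJob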