Smart buffering in a memory-limited environment (Java)
Dear StackOverflowers,
I am writing an application that sorts a huge number of integers from a binary file. I need it to run as quickly as possible, and the main performance issue is disk access time: the algorithm makes a multitude of reads, which slows it down significantly.
The standard way of doing this would be to fill ~50% of the available memory with a buffered object of some sort (BufferedInputStream etc.), then transfer the integers from the buffered object into an array of integers (which takes up the rest of the free space) and sort the integers in the array. The sorted block is saved back to disk, and the procedure is repeated until the whole file has been split into sorted blocks, which are then merged together.
This strategy for sorting the blocks utilises only 50% of the available memory, since the data is essentially duplicated (50% for the cache and 50% for the array, while both store the same data).
I am hoping to optimise this phase of the algorithm (sorting the blocks) by writing my own buffered class that caches data straight into an int array, so that the array can take up all of the free space rather than just 50% of it. That would reduce the number of disk accesses in this phase by a factor of 2. The thing is, I am not sure where to start.
EDIT:
Essentially, I would like to find a way to fill an array of integers by executing only one read on the file. Another constraint is that the array has to use most of the free memory.
If any of the statements I made are wrong, or at least seem to be, please correct me.
Any help appreciated,
Regards
Comments (3)
You might want to look into the Java NIO libraries, specifically FileChannel and IntBuffer.
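As a rough illustration of that suggestion, here is a minimal sketch of filling a large int[] straight from a FileChannel through a small scratch buffer, so that the array can be sized close to the free heap. The class name ChunkReader, the method readChunk and the buffer sizes are hypothetical, and the sketch assumes the file holds big-endian 32-bit integers (the layout DataOutputStream.writeInt produces); the question does not state the byte order.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class ChunkReader {
    /**
     * Reads up to chunk.length ints from the channel's current position.
     * Assumes big-endian 32-bit ints. Returns how many ints were stored;
     * 0 means the end of the file was reached.
     */
    static int readChunk(FileChannel channel, ByteBuffer scratch, int[] chunk)
            throws IOException {
        // Only use a whole number of ints worth of scratch space.
        int usable = scratch.capacity() - scratch.capacity() % Integer.BYTES;
        if (usable == 0) {
            throw new IllegalArgumentException("scratch buffer smaller than one int");
        }
        int filled = 0;
        boolean eof = false;
        while (filled < chunk.length && !eof) {
            long wanted = (long) (chunk.length - filled) * Integer.BYTES;
            scratch.clear();
            scratch.limit((int) Math.min(usable, wanted));
            // A single read may return fewer bytes than requested, so loop.
            while (scratch.hasRemaining()) {
                if (channel.read(scratch) == -1) {
                    eof = true;
                    break;
                }
            }
            scratch.flip();
            int count = scratch.remaining() / Integer.BYTES;
            // Bulk-copy the decoded ints into the big array.
            scratch.asIntBuffer().get(chunk, filled, count);
            filled += count;
        }
        return filled;
    }

    public static void main(String[] args) throws IOException {
        int[] chunk = new int[1 << 20];                        // hypothetical chunk size
        ByteBuffer scratch = ByteBuffer.allocateDirect(64 * 1024); // hypothetical scratch size
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            int n;
            while ((n = readChunk(ch, scratch, chunk)) > 0) {
                Arrays.sort(chunk, 0, n);
                System.out.println("sorted a chunk of " + n + " ints");
            }
        }
    }
}
```

The scratch buffer is the only duplicate copy of the data, and it stays small no matter how large the chunk array is. A direct buffer is used here because its contents may reside outside the garbage-collected heap, but a heap buffer from ByteBuffer.allocate should work the same way.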
You don't give many hints, but two things come to mind. First, if you have many integers but not that many distinct values, bucket sort could be the solution.
Secondly, one word (OK, term) screams in my head when I hear that: external tape sorting. In the early computer days (i.e. the stone age), data lived on tapes, and it was very hard to sort data spread over multiple tapes. That is very similar to your situation. Indeed, merge sort was the most commonly used sort in those days, and as far as I remember, Knuth's TAOCP has a nice chapter about it. It might contain some good hints about the sizes of caches, buffers and the like.
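To make the merge idea concrete, here is a minimal sketch of the merge phase of an external merge sort: a k-way merge of already sorted run files using a priority queue, with each run read through its own small buffer. The names RunMerger, Run and merge are hypothetical, the run files are assumed to contain big-endian 32-bit integers, and this is not taken from TAOCP or any particular library.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class RunMerger {
    /** One sorted run on disk plus the value most recently read from it. */
    private static final class Run implements Closeable {
        final DataInputStream in;
        int head;

        Run(Path file, int bufferBytes) throws IOException {
            in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(file), bufferBytes));
        }

        /** Loads the next value into head; returns false at the end of the run. */
        boolean advance() throws IOException {
            try {
                head = in.readInt();
                return true;
            } catch (EOFException endOfRun) {
                return false;
            }
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    /** Merges sorted run files into one sorted output file. */
    static void merge(List<Path> runs, Path output, int bufferBytes) throws IOException {
        // The heap always holds the smallest unread value of every open run.
        PriorityQueue<Run> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(output), bufferBytes))) {
            for (Path p : runs) {
                Run r = new Run(p, bufferBytes);
                if (r.advance()) heap.add(r); else r.close();
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                out.writeInt(r.head);
                if (r.advance()) heap.add(r); else r.close();
            }
        } finally {
            // Only non-empty if an exception interrupted the merge.
            for (Run r : heap) r.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RunMerger <output> <run1> <run2> ...
        List<Path> runs = new ArrayList<>();
        for (int i = 1; i < args.length; i++) runs.add(Paths.get(args[i]));
        merge(runs, Paths.get(args[0]), 8192);
    }
}
```

Memory use in this phase is roughly one buffer per run plus the output buffer, so the number of runs that can be merged in a single pass depends on the buffer size chosen.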
When you say limited, how limited: <1 MB, <10 MB, <64 MB?
It makes a difference, since you won't actually get much benefit, if any, from having large BufferedInputStreams. In most cases the default buffer size of 8192 bytes (JDK 1.6) is enough, and increasing it doesn't usually make much difference.
Using a smaller BufferedInputStream should leave you with nearly all of the heap to create and sort each chunk before writing it to disk.
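A minimal sketch of that approach, assuming the input holds big-endian 32-bit integers: keep the stream buffer at 8192 bytes and give the rest of the memory to the chunk array. The names ChunkSorter, fillAndSort and writeSortedRuns, and the run-N.bin file naming, are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkSorter {
    // Small stream buffer, so that the int[] can claim almost all of the heap.
    static final int STREAM_BUFFER_BYTES = 8192;

    /** Fills chunk from the stream, sorts the filled part, returns its length. */
    static int fillAndSort(DataInputStream in, int[] chunk) throws IOException {
        int n = 0;
        try {
            while (n < chunk.length) {
                chunk[n] = in.readInt();
                n++;
            }
        } catch (EOFException endOfInput) {
            // Fewer ints than chunk.length were left; keep what was read.
        }
        Arrays.sort(chunk, 0, n);
        return n;
    }

    /** Splits input into sorted run files of at most intsPerChunk ints each. */
    static List<Path> writeSortedRuns(Path input, Path runDir, int intsPerChunk)
            throws IOException {
        List<Path> runs = new ArrayList<>();
        int[] chunk = new int[intsPerChunk];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(input), STREAM_BUFFER_BYTES))) {
            int n;
            while ((n = fillAndSort(in, chunk)) > 0) {
                Path run = runDir.resolve("run-" + runs.size() + ".bin");
                try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                        Files.newOutputStream(run), STREAM_BUFFER_BYTES))) {
                    for (int i = 0; i < n; i++) out.writeInt(chunk[i]);
                }
                runs.add(run);
            }
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ChunkSorter <input file> <run directory> <ints per chunk>
        List<Path> runs = writeSortedRuns(
                Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("wrote " + runs.size() + " sorted runs");
    }
}
```

With this layout the only duplicated data is the 8 KB held by each stream buffer, so intsPerChunk can be chosen close to the free heap divided by four bytes per int, leaving some headroom.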