Unexpected behavior of FileInputStream in Java
I am in the process of writing an application that processes a huge number of integers from a binary file (up to 50 MB). I need to do this as quickly as possible, and the main performance issue is disk access time: since I make a large number of reads from the disk, optimizing read time would improve the performance of the app in general.
Up until now I thought that the fewer blocks I split my file into (i.e. the fewer reads I make / the larger each read is), the faster my app should work. This is because an HDD is very slow at seeking, i.e. locating the beginning of a block, due to its mechanical nature. However, once it locates the beginning of the block you asked it to read, it should perform the actual read fairly quickly.
Well, that was up until I ran this test:
Old test removed; it had issues due to HDD caching.
NEW TEST (HDD cache doesn't help here since the file is too big (1 GB) and I access random locations within it):
int mega = 1024 * 1024;
int giga = 1024 * 1024 * 1024;

byte[] bigBlock = new byte[mega];
int hundredKilo = mega / 10;
byte[][] smallBlocks = new byte[10][hundredKilo];

String location = "C:\\Users\\Vladimir\\Downloads\\boom.avi";

RandomAccessFile raf;
FileInputStream f;
long start;
long end;
int position;
java.util.Random rand = new java.util.Random();

int bigBufferTotalReadTime = 0;
int smallBufferTotalReadTime = 0;

// Test 1: one 1 MB read at a random position, repeated 100 times
for (int j = 0; j < 100; j++)
{
    position = rand.nextInt(giga);
    raf = new RandomAccessFile(location, "r");
    raf.seek((long) position);
    f = new FileInputStream(raf.getFD());
    start = System.currentTimeMillis();
    f.read(bigBlock);
    end = System.currentTimeMillis();
    bigBufferTotalReadTime += end - start;
    f.close();
}

// Test 2: ten consecutive ~100 KB reads at a random position, repeated 100 times
for (int j = 0; j < 100; j++)
{
    position = rand.nextInt(giga);
    raf = new RandomAccessFile(location, "r");
    raf.seek((long) position);
    f = new FileInputStream(raf.getFD());
    start = System.currentTimeMillis();
    for (int i = 0; i < 10; i++)
    {
        f.read(smallBlocks[i]);
    }
    end = System.currentTimeMillis();
    smallBufferTotalReadTime += end - start;
    f.close();
}

System.out.println("Average performance of small buffer: " + (smallBufferTotalReadTime / 100));
System.out.println("Average performance of big buffer: " + (bigBufferTotalReadTime / 100));
RESULTS:
Average for small buffer - 35 ms
Average for large buffer - 40 ms ?!
(Tried on Linux and Windows; in both cases the larger block size results in a longer read time. Why?)
After running this test many, many times, I have realised that for some magical reason reading one big block takes, on average, longer than sequentially reading 10 smaller blocks. I thought that it might have been a result of Windows being too smart and trying to optimize something in its file system, so I ran the same code on Linux, and to my surprise I got the same result.
I have no clue as to why this is happening; could anyone please give me a hint? Also, what would be the best block size in this case?
Kind Regards
2 Answers
After you read the data the first time, the data will be in disk cache. The second read should be much faster. You need to run the test you think is faster first. ;)
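One way to take ordering and caching out of the picture is to alternate the two measurements within the same run, so neither buffer size always runs first against a cold cache. A minimal sketch of that idea, assuming the same ~1 GB test file and buffer sizes as in the question (this is an illustration, not code from the question or from this answer):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class AlternatingReadTest {
    // Assumed ~1 GB test file, as in the question.
    static final String LOCATION = "C:\\Users\\Vladimir\\Downloads\\boom.avi";

    public static void main(String[] args) throws IOException {
        Random rand = new Random();
        byte[] big = new byte[1024 * 1024];       // one 1 MB buffer
        byte[] small = new byte[big.length / 10]; // one ~100 KB buffer
        long bigTotal = 0;
        long smallTotal = 0;

        for (int j = 0; j < 100; j++) {
            // Alternate the two measurements at independent random offsets so
            // neither one systematically benefits from the other's cached data.
            bigTotal += timeReads(rand.nextInt(900 * 1024 * 1024), big, 1);
            smallTotal += timeReads(rand.nextInt(900 * 1024 * 1024), small, 10);
        }
        System.out.println("Average big read:    " + (bigTotal / 100) + " ms");
        System.out.println("Average small reads: " + (smallTotal / 100) + " ms");
    }

    // Seek to 'position' and time 'count' consecutive reads into 'buffer'.
    static long timeReads(int position, byte[] buffer, int count) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(LOCATION, "r");
        raf.seek(position);
        FileInputStream in = new FileInputStream(raf.getFD());
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            in.read(buffer); // return value ignored for brevity; reads near EOF may be short
        }
        long elapsed = System.currentTimeMillis() - start;
        in.close(); // closing the stream also closes the shared file descriptor
        return elapsed;
    }
}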
If you have 50 MB of memory, you should be able to read the entire file at once.
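As a rough illustration of comparing read sizes (a sketch only, assuming a hypothetical ~50 MB test file named data.bin; this is not the benchmark originally posted with this answer), one could read the entire file once per buffer size and time each pass:

import java.io.FileInputStream;
import java.io.IOException;

public class ReadSizeBenchmark {
    public static void main(String[] args) throws IOException {
        // Hypothetical ~50 MB test file; substitute any large local file.
        String location = "data.bin";

        // Time a full sequential pass over the file for a range of buffer sizes.
        for (int size = 4 * 1024; size <= 4 * 1024 * 1024; size *= 2) {
            byte[] buffer = new byte[size];
            long start = System.currentTimeMillis();
            try (FileInputStream in = new FileInputStream(location)) {
                while (in.read(buffer) > 0) {
                    // discard the data; only the read time matters here
                }
            }
            long time = System.currentTimeMillis() - start;
            System.out.println("Read size " + (size / 1024) + " KB took " + time + " ms");
        }
    }
}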
On my laptop, the optimal read size appears to be about 32 KB. Note: as the file is entirely in the disk cache, this may not be the optimal size for a file which is actually read from disk.
As noted, your test is hopelessly compromised by reading the same data for each run.
I could spew on, but you'll probably get more out of reading this article, and then looking at this example of how to use FileChannel.
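The linked example is not reproduced here; as a minimal sketch of the FileChannel approach (assuming a hypothetical data.bin of big-endian ints, the ByteBuffer default), one could memory-map the file and iterate over it as an IntBuffer:

import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FileChannelIntReader {
    public static void main(String[] args) throws IOException {
        // Hypothetical path to the binary file of integers.
        try (FileChannel channel = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            // Map the whole file (reasonable at ~50 MB) and view it as ints.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            IntBuffer ints = buffer.asIntBuffer();

            long sum = 0;
            while (ints.hasRemaining()) {
                sum += ints.get(); // process each integer; here they are just summed
            }
            System.out.println("Read " + (buffer.capacity() / 4) + " ints, sum = " + sum);
        }
    }
}

Mapping the whole file is reasonable here because it is at most about 50 MB; for much larger files you would map regions or read into a reused ByteBuffer instead.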