Any code hints to speed up Java FileChannel random reads?

Posted on 2024-08-16 10:50:02

I have a large (3Gb) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.

I create the FileChannel like this...

f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();

I then use a private ByteBuffer the size of a double to read from it

private ByteBuffer _double_bb = ByteBuffer.allocate(8);

and my reading code looks like this

public double GetValue(long lRow, long lCol) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long position = idx * BLOCK_SIZE;
    double d = 0;
    try 
    {
        _double_bb.position(0);
        _ioChannel.read(_double_bb, position);
        d = _double_bb.getDouble(0);
    } 

    ...snip...

    return d;
}

and I write to it like this...

public void SetValue(long lRow, long lCol, double d) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long offset = idx * BLOCK_SIZE;
    try 
    {
        _double_bb.putDouble(0, d);
        _double_bb.position(0);
        _ioChannel.write(_double_bb, offset);
    } 

    ...snip...

}

The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now down to the core set of reads that I feel is necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.

So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.

Thanks in advance

Comments (5)

猫九 2024-08-23 10:50:02

As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.

This is more important than anything else you can do, because accessing random spots on an HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.

So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.

Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.

下雨或天晴 2024-08-23 10:50:02

Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().

Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, for image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).

You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
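
A minimal sketch of what the mapping could look like, keeping the question's 8-byte slots; the class, the movable-window bookkeeping and the method names are assumptions of mine, not code from the question. Note that a single MappedByteBuffer cannot cover more than 2 GB, so for a 3 GB file you map the region you are currently working on rather than the whole file:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Maps one window of the file at a time; a single MappedByteBuffer is
// limited to 2 GB, so a 3 GB file cannot be mapped in one go.
class MappedDoubleStore
{
    private static final int BLOCK_SIZE = 8;   // one double per slot, as in the question
    private final FileChannel _channel;
    private MappedByteBuffer _mapped;          // currently mapped window
    private long _mapStartIdx;                 // index of the first double in the window

    MappedDoubleStore(String filename, long extent) throws IOException
    {
        RandomAccessFile raf = new RandomAccessFile(filename, "rw");
        raf.setLength(extent * BLOCK_SIZE);
        _channel = raf.getChannel();
    }

    // Map the window covering double indices [startIdx, startIdx + numDoubles).
    void MapRegion(long startIdx, long numDoubles) throws IOException
    {
        _mapStartIdx = startIdx;
        _mapped = _channel.map(FileChannel.MapMode.READ_WRITE,
                               startIdx * BLOCK_SIZE,
                               numDoubles * BLOCK_SIZE);
    }

    // Both accessors assume idx falls inside the currently mapped window.
    double GetValue(long idx)
    {
        return _mapped.getDouble((int) ((idx - _mapStartIdx) * BLOCK_SIZE));
    }

    void SetValue(long idx, double d)
    {
        _mapped.putDouble((int) ((idx - _mapStartIdx) * BLOCK_SIZE), d);
    }
}

Reads and writes against the mapped buffer then become plain memory accesses, with the OS page cache deciding when pages are faulted in and flushed back.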

一袭水袖舞倾城 2024-08-23 10:50:02

Presumably if we can reduce the number of reads then things will go more quickly.

3Gb isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.

Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.

Or, if you have the capacity, read the whole thing into memory at the start of processing.
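
A rough sketch of that caching idea, assuming fixed pages of 1024 doubles and an unbounded HashMap (a real version would cap the cache and evict old pages); the class name, page size and write-through policy are my own choices for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;

// Caches fixed-size pages of doubles read from the channel; writes go
// straight through to the file and also update the cached page.
class PageCachedDoubleStore
{
    private static final int PAGE_DOUBLES = 1024;   // 8 KB pages, an arbitrary choice
    private static final int BLOCK_SIZE = 8;
    private final FileChannel _channel;
    private final Map<Long, ByteBuffer> _pages = new HashMap<>();

    PageCachedDoubleStore(FileChannel channel)
    {
        _channel = channel;
    }

    public double GetValue(long idx) throws IOException
    {
        ByteBuffer page = loadPage(idx / PAGE_DOUBLES);
        return page.getDouble((int) (idx % PAGE_DOUBLES) * BLOCK_SIZE);
    }

    public void SetValue(long idx, double d) throws IOException
    {
        // write-through: update the file and, if cached, the page copy
        ByteBuffer one = ByteBuffer.allocate(BLOCK_SIZE);
        one.putDouble(0, d);
        _channel.write(one, idx * BLOCK_SIZE);
        ByteBuffer page = _pages.get(idx / PAGE_DOUBLES);
        if (page != null)
        {
            page.putDouble((int) (idx % PAGE_DOUBLES) * BLOCK_SIZE, d);
        }
    }

    private ByteBuffer loadPage(long pageNo) throws IOException
    {
        ByteBuffer page = _pages.get(pageNo);
        if (page == null)
        {
            page = ByteBuffer.allocate(PAGE_DOUBLES * BLOCK_SIZE);
            _channel.read(page, pageNo * (long) PAGE_DOUBLES * BLOCK_SIZE);
            _pages.put(pageNo, page);
        }
        return page;
    }
}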

顾冷 2024-08-23 10:50:02

  1. Byte-by-byte access always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns); see the sketch after this list.

  2. How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.
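
A small sketch of what a block read could look like, assuming the values of interest sit in consecutive slots of the file (as one row of a row-major layout would); the helper class and method are illustrative, not part of the question's code:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class BlockReads
{
    private static final int BLOCK_SIZE = 8;   // one double per slot, as in the question

    // Reads `count` consecutive doubles starting at double index startIdx
    // with a single channel.read() call instead of one call per value.
    static double[] ReadBlock(FileChannel channel, long startIdx, int count) throws IOException
    {
        ByteBuffer buf = ByteBuffer.allocate(count * BLOCK_SIZE);
        channel.read(buf, startIdx * BLOCK_SIZE);
        buf.flip();
        double[] values = new double[count];
        buf.asDoubleBuffer().get(values);
        return values;
    }
}

One read of a few kilobytes costs roughly the same as one read of 8 bytes, so amortising each seek over a whole row is where the saving comes from.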

Maybe this article helps you...

如果没有 2024-08-23 10:50:02

You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.

The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.
