Any code hints to speed up Java FileChannel random reads?

Posted on 2024-08-16 10:50:02

I have a large (3Gb) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.

I create the FileChannel like this...

f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();

I then use a private ByteBuffer the size of a double to read from it

private ByteBuffer _double_bb = ByteBuffer.allocate(8);

and my reading code looks like this

public double GetValue(long lRow, long lCol) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long position = idx * BLOCK_SIZE;
    double d = 0;
    try 
    {
        _double_bb.position(0);
        _ioChannel.read(_double_bb, position);
        d = _double_bb.getDouble(0);
    } 

    ...snip...

    return d;
}

and I write to it like this...

public void SetValue(long lRow, long lCol, double d) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long offset = idx * BLOCK_SIZE;
    try 
    {
        _double_bb.putDouble(0, d);
        _double_bb.position(0);
        _ioChannel.write(_double_bb, offset);
    } 

    ...snip...

}

The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now down to the core set of reads that I feel is necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.

So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.

Thanks in advance

Comments (5)

猫九 2024-08-23 10:50:02

As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.

This is more important than anything else you can do, because accessing random spots on an HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.

So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.

Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.

下雨或天晴 2024-08-23 10:50:02

Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().

Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, for image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).

You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
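
A minimal sketch of what the mapping could look like, keeping the question's 8-byte slots; the class, the movable-window bookkeeping and the method names are assumptions of mine, not code from the question. Note that a single MappedByteBuffer cannot cover more than 2 GB, so for a 3 GB file you map the region you are currently working on rather than the whole file:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Maps one window of the file at a time; a single MappedByteBuffer is
// limited to 2 GB, so a 3 GB file cannot be mapped in one go.
class MappedDoubleStore
{
    private static final int BLOCK_SIZE = 8;   // one double per slot, as in the question
    private final FileChannel _channel;
    private MappedByteBuffer _mapped;          // currently mapped window
    private long _mapStartIdx;                 // index of the first double in the window

    MappedDoubleStore(String filename, long extent) throws IOException
    {
        RandomAccessFile raf = new RandomAccessFile(filename, "rw");
        raf.setLength(extent * BLOCK_SIZE);
        _channel = raf.getChannel();
    }

    // Map the window covering double indices [startIdx, startIdx + numDoubles).
    void MapRegion(long startIdx, long numDoubles) throws IOException
    {
        _mapStartIdx = startIdx;
        _mapped = _channel.map(FileChannel.MapMode.READ_WRITE,
                               startIdx * BLOCK_SIZE,
                               numDoubles * BLOCK_SIZE);
    }

    // Both accessors assume idx falls inside the currently mapped window.
    double GetValue(long idx)
    {
        return _mapped.getDouble((int) ((idx - _mapStartIdx) * BLOCK_SIZE));
    }

    void SetValue(long idx, double d)
    {
        _mapped.putDouble((int) ((idx - _mapStartIdx) * BLOCK_SIZE), d);
    }
}

Reads and writes against the mapped buffer then become plain memory accesses, with the OS page cache deciding when pages are faulted in and flushed back.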

一袭水袖舞倾城 2024-08-23 10:50:02

Presumably if we can reduce the number of reads then things will go more quickly.

3Gb isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.

Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.

Or, if you have the capacity, read the whole thing into memory at the start of processing.
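
A rough sketch of that caching idea, assuming fixed pages of 1024 doubles and an unbounded HashMap (a real version would cap the cache and evict old pages); the class name, page size and write-through policy are my own choices for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;

// Caches fixed-size pages of doubles read from the channel; writes go
// straight through to the file and also update the cached page.
class PageCachedDoubleStore
{
    private static final int PAGE_DOUBLES = 1024;   // 8 KB pages, an arbitrary choice
    private static final int BLOCK_SIZE = 8;
    private final FileChannel _channel;
    private final Map<Long, ByteBuffer> _pages = new HashMap<>();

    PageCachedDoubleStore(FileChannel channel)
    {
        _channel = channel;
    }

    public double GetValue(long idx) throws IOException
    {
        ByteBuffer page = loadPage(idx / PAGE_DOUBLES);
        return page.getDouble((int) (idx % PAGE_DOUBLES) * BLOCK_SIZE);
    }

    public void SetValue(long idx, double d) throws IOException
    {
        // write-through: update the file and, if cached, the page copy
        ByteBuffer one = ByteBuffer.allocate(BLOCK_SIZE);
        one.putDouble(0, d);
        _channel.write(one, idx * BLOCK_SIZE);
        ByteBuffer page = _pages.get(idx / PAGE_DOUBLES);
        if (page != null)
        {
            page.putDouble((int) (idx % PAGE_DOUBLES) * BLOCK_SIZE, d);
        }
    }

    private ByteBuffer loadPage(long pageNo) throws IOException
    {
        ByteBuffer page = _pages.get(pageNo);
        if (page == null)
        {
            page = ByteBuffer.allocate(PAGE_DOUBLES * BLOCK_SIZE);
            _channel.read(page, pageNo * (long) PAGE_DOUBLES * BLOCK_SIZE);
            _pages.put(pageNo, page);
        }
        return page;
    }
}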

顾冷 2024-08-23 10:50:02

  1. Byte-by-byte access always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns); see the sketch after this list.

  2. How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.
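
A small sketch of what a block read could look like, assuming the values of interest sit in consecutive slots of the file (as one row of a row-major layout would); the helper class and method are illustrative, not part of the question's code:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class BlockReads
{
    private static final int BLOCK_SIZE = 8;   // one double per slot, as in the question

    // Reads `count` consecutive doubles starting at double index startIdx
    // with a single channel.read() call instead of one call per value.
    static double[] ReadBlock(FileChannel channel, long startIdx, int count) throws IOException
    {
        ByteBuffer buf = ByteBuffer.allocate(count * BLOCK_SIZE);
        channel.read(buf, startIdx * BLOCK_SIZE);
        buf.flip();
        double[] values = new double[count];
        buf.asDoubleBuffer().get(values);
        return values;
    }
}

One read of a few kilobytes costs roughly the same as one read of 8 bytes, so amortising each seek over a whole row is where the saving comes from.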

Maybe this article helps you...

如果没有 2024-08-23 10:50:02

You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.

The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.
