Can array access be optimized?

Posted on 2024-09-30 15:11:12

Maybe I'm being misled by my profiler (Netbeans), but I'm seeing some odd behavior, hoping maybe someone here can help me understand it.

I am working on an application, which makes heavy use of rather large hash tables (keys are longs, values are objects). The performance with the built in java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.

The code is very basic and uses a double hashing strategy. It works well and shows the best performance of all the options I've tried so far.

The catch is, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times, and/or do a lot more logic.

What really confuses me is the lookups are called only by one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result of the lookup, but is about 100x faster.

Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes, according to profiling, are virtually free). I don't understand how this bit of code can be so slow since it is just array access, and I don't see any way of making it faster.

Note that the code simply returns the bucket matching the key, the caller is expected to process the bucket. 'size' is the hash.length/2, hash1 does lookups in the first half of the hash table, hash2 does lookups in the second half. key_index is a final int field on the hash table passed into the constructor, and the values array on the Entry objects is a small array of longs usually of length 10 or less.

Any thoughts people have on this are much appreciated.

Thanks.

public final Entry get(final long theKey) {
    Entry aEntry = hash[hash1(theKey, size)];

    if (aEntry != null && aEntry.values[key_index] != theKey) {
        aEntry = hash[hash2(theKey, size)];

        if (aEntry != null && aEntry.values[key_index] != theKey) {
            return null;
        }
    }

    return aEntry;
}

Edit: the code for hash1 and hash2

private static int hash1(final long key, final int hashTableSize) { 
    return (int)(key&(hashTableSize-1)); 
}
private static int hash2(final long key, final int hashTableSize) { 
    return (int)(hashTableSize+((key^(key>>3))&(hashTableSize-1))); 
}

Comments (4)

ぽ尐不点ル 2024-10-07 15:11:12

Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.

The only thing that I would expect might make some difference is to move the key out of the values array of Entry.

Instead of having this:

class Entry {
    long[] values;
}

//...
if ( entry.values[key_index] == key ) { //...

Try this:

class Entry {
    long key;
    long values[];
}

//...
if ( entry.key == key ) { //...

Instead of incurring the cost of accessing a member, plus doing bounds checking, then getting a value of the array, you should just incur the cost of accessing the member.
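To make the suggestion concrete, here is a minimal sketch of the lookup with the key stored as a field. It reuses hash1 and hash2 from the question, but KeyedEntry, TwoChoiceTable and the simplified put are hypothetical names made up for illustration, not the asker's actual classes, and it assumes size is a power of two (which the masking in hash1/hash2 implies).

```java
// Sketch only: KeyedEntry and TwoChoiceTable are hypothetical, not the asker's classes.
final class KeyedEntry {
    final long key;      // key stored directly in the entry, no array indexing needed
    final long[] values; // remaining payload
    KeyedEntry(long key, long[] values) { this.key = key; this.values = values; }
}

final class TwoChoiceTable {
    private final KeyedEntry[] hash;
    private final int size; // half the table length; assumed to be a power of two

    TwoChoiceTable(int size) {
        if (size <= 0 || Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        this.size = size;
        this.hash = new KeyedEntry[size * 2];
    }

    // Same hash functions as in the question.
    private static int hash1(long key, int n) { return (int) (key & (n - 1)); }
    private static int hash2(long key, int n) { return (int) (n + ((key ^ (key >> 3)) & (n - 1))); }

    // Simplified insert for the demo: use the first candidate slot if free,
    // otherwise the second; returns false if both are taken by other keys.
    boolean put(KeyedEntry e) {
        int i = hash1(e.key, size);
        if (hash[i] == null || hash[i].key == e.key) { hash[i] = e; return true; }
        int j = hash2(e.key, size);
        if (hash[j] == null || hash[j].key == e.key) { hash[j] = e; return true; }
        return false;
    }

    // Same control flow as the question's get(), but compares a field
    // instead of reading values[key_index].
    KeyedEntry get(long theKey) {
        KeyedEntry e = hash[hash1(theKey, size)];
        if (e != null && e.key != theKey) {
            e = hash[hash2(theKey, size)];
            if (e != null && e.key != theKey) {
                return null;
            }
        }
        return e;
    }

    public static void main(String[] args) {
        TwoChoiceTable t = new TwoChoiceTable(8);
        t.put(new KeyedEntry(5L, new long[] {1, 2}));
        t.put(new KeyedEntry(13L, new long[] {3, 4})); // collides with 5 in the first half
        System.out.println(t.get(13L).key);
        System.out.println(t.get(21L));
    }
}
```

Whether this is measurably faster will depend on the JIT and on memory layout, so it is worth timing both variants on real data.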

Is there a random-access data type faster than an array?

I was interested in the answer to this question, so I set up a test environment. This is my Array interface:

interface Array {
    long get(int i);
    void set(int i, long v);
}

This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:

class NormalArray implements Array {
    private long[] data;

    public NormalArray(int size) {
        data = new long[size];
    }

    @Override
    public long get(int i) {
        return data[i];
    }

    @Override
    public void set(int i, long v) {
        data[i] = v;
    }
}

And then a control:

class NoOpArray implements Array {
    @Override
    public long get(int i) {
        return 0;
    }
    @Override
    public void set(int i, long v) {
    }
}

Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:

class TenArray implements Array {
    private long v0;
    private long v1;
    private long v2;
    private long v3;
    private long v4;
    private long v5;
    private long v6;
    private long v7;
    private long v8;
    private long v9;
    private long[] extras;

    public TenArray(int size) {
        if (size > 10) {
            extras = new long[size - 10];
        }
    }

    @Override
    public long get(final int i) {
        switch (i) {
        case 0:
            return v0;
        case 1:
            return v1;
        case 2:
            return v2;
        case 3:
            return v3;
        case 4:
            return v4;
        case 5:
            return v5;
        case 6:
            return v6;
        case 7:
            return v7;
        case 8:
            return v8;
        case 9:
            return v9;
        default:
            return extras[i - 10];
        }
    }

    @Override
    public void set(final int i, final long v) {
        switch (i) {
        case 0:
            v0 = v; break;
        case 1:
            v1 = v; break;
        case 2:
            v2 = v; break;
        case 3:
            v3 = v; break;
        case 4:
            v4 = v; break;
        case 5:
            v5 = v; break;
        case 6:
            v6 = v; break;
        case 7:
            v7 = v; break;
        case 8:
            v8 = v; break;
        case 9:
            v9 = v; break;
        default:
            extras[i - 10] = v;
        }
    }
}

I tested it with this harness:

import java.util.Random;

public class ArrayOptimization {
    public static void main(String[] args) {
        int size = 10;
        long[] data = new long[size];
        Random r = new Random();
        for ( int i = 0; i < data.length; i++ ) {
            data[i] = r.nextLong();
        }

        Array[] a = new Array[] {
                new NoOpArray(),
                new NormalArray(size),
                new TenArray(size)
        };

        for (;;) {
            for ( int i = 0; i < a.length; i++ ) {
                testSet(a[i], data, 10000000);
                testGet(a[i], data, 10000000);
            }
        }
    }

    private static void testGet(Array a, long[] data, int iterations) {
        long nanos = System.nanoTime();
        for ( int i = 0; i < iterations; i++ ) {
            for ( int j = 0; j < data.length; j++ ) {
                data[j] = a.get(j);
            }
        }
        long stop = System.nanoTime();
        System.out.printf("%s/get took %fms%n", a.getClass().getName(), 
                (stop - nanos) / 1000000.0);
    }

    private static void testSet(Array a, long[] data, int iterations) {
        long nanos = System.nanoTime();
        for ( int i = 0; i < iterations; i++ ) {
            for ( int j = 0; j < data.length; j++ ) {
                a.set(j, data[j]);
            }
        }
        long stop = System.nanoTime();
        System.out.printf("%s/set took %fms%n", a.getClass().getName(), 
                (stop - nanos) / 1000000.0);

    }
}

The results were somewhat surprising. The TenArray performs non-trivially faster than a NormalArray does (for sizes <= 10). Subtracting the overhead (using the NoOpArray average) you get TenArray as taking ~65% of the time of the normal array. So if you know the likely max size of your array, I suppose it is possible to exceed the speed of an array. I would imagine switch uses either less bounds checking or more efficient bounds checking than does an array.

NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms

Now whether you can in practice get speeds faster than an array I'm not sure; obviously this way you incur any overhead associated with the interface/class/methods.

椵侞 2024-10-07 15:11:12

Most likely you are partially misled in your interpretation of the profiler's results. Profilers notoriously overinflate the performance impact of small, frequently called methods. In your case, the profiling overhead for the get() method is probably larger than the actual processing time spent in the method itself. The situation is made worse by the fact that the instrumentation also interferes with the JIT's ability to inline methods.

As a rule of thumb for this situation: if the total processing time for a piece of work of known length increases more than two- to threefold when running under the profiler, the profiling overhead will give you skewed results.

To verify that your changes actually have an impact, always measure performance improvements without the profiler as well. The profiler can give you hints about bottlenecks, but it can also trick you into looking at places where nothing is wrong.
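As a rough illustration of measuring without the profiler attached, here is a minimal wall-clock timing sketch. WallClockTimer and nanosPerCall are hypothetical names, and the lambda in main is only a stand-in for the real lookup; it warms the code up first so the JIT has a chance to compile it, and accumulates the results so the calls cannot be optimized away entirely.

```java
import java.util.function.LongUnaryOperator;

// Sketch only: WallClockTimer is a hypothetical helper, not part of any library.
final class WallClockTimer {
    // Returns the average wall-clock nanoseconds per call of op, measured after a warm-up phase.
    static double nanosPerCall(LongUnaryOperator op, int warmup, int iterations) {
        long sink = 0;
        for (int i = 0; i < warmup; i++) {
            sink += op.applyAsLong(i); // give the JIT time to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += op.applyAsLong(i);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.print(""); // consume sink so the loops are not dead code
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with a call into the real hash table.
        LongUnaryOperator work = key -> (key ^ (key >> 3)) & 1023;
        System.out.printf("%.2f ns/call%n", nanosPerCall(work, 1_000_000, 10_000_000));
    }
}
```

A hand-rolled loop like this is still easy to get wrong; a dedicated benchmark harness such as JMH is the more rigorous option if the numbers need to be trusted.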

Array bounds checking can have a surprisingly large impact on performance (if you do comparatively little else), but it can also be hard to separate clearly from general memory access penalties. In some trivial cases, the JIT might be able to eliminate the checks (there have been efforts towards bounds check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for(x=0; x<array.length; x++).
Under some circumstances you may be able to replace array access with simple member access, completely avoiding the bounds checks, but that is limited to the rare cases where you access your array exclusively by constant indices. I see no way to apply it to your problem.

The change suggested by Mark Peters is most likely not solely faster because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache friendly way.

淤浪 2024-10-07 15:11:12

Many profilers tell you very confusing things, partly because of how they work, and partly because people have funny ideas about performance to begin with.
For example, you're wondering about how many times functions are called, and you're looking at code and thinking it looks like a lot of logic, therefore slow.

There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.

  • First of all, think in terms of the percent of time a routine or statement is active, rather than the number of times it is called or the average length of time it takes. The reason for that is it is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply the number of calls by the average execution time and divide by the total time just to see if it is big enough to even care about. Also, percent tells you, bottom line, how much fixing it could potentially reduce the overall execution time.

  • Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)

  • Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.

There are profilers that can do this. I'm not sure about Java.
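For Java, one crude way to approximate this by hand is to snapshot a thread's stack a few times with Thread.getStackTrace() and count, per source line, how many snapshots contain it. The sketch below does that; StackSampler and its sample method are hypothetical names, and the busy loop in main is only a stand-in workload. Note that such snapshots are taken at JVM safepoints, so the line attribution can be somewhat biased.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: StackSampler is a hypothetical helper for taking stack samples by hand.
final class StackSampler {
    // Takes `samples` snapshots of target's stack, intervalMillis apart, and counts
    // for each "Class.method:line" how many snapshots contained it.
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int s = 0; s < samples; s++) {
            Set<String> seen = new HashSet<>(); // count each line at most once per snapshot
            for (StackTraceElement frame : target.getStackTrace()) {
                seen.add(frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber());
            }
            for (String line : seen) {
                hits.merge(line, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop sampling if interrupted
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in workload: a thread that spins until interrupted.
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1.0);
            }
        });
        worker.setDaemon(true);
        worker.start();

        int samples = 20;
        Map<String, Integer> hits = sample(worker, samples, 10);
        worker.interrupt();

        // Print lines by the percentage of snapshots that contained them.
        hits.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .forEach(e -> System.out.printf("%5.1f%%  %s%n",
                        100.0 * e.getValue() / samples, e.getKey()));
    }
}
```

With only 20 samples the percentages are coarse, which is the point made next: a line that shows up in a large share of the snapshots is worth a look regardless of the exact figure.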

If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? and only things that have a large enough percent to be worth the trouble, like 10% or more? Such a line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them. If 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percent is off a little bit, as long as you find it? That's another one of those happy myths of profilers - that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will.
So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.

Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so now their percentage is 5/4 of what it was, because the denominator got smaller. Percentage-wise they got bigger, and easier to find. This effect snowballs, allowing you to really squeeze the code.
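The arithmetic behind that last point can be written out as a short worked example. The symbols are introduced here only for illustration: f is the fraction of time removed by the fix, p is another problem's share before the fix, and p' is its share afterwards.

```latex
% Total time after removing a fraction f of the original time T:
T' = (1 - f)\,T
% A remaining problem that took p\,T still takes p\,T, so its new share is:
p' = \frac{p\,T}{(1 - f)\,T} = \frac{p}{1 - f}
% With f = 0.20 and p = 0.10:
T' = 0.8\,T = \tfrac{4}{5}\,T, \qquad p' = \frac{0.10}{0.80} = 0.125 = \tfrac{5}{4} \cdot 0.10
```

So a problem that was 10% of the run becomes 12.5% once the 20% problem is gone, and shows up correspondingly more often in the next round of samples.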

灼痛 2024-10-07 15:11:12

You could try using a memoizing or caching strategy to reduce the number of actual calls. Another thing you could try if you're very desperate is a native array, since indexing those is unbelievably fast, and JNI shouldn't incur too much overhead if you're using parameters like longs that don't require marshalling.
