Java小数据集的数据查找方法?
我们必须根据三个输入数据字段查找一些数据。查找必须很快。仅有大约 20 种可能的查找组合。我们使用静态 HashMap 实例实现了这一点,在该实例中,我们通过连接三个数据字段来创建一个键。有没有更好的方法来做到这一点,或者这是要走的路?代码如下。
更新:我并不是说这段代码很慢。只是好奇是否有更好的方法来做到这一点。我认为可能有一个更优雅的解决方案,但如果没有令人信服的替代方案,我很乐意保留它!
创建类级静态 HashMap 实例:
private static HashMap map = new HashMap();
我们如何将数据加载到内存中:
private void load(Iterator iterator) {
while (iterator.next()) {
Object o = it.next();
key = o.getField1() + "-" + o.getField2() + "-" o.getField3();
map.put(key, o.getData());
}
}
以及我们如何根据三个字段查找数据:
private Stirng getData(String f1, String f2, String f3) {
String key = f1 + "-" + f2 + "-" f3;
return map.get(key);
}
We have to lookup some data based on three input data fields. The lookup has to be fast. There are only about 20 possible lookup combinations. We've implemented this using a static HashMap instance where we create a key by concatinating the three data fields. Is there a better way to do this or is this the way to go? Code is below.
Update: I'm not implying that this code is slow. Just curious if there is a better way to do this. I thought there might be a more elegant solution but I'm happy to keep this in place if there are no compelling alternatives!
Create class level static HashMap instance:
private static HashMap map = new HashMap();
How we load data into memory:
private void load(Iterator iterator) {
while (iterator.next()) {
Object o = it.next();
key = o.getField1() + "-" + o.getField2() + "-" o.getField3();
map.put(key, o.getData());
}
}
And how we look up the data based on the three fields:
private Stirng getData(String f1, String f2, String f3) {
String key = f1 + "-" + f2 + "-" f3;
return map.get(key);
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
好吧,要问自己的问题当然是“它足够快吗?”因为除非您的应用程序需要更快,而这正是瓶颈,否则这并不重要。你所拥有的已经相当有效了。
话虽这么说,如果您想从这个例程中榨取尽可能多的速度(而不用汇编语言重写它;-),您可以考虑使用数组而不是 HashMap,因为只有数量有限的小钥匙。您必须开发某种哈希函数,将每个对象哈希为 0 到 19 之间的唯一数字(或者您实际拥有的元素数量)。您还可以优化该哈希函数的实现,尽管在不知道您正在使用的对象的详细信息的情况下,我无法告诉您具体如何做到这一点。
Well, the question to ask yourself is of course "is it fast enough?" Because unless your application needs to be speedier and this is the bottleneck, it really doesn't matter. What you've got is already reasonably efficient.
That being said, if you want to squeeze every bit of speed possible out of this routine (without rewriting it in assembly language ;-) you might consider using an array instead of a
HashMap
, since there are only a small, limited number of keys. You'd have to develop some sort of hash function that hashes each object to a unique number between 0 and 19 (or however many elements you actually have). You may also be able to optimize the implementation of that hash function, although I couldn't tell you how exactly to do that without knowing the details of the objects you're working with.您可以创建一个具有三个字符串字段的特殊密钥对象,以避免构建密钥字符串:
You could create a special key object having three String fields to avoid building up the key string:
在你的情况下,我会继续使用你概述的实现。对于映射到常量数据的大量常量键,您可以使用最小完美哈希。由于编写这个代码并不简单,而且我不确定现有的库,因此在使用它之前您必须考虑实现成本。
In your case I would keep using the implementation you outlined. For a large list of constant keys mapping to constant data, you could use Minimal Perfect Hashing. As it is not trivial to code this, and I am not sure about existing libraries, you have to consider the implementation cost before using this.
我认为你的方法相当快。通过实现自己的哈希算法获得的任何收益都将非常小,特别是与所需的工作相比。
关于您的密钥格式的一点评论。您最好确保分隔符不能出现在字段 toString() 值中,否则可能会发生按键冲突:
I think your approach is pretty fast. Any gains by implementing your own hashing algorithm would be very small, especially compared to the effort required.
One remark about your key format. You better make sure that your separator cannot occur in the field toString() values, otherwise you might get key collisions:
连接字符串对于创建键来说是一个坏主意。我的主要目的是不清楚。但实际上,很大一部分实现都存在错误,特别是分隔符实际上可能出现在字符串中。就性能而言,我发现只需将字符串 hack 的密钥更改为有意义的密钥对象,程序的速度就会提高百分之十。 (如果您确实需要偷懒,可以使用 Arrays.asList 来创建密钥 - 请参阅 List.equals API 文档。)
Concatenating strings is a bad idea for creating a key. My main object is that it is unclear. But in practice a significant proportion of implementations have bugs, notably that the separator can actually occur in the strings. In terms of performance, I have seen a program speed up ten percent simply by changing the key for a string hack to a meaningful key object. (If you really must be lazy about code, you can use
Arrays.asList
to make the key - seeList.equals
API doc.)由于您只有 20 种组合,因此基于了解每种组合的特征,手工制作“给我该组合的索引 1..20”可能是可行的。
您能列出确切的组合列表吗?
Since you only have 20 combinations it might be feasible to handcraft a "give me the index 1..20 of this combination" based on knowing the characteristics of each combination.
Are you in a position to list the exact list of combinations?
完成此操作的另一种方法是创建一个
Object
来处理您的密钥,您可以使用它覆盖equals()
(和hashCode()
) 对传入密钥进行测试,依次测试field1
、field2
和field3
。编辑(响应评论):
由于您的地图使用从
hashCode()
返回的值将您的密钥放入存储桶中,(然后从中测试equals
),理论上所有键的值都可以相同。但是,我不建议这样做,因为您不会获得 HashMap 性能的好处。您实际上是在迭代存储桶中的所有项目并测试equals()
。您可以采取的一种方法是将对
hashCode()
的调用委托给密钥容器中的值之一。例如,您始终可以从field3
返回 hashCode。在这种情况下,您将把密钥分发到可能与field3
的不同值一样多的存储桶。一旦您的HashMap
找到存储桶,它仍然需要迭代存储桶中的项目来测试equals()
的结果,直到找到匹配项。您可以创建所有字段上
hashCode()
返回的值的总和。正如刚才所讨论的,该值不需要是唯一的。此外,碰撞的可能性以及因此产生的更大的桶要小得多。考虑到这一点,您在HashMap
上的查找应该会更快。EDIT2:
该密钥的良好哈希码问题已在一个单独的问题中得到解答 这里
Another way to get this done is to create an
Object
to handle your key, with which you can overrideequals()
(andhashCode()
) to do a test against an incomming key, testingfield1
,field2
andfield3
in turn.EDIT (in response to comment):
As the value returned from
hashCode()
is used by your Map to put your keys into buckets, (from which it then will testequals
), the value could theoretically be the same for all keys. I wouldn't suggest doing that, however, as you would not reap the benefits of HashMaps performance. You would essentially be iterating over all of your items in a bucket and testingequals()
.One approach you could take would be to delegate the call to
hashCode()
to one of the values in your key container. You could always return the hashCode fromfield3
, for example. In this case, you will distribute your keys to potentially as many buckets as there are distinct values forfield3
. Once yourHashMap
finds the bucket, it will still need to iterate over the items in the bucket to test the result ofequals()
until it finds a match.You could create would be the sum of the values returned by
hashCode()
on all of your fields. As just discussed, this value does not need to be unique. Further, the potential for collision, and therefore larger buckets, is much smaller. With that in mind, your lookups on theHashMap
should be quicker.EDIT2:
the question of a good hash code for this key has been answered in a separate question here