带着面具搜索
有大量具有以下类型的条目:
typedef struct {
int value;
int mask;
int otherData;
} Entry;
我想根据提供的 int key;
尽可能快地在该数组中找到一个条目。该条目需要确保(key & mask) == value
。
这种数组组织的最佳方式是什么?相应的算法是什么?
编辑:数组组织没有限制;它是静态的,可以在查找之前准备好。 value
和 mask
可以具有任意值。
Edit2:value
和mask
可以具有任意值,但数组中的条目数约为10000。因此,可以提前计算某些“模式”。
查找次数很大。
There is big array of entries having the following type:
typedef struct {
int value;
int mask;
int otherData;
} Entry;
I'd like to find an entry in this array according to provided int key;
as fast as posible. The entry is required to ensure that (key & mask) == value
.
What will be the best way for such array organization and what is the corresponding algorithm processing it?
Edit: There is no restrictions on the array organization; it is static and can be prepared before lookup. The value
and mask
may have arbitrary values.
Edit2: value
and mask
may have arbitrary values, but number of entries in the array is about 10000. So, certain "paterns" can be calculated in advance.
The number of lookups is big.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
每个位都是独立的,因此在预处理阶段[*],您可以对每个条目分类 32 次(或者无论您的
int
有多大)次。每个分类存储 2 个集合:当key
为 0 时在该位匹配的集合和当key
为 1 时匹配的集合。也就是说,如果 value == 1 且 mask = = 0 在该位,那么该分类根本不存储该条目,因为它与
key
的任何值都不匹配(事实上,无论您使用什么方案,此类条目都应该是在任何预处理阶段都被删除,所以否分类应该存储一个条目,即使只有一位是这样的)。如果均为 0,则存储到两个集合中。否则存储到两个集合之一中。然后,根据您的密钥,您希望找到 32 个集合的快速交集。
根据原始数组的大小,存储每个集合的最佳方式可能是一个巨大的位数组,指示数组中的每个条目是否在该集合中。然后查找交集可以一次一个字完成 -
&
总共 32 个字,每个位数组中一个字。如果结果为0,则继续。如果结果非 0,则表示有匹配项,并且结果中设置的位会告诉您哪个条目是匹配项。当然,这在数组的大小上仍然是线性的,事实上,您正在执行 31 个&
操作来检查 32 个条目是否匹配,这与简单的线性搜索大致相同通过原始数组。但比较和分支较少,并且您正在查看的数据更加压缩,因此您可能会获得更好的性能。或者可能有更好的方法来进行交叉。
如果键倾向于重复使用,那么您应该将查找结果缓存在从键到条目的映射中。如果可能的键的数量相当小(也就是说,如果可能的输入明显少于 2^32 个键,和/或您有大量可用内存),那么您的预处理阶段可能就是:
[*] 在没有任何预处理的情况下,显然您所能做的就是检查每个数组成员,直到找到匹配项或者检查所有内容。
Each bit is independent, so in a preprocessing phase[*] you could classify each entry 32 (or however big your
int
is) times. Each classification stores 2 sets: those which match at that bit whenkey
is 0 and those which match whenkey
is 1.That is, if value == 1 and mask == 0 at that bit, then that classification doesn't store that entry at all, since it doesn't match any value of
key
(in fact, no matter what scheme you use, such entries should be removed during any preprocessing stage, so no classification should store an entry if even one bit is like this). If both 0, store into both sets. Otherwise store into one of the two sets.Then, given your key, you want to find a fast intersection of 32 sets.
Depending on the size of the original array, it may be that the best way to store each set is a giant bit array indicating whether each entry in the array is in the set or not. Then finding the intersection can be done a word at a time -
&
together 32 words, one from each bit array. If the result is 0, keep going. If the result is non-0, you have a match, and the bit that's set in the result tells you which entry is the match. This is still linear in the size of the array, of course, and in fact you're doing 31&
operations to check 32 entries for a match, which is about the same as the simple linear search through the original array. But there's less comparison and branching, and the data you're looking at is more compressed, so you might get better performance.Or there might be a better way to do the intersection.
If keys tend to be re-used then you should cache the results of the lookup in a map from keys to entries. If the number of possible keys is reasonably small (that is, if significantly less than 2^32 keys are possible inputs, and/or you have a lot of memory available), then your preprocessing phase could just be:
[*] Without any preprocessing, obviously all you can do is check every array member until either you find a match or else you've checked everything.
由于您没有额外的信息(例如,数组已排序),因此您需要线性搜索 - 遍历数组并检查条件 - 伪代码:
Since you don't have extra information (for example, that the array is sorted) you need a linear search - traverse the array and check the condition - pseudocode:
如果您有一个到条目的键映射,那么这将非常容易。
如果你的数组是按键排序的,那么你可以通过一些小的努力来进行字典二分搜索。[实际上,也许不是!]事实上,你'我们只需要遍历数组,直到找到您要查找的内容。也就是说,从头到尾迭代并在找到它时停止。
顺便说一句,这是一个很好的例子,说明了数据结构的选择如何影响算法的可用性。如果你一开始就选择了错误的数据结构,你就不能只是用算法来解决问题!
If you instead had a map of keys to Entries, then this would be really easy.
If your array were sorted by key, then you could do a lexicographic binary search with some small effort.[actually, maybe not!]As it is, you're just going to have to traverse the array until you find what you're looking for. That is, iterate from start to end and stop when you find it.
As an aside, this is a great example of how a choice of data structure affects the availability of algorithms down the line. You can't just throw algorithms at a problem if you picked the wrong data structures in the first place!
线性搜索当然可以,但如果您需要使用相同键进行多次查找,您可以尝试首先根据
(key & mask)
对范围进行排序。如果您只有几个固定键,您可以尝试使用 boost.multi_index,每个键值有一个索引。A linear search would of course work, but if you need many lookups with the same key, you could try sorting the range first according to
(key & mask)
. If you only have a few, fixed keys, you could try using a boost.multi_index, with one index for each key value.如果每个条目的
mask
任意变化,我看不到太多线性搜索的替代方案。如果有重大限制
mask
,这样只有少数几个值是可能的,最好对
mask
的每个值使用某种map
,进行线性搜索找到第一个包含您要查找的值的
map
。或者,如果
掩码
仅涉及几个位,则可能值得使用
multimap
,按value
排序,并用所有的and
进行掩码mask
,并用key
进行索引处理相同,然后是一个线性使用完整的
key
进行搜索以查找完全匹配的内容。If the
mask
varies arbitrarily for each entry, I don't see muchalternative to a linear search. If there are significant constraints on
mask
, such that only a few values are possible, it might be better touse some sort of
map
for each value ofmask
, doing a linear searchto find the first
map
which contained the value you are looking for.Alternatively, if the
mask
s only concern a few bits, it may be worthusing a
multimap
, ordered byvalue
masked with anand
of all ofthe
mask
s, and indexed withkey
handled the same, then a linearsearch using the full
key
to find the exact match.如果掩码中的零位数量很少,您可以为掩码中的每个“无关”位复制该条目。例如,如果
value=0
和mask=0xfffe
,那么您需要在表中为key=0
和key 添加一个条目=1
。对于value=0
和mask=0xfeef
,在表中放入 4 个条目:key=0x0000
、key=0x0010
>、key=0x0100
和key=0x0110
。现在,您可以对条目进行排序并使用二分搜索,或使用二分搜索结构,例如std::map
。If the number of zero bits in your mask is small, you could duplicate the entry for each "don't-care" bit in the mask. For example if
value=0
andmask=0xfffe
then you'd put an entry in the table forkey=0
andkey=1
. Forvalue=0
andmask=0xfeef
, put 4 entries in the table:key=0x0000
,key=0x0010
,key=0x0100
, andkey=0x0110
. Now you can sort the entries and use a binary search, or use a binary search structure such asstd::map
.