If I have an array of keys M and an array of targets N, how can I verify that M[i] exists in N before searching for it?
Like the title says, I'm trying to find elements of M that exist in the large constant array N. Most of the time, no element of M will exist in N, so the vast majority of searches done on M are a waste of time.
I'm looking for some way to create an index to check before doing a full-scale search of M. A project similar to mine creates a bit array from the first few bytes of every element of M and, from what I understand, leverages bit-level parallelism to search it quickly. I don't entirely understand how this works.
So what tricks can I use to cut down the chance of searching M unnecessarily?
This is a mostly language independent question, but just to be as complete as possible, I'm using C++.
You might be thinking of Bloom filters, which are used for exactly this case. They can give you false positives, in which case you have to search the real table, but they never give false negatives, so in most cases they will tell you up front that the item is not stored.
Hash tables are usually the best option for storage, but if your key space is vastly larger than the number of targets, you'll have a sizable number of hash collisions where you have to check whether the target stored there is really the key you're looking for. If key comparison is expensive, that can quickly become a factor.
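To make the idea concrete, here is a minimal Bloom filter sketch in C++. The class name `BloomFilter`, the string key type, and the double-hashing scheme (`std::hash` combined with FNV-1a) are my own assumptions for illustration, not something from the question; a production filter would size the bit array and hash count from the expected number of keys and the acceptable false-positive rate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch (names and hashing scheme are illustrative).
class BloomFilter {
public:
    BloomFilter(std::size_t num_bits, std::size_t num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {
        assert(num_bits > 0 && num_hashes > 0);
    }

    // Set every probe position for this key.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < num_hashes_; ++i)
            bits_[index(key, i)] = true;
    }

    // false: the key is definitely absent. true: the key may be present.
    bool might_contain(const std::string& key) const {
        for (std::size_t i = 0; i < num_hashes_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    // Double hashing: derive the i-th probe from two base hashes.
    std::size_t index(const std::string& key, std::size_t i) const {
        std::uint64_t h1 = std::hash<std::string>{}(key);
        std::uint64_t h2 = fnv1a(key) | 1;  // odd step so probes differ
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    // 64-bit FNV-1a, used here only as a second, independent-ish hash.
    static std::uint64_t fnv1a(const std::string& s) {
        std::uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
        return h;
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
};
```

You would call `add` once for every element of N, then run the real search for `M[i]` only when `might_contain(M[i])` returns true.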
You could build a hash table with the values of N as keys.
Then you try to access hash[M[i]]. If it returns a value, the element exists. Each lookup is O(1), disregarding collisions.
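In C++ this could look roughly like the following, assuming string elements and using `std::unordered_set` as the hash table. The function name `find_present` is made up for the example.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Returns the elements of m that also occur in the index built from N.
// Each membership test is expected O(1).
std::vector<std::string> find_present(
        const std::vector<std::string>& m,
        const std::unordered_set<std::string>& n_index) {
    std::vector<std::string> hits;
    for (const auto& key : m) {
        if (n_index.count(key) != 0) hits.push_back(key);
    }
    return hits;
}
```

Since N is constant, the index is built once, for example with `std::unordered_set<std::string> n_index(n.begin(), n.end());`, and reused for every M.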
Since N is static, you might consider creating a perfect hash function for N. This gives you guaranteed O(1) worst-case lookup time, because no two elements of N share a slot.
The CLR book on algorithms has a chapter on this, and the wiki page above has links which you might find useful. It might be too complicated, though, and you might be hard pressed to find a useful implementation. Look at gperf for one. You could always use a commonly available hash table with expected O(1) lookups instead.
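As a rough illustration of the idea (not how gperf does it), the sketch below finds a collision-free hash for a small static key set by brute-force seed search. The class name `StaticPerfectSet`, the seeded FNV-1a hash, and the quadratic table size are all assumptions made for the example; the quadratic size keeps the seed search short but makes this practical only for small N.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Non-minimal perfect hash for a static key set, built by trying seeds
// until every key lands in its own slot. Illustrative sketch only.
class StaticPerfectSet {
public:
    explicit StaticPerfectSet(const std::vector<std::string>& keys)
        : slots_(keys.size() * keys.size() + 1) {
        for (seed_ = 0;; ++seed_) {
            std::vector<std::optional<std::string>> trial(slots_.size());
            bool ok = true;
            for (const auto& k : keys) {
                auto& slot = trial[slot_of(k)];
                // A slot already holding a different key is a collision.
                if (slot.has_value() && *slot != k) { ok = false; break; }
                slot = k;
            }
            if (ok) { slots_ = std::move(trial); return; }
        }
    }

    // One hash and at most one key comparison per lookup.
    bool contains(const std::string& k) const {
        const auto& slot = slots_[slot_of(k)];
        return slot.has_value() && *slot == k;
    }

private:
    // FNV-1a with the seed mixed into the starting state.
    std::size_t slot_of(const std::string& k) const {
        std::uint64_t h =
            14695981039346656037ULL ^ (seed_ * 0x9E3779B97F4A7C15ULL);
        for (unsigned char c : k) { h ^= c; h *= 1099511628211ULL; }
        return static_cast<std::size_t>(h % slots_.size());
    }

    std::vector<std::optional<std::string>> slots_;
    std::uint64_t seed_ = 0;
};
```

Construct it once from N and call `contains(M[i])` for each key; a miss costs one hash plus at most one comparison.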
I suppose you are storing some extra information which you want to retrieve knowing that it is there? How are you storing those?
You might find a B-tree useful in that case (industry-standard databases usually use some variant of it), and it could even serve as the index itself. You search, and if you find the key, you have the data or a pointer to it. You will find many implementations on the web.