How can I find the closest pairs of binary strings (Hamming distance) in Ruby without an O(n^2) comparison?
I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches. This means, if the two documents have the following numbers. Yes, this is Hamming Distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid doing individual Hamming distance comparisons between every pair of documents, because that makes the running time basically impossible.
I have a LOT of RAM. And in Ruby there seems to be a great gem (algorithms) that can create a number of trees, but I haven't (yet) been able to make any of them work in a way that would reduce the number of queries I'd need to make.
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
Comments (4)
I ended up retrieving all the documents into memory (just the subset of fields with the id and the string).
Then I used a BK-tree to compare the strings.
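For reference, a BK-tree over Hamming distance might look roughly like the sketch below. This is not the poster's actual code: the `BKTree` class, its `add` and `search` methods, and the sample ids are hypothetical names picked for illustration, and the bit strings are assumed to be equal-length strings of `0` and `1`.

```ruby
# Minimal BK-tree sketch keyed on Hamming distance (illustrative names only).
class BKTree
  Node = Struct.new(:id, :bits, :children)

  # Hamming distance of two equal-length bit strings: XOR the integers,
  # then count the 1 bits that remain.
  def self.hamming(a, b)
    (a.to_i(2) ^ b.to_i(2)).to_s(2).count("1")
  end

  def initialize
    @root = nil
  end

  # Insert a document: walk down the edge labelled with the distance to
  # each node until a free slot is found.
  def add(id, bits)
    node = Node.new(id, bits, {})
    return @root = node if @root.nil?

    current = @root
    loop do
      d = BKTree.hamming(current.bits, bits)
      child = current.children[d]
      if child.nil?
        current.children[d] = node
        return node
      end
      current = child
    end
  end

  # Return [id, distance] for every stored string within max_dist of bits.
  def search(bits, max_dist)
    results = []
    stack = @root ? [@root] : []
    until stack.empty?
      node = stack.pop
      d = BKTree.hamming(node.bits, bits)
      results << [node.id, d] if d <= max_dist
      # Triangle inequality: only subtrees whose edge label lies in
      # [d - max_dist, d + max_dist] can contain a match.
      node.children.each do |edge, child|
        stack << child if (edge - d).abs <= max_dist
      end
    end
    results
  end
end

tree = BKTree.new
{ "a" => "00000000", "b" => "00000001", "c" => "11111111" }.each do |id, bits|
  tree.add(id, bits)
end
near = tree.search("00000000", 1) # ids "a" and "b", with their distances
```

How much the tree prunes depends on the search radius: with a small `max_dist` most subtrees are skipped, while a large radius degrades towards a full scan.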
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than `n`. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level `n` in any branch, then give up (in other words, never enter `n + 1`). If you are using a metric where splitting on one dimension doesn't always yield a distance of `1`, you need to adjust the level of recursion where you give up.
As far as I could understand, you have an input string `X` and you want to query the database for a document containing string field `b` such that Hamming distance between `X` and `document.b` is less than some small number `d`.
You can do this in linear time, just by scanning all of your `N` = 1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than `d`, you can give up comparison after `d` unmatched characters; you only need to compare all 256 characters if most of them match.
You can try to scan fewer than `N` documents, that is, to get better than linear time.
Let `ones(s)` be the number of `1`s in string `s`. For each document, store `ones(document.b)` as a new indexed field `ones_count`. Then you can only query documents where number of ones is close enough to `ones(X)`, specifically, `ones(X) - d <= document.ones_count <= ones(X) + d`. The Mongo index should kick in here.
If you want to find all close enough pairs in the set, see @Philippe's answer.
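A rough Ruby sketch of those two ideas (the early-exit comparison and the `ones_count` pre-filter) could look like this. It works on an in-memory array instead of Mongo, so the `Hash` built by `build_index` only stands in for the indexed field. The module name `OnesFilter`, its method names, and the `:id`/`:b` document layout are assumptions for illustration, and "close enough" is taken here to mean distance at most `d`.

```ruby
# Sketch: pre-filter by number of 1 bits, then verify with an early-exit
# Hamming comparison. All names are illustrative.
module OnesFilter
  module_function

  # Number of 1 bits in a bit string such as "0110".
  def ones(s)
    s.count("1")
  end

  # Hamming distance, or nil as soon as more than max_dist positions differ.
  def bounded_hamming(a, b, max_dist)
    diff = 0
    a.each_char.with_index do |ch, i|
      next if ch == b[i]
      diff += 1
      return nil if diff > max_dist
    end
    diff
  end

  # Stand-in for the indexed ones_count field: ones count => documents.
  def build_index(docs)
    docs.group_by { |doc| ones(doc[:b]) }
  end

  # Documents within distance d of x. Only buckets whose ones count is in
  # [ones(x) - d, ones(x) + d] are inspected, because two strings at
  # distance <= d cannot differ by more than d in their number of 1 bits.
  def near_matches(index, x, d)
    target = ones(x)
    ((target - d)..(target + d)).flat_map do |count|
      (index[count] || []).map do |doc|
        dist = bounded_hamming(x, doc[:b], d)
        [doc[:id], dist] if dist
      end.compact
    end
  end
end

docs = [
  { id: 1, b: "0000" },
  { id: 2, b: "0001" },
  { id: 3, b: "1110" }
]
index = OnesFilter.build_index(docs)
matches = OnesFilter.near_matches(index, "0000", 1) # ids 1 and 2
```

The pre-filter is only a necessary condition: strings with the same number of ones can still be far apart, so every candidate is checked with the full comparison.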
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try and work with smaller chunks. Instead of dealing with 256 bit sequences, could you treat that as 32 8-bit sequences? 16 16-bit sequences? At that point you can compute differences in a lookup table and use that as a sort of index.
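One way to turn the chunk idea into an index, sketched below, relies on the pigeonhole principle rather than on a lookup table of differences: if two strings differ in at most `d` bits and are split into `d + 1` chunks, at least one chunk must be identical in both. Indexing every chunk by position and value then yields a short candidate list, which still has to be verified with a full Hamming comparison. `ChunkIndex` and its methods are made-up names, and the string length is assumed to be divisible by the number of chunks.

```ruby
# Sketch: index bit strings by their chunks so that near matches can be
# found through at least one identical chunk. Names are illustrative.
module ChunkIndex
  module_function

  # Split a bit string into `parts` equal-sized chunks
  # (assumes bits.length is a multiple of parts).
  def chunks(bits, parts)
    size = bits.length / parts
    bits.scan(/.{#{size}}/)
  end

  # Build [chunk position, chunk value] => ids of documents with that chunk.
  def build(docs, parts)
    index = Hash.new { |h, k| h[k] = [] }
    docs.each do |id, bits|
      chunks(bits, parts).each_with_index do |chunk, pos|
        index[[pos, chunk]] << id
      end
    end
    index
  end

  # Ids that share at least one identical chunk with bits. With
  # parts = d + 1 this includes every id within distance d, plus
  # false positives that must be filtered out afterwards.
  def candidates(index, bits, parts)
    chunks(bits, parts).each_with_index.flat_map do |chunk, pos|
      index.fetch([pos, chunk], [])
    end.uniq
  end
end

docs = { "a" => "00000000", "b" => "00000001", "c" => "11111111" }
index = ChunkIndex.build(docs, 2) # 2 chunks cover distance <= 1
maybe_near = ChunkIndex.candidates(index, "00000000", 2) # "a" and "b"
```

For 256-bit strings and a small `d`, the chunks stay long enough that random collisions are rare, so the candidate lists tend to be short.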
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.
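As a sketch of that last idea for the smallest useful case, the code below flips each bit of a string in turn and looks the variants up in a hash, which finds all pairs at distance 0 or 1 without comparing documents to each other. The module `FlipSearch` and its method names are hypothetical. For 256-bit strings this costs 257 lookups per document; allowing two flips already means tens of thousands of variants per document, so this only pays off for very small distances.

```ruby
require "set"

# Sketch: find pairs at Hamming distance <= 1 by generating every one-bit
# variant and doing a keyed lookup. Names are illustrative.
module FlipSearch
  module_function

  # All bit strings at Hamming distance exactly 1 from bits.
  def one_bit_variants(bits)
    bits.each_char.with_index.map do |ch, i|
      flipped = bits.dup
      flipped[i] = (ch == "0" ? "1" : "0")
      flipped
    end
  end

  # Sorted list of [id, id] pairs whose strings differ in at most one bit.
  # docs maps id => bit string; ids must be mutually comparable.
  def near_duplicate_pairs(docs)
    ids_by_bits = Hash.new { |h, k| h[k] = [] }
    docs.each { |id, bits| ids_by_bits[bits] << id }

    pairs = Set.new
    docs.each do |id, bits|
      ([bits] + one_bit_variants(bits)).each do |variant|
        ids_by_bits.fetch(variant, []).each do |other|
          pairs << [id, other].sort unless other == id
        end
      end
    end
    pairs.to_a.sort
  end
end

docs = { "a" => "0000", "b" => "0001", "c" => "1111" }
pairs = FlipSearch.near_duplicate_pairs(docs) # only "a" and "b" pair up
```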