Fast similarity detection
I have a large collection of objects and I need to figure out the similarities between them.
To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).
I need the ability to quickly find, given an object, the set of objects similar to it.
To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.
Have you seen this problem before, or something similar to it? What is a good solution?
To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow: O(n^2), where n is the number of objects. Is there a general solution with lower complexity?
Comments (8)
It might be fastest to just abandon the similarity computation when the subtotal becomes larger than d. For example, if your similarities are based on cosine or Hausdorff distances, this can easily be done.
PS: if this cannot be done, your problem might be related to the k-nearest neighbors problem (or, more precisely, a nearest neighbor problem with a threshold neighborhood). You should look for algorithms that find close-by members without computing all distances (maybe something using the triangle inequality). Wikipedia should help you explore suitable algorithms.
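One way the triangle inequality can skip distance computations is pivot-based pruning. The sketch below is only an illustration under assumptions: the names `build_pivot_table` and `range_query` are made up here, and a single pivot is chosen arbitrarily. Because dist(query, x) >= |dist(query, pivot) - dist(x, pivot)| for any metric, an object whose lower bound already exceeds d can be rejected without computing its real distance.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_pivot_table(objects: Sequence[T], pivot: T,
                      dist: Callable[[T, T], float]) -> List[float]:
    """Precompute dist(pivot, x) once for every object (n distance calls)."""
    return [dist(pivot, x) for x in objects]


def range_query(objects: Sequence[T], pivot_dists: Sequence[float], pivot: T,
                dist: Callable[[T, T], float], query: T,
                d: float) -> Tuple[List[T], int]:
    """Return the objects within d of query and the number of distance calls made."""
    dq = dist(query, pivot)
    calls = 1
    result = []
    for x, dx in zip(objects, pivot_dists):
        # Triangle inequality: dist(query, x) >= |dq - dx|, so x cannot match.
        if abs(dq - dx) > d:
            continue
        calls += 1
        if dist(query, x) <= d:
            result.append(x)
    return result, calls


if __name__ == "__main__":
    numbers = [0, 1, 5, 9, 10, 20]
    absdiff = lambda a, b: abs(a - b)
    table = build_pivot_table(numbers, 0, absdiff)
    print(range_query(numbers, table, 0, absdiff, 9, 1))  # ([9, 10], 3)
```

The scan still touches every object, but each rejected object costs one subtraction instead of a full distance computation, which matters when the distance is proportional to object size. Structures such as vantage-point trees build on the same bound to avoid the full scan.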
Without knowing more details of the metric, it's hard to say. I don't have any ideas for eliminating the O(n^2) aspect, but there may be a way to reduce some of the constants involved. For example, if you had a Euclidean metric d(p,q) = sqrt((p_1-q_1)^2 + ... + (p_n-q_n)^2), you could square your threshold distance d and compare it to the partial sums of (p_i-q_i)^2, stopping as soon as a partial sum exceeds d^2.
Whether this will actually save you time depends on how expensive the comparison is relative to just calculating the summands, and on how many summand calculations you could expect to avoid by doing this (obviously, the smaller d is, the better).
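A minimal sketch of that early exit, assuming the objects are equal-length sequences of numbers; the function name `within_distance` is chosen here for illustration only:

```python
from typing import Sequence


def within_distance(p: Sequence[float], q: Sequence[float], d: float) -> bool:
    """True if the Euclidean distance between p and q is at most d.

    Compares partial sums of squares against d^2, so no square root is
    needed and the loop stops as soon as the threshold is exceeded.
    """
    limit = d * d
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
        if total > limit:
            return False  # remaining summands can only increase the sum
    return True


if __name__ == "__main__":
    print(within_distance((0, 0), (3, 4), 5))    # True: distance is exactly 5
    print(within_distance((0, 0), (3, 4), 4.9))  # False
```

The early return is valid because every summand is non-negative, so a partial sum above d^2 can never come back under it.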
If your similarity measure is transitive, you don't have to compute the similarity for all pairs of objects since for objects a, b, c:
similarity(a, c) = similarity(a, b) op similarity(b, c)
where op is a binary operator, e.g. multiplication or addition.
I think the solution depends on a lot more detail about the nature of your problem.
Do you need to find the similar objects for the same object many times, or only once? If it's many times, then creating a data structure where you compute the difference once for each pair and then connect objects to similar objects so that you can retrieve the list quickly without recalculation might be a very useful performance enhancement.
What is the nature of the calculation? At one extreme, if the difference is, for example, the difference in height between two people, then maintaining the list sorted by height would let you find the similar objects very quickly. I'm assuming the real problem is more complicated than that, but following on that logic, if the difference is the sum of several linear quantities, you could create a multi-dimensional array, and then conceptually imagine the set of similar objects as those within an n-dimensional sphere (i.e. circle, sphere, hypersphere, etc.) centered on the reference object, and again find them directly. Actually, it occurs to me that if the radius calculations are too complicated or take too much run-time, a good approximation would be to create an n-dimensional cube (i.e. square, cube, tesseract, etc.) around the reference object, retrieve all objects which lie within that cube as "candidates", and then just do the actual computation on the candidates.
For example, suppose the "difference" is the sum of the absolute values of the differences of three attributes, say a1, a2, and a3. You could create a 3-dimensional array and set the value of each node of the array to the object with those values, if any. Then, to find all objects with difference less than d from object o, you only need to visit the nodes whose indices lie within d of o's values in each dimension and check the objects stored there.
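The three-attribute lookup could be sketched roughly as follows. This is not the answer's original code: it assumes integer attribute values, swaps the dense 3-dimensional array for a dictionary keyed by (a1, a2, a3) so empty cells take no memory, and uses hypothetical names (`Grid3D`, `add`, `similar`).

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Attrs = Tuple[int, int, int]


class Grid3D:
    """Sparse stand-in for a 3-dimensional array indexed by (a1, a2, a3)."""

    def __init__(self) -> None:
        self.cells: Dict[Attrs, List[str]] = defaultdict(list)

    def add(self, name: str, attrs: Attrs) -> None:
        self.cells[attrs].append(name)

    def similar(self, o: Attrs, d: int) -> List[str]:
        """Names of stored objects whose summed absolute difference from o is < d."""
        found: List[str] = []
        # Candidate cells: the cube of side 2d + 1 centered on o.
        ranges = [range(v - d, v + d + 1) for v in o]
        for key in product(*ranges):
            if key not in self.cells:
                continue
            # Actual difference check, done only on occupied candidate cells.
            if sum(abs(a - b) for a, b in zip(key, o)) < d:
                found.extend(self.cells[key])
        return found


if __name__ == "__main__":
    grid = Grid3D()
    grid.add("x", (1, 1, 1))
    grid.add("y", (1, 2, 1))
    grid.add("z", (3, 3, 3))
    print(grid.similar((1, 1, 1), 2))  # ['x', 'y']
```

A query visits (2d + 1)^3 cells regardless of how many objects exist, so this pays off when d is small compared with the attribute ranges.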
I suspect that the difference rules are more complicated than that, but fine, just add sophistication to the algorithm to match the complexity of the rules. The point is to use the array to limit the set of objects that you have to examine.
Is it not possible to use a kd-tree?
It may be necessary (if possible) to normalize the dimensions. Afterwards, you just need to populate the tree and use a "nearest N neighbors" search to find any objects within some range.
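For illustration, here is a small kd-tree with a fixed-radius search instead of a nearest-N search, since the question asks for everything within d. It assumes the objects have already been reduced to numeric points of equal dimension under Euclidean distance; `build_kdtree` and `points_within` are names invented for this sketch.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class Node:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Split on the median, cycling through the axes level by level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build_kdtree(ordered[:mid], depth + 1),
                build_kdtree(ordered[mid + 1:], depth + 1))


def points_within(node: Optional[Node], query: Point, d: float,
                  out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every stored point whose Euclidean distance to query is <= d."""
    if out is None:
        out = []
    if node is None:
        return out
    if math.dist(node.point, query) <= d:
        out.append(node.point)
    delta = query[node.axis] - node.point[node.axis]
    # Left subtree holds coordinates <= the split value, right holds >=.
    # A subtree is skipped when the splitting plane is more than d away.
    if delta <= d:
        points_within(node.left, query, d, out)
    if delta >= -d:
        points_within(node.right, query, d, out)
    return out


if __name__ == "__main__":
    tree = build_kdtree([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 9.0)])
    print(sorted(points_within(tree, (1.0, 1.0), 1.5)))
```

kd-trees tend to prune well only in low dimensions; for high-dimensional feature vectors most subtrees end up being visited.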
Examples of objects: images, documents. Of course, working with the raw representation of these objects is mostly not useful. Usually one would pre-process the raw form and turn it into some normalized form (for documents, say, a vector in which each entry represents the number or percentage of times a certain word appeared; for images, it could be a representation of visual features found in the image).
If d is fixed and an O(n^2) pre-computation is feasible, you could just use a graph representation, for example with a linked list of neighbors for each object.
You can get more efficient solutions at the expense of accuracy by using approximate nearest neighbor algorithms.
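The fixed-d graph idea can be sketched as a one-time pass over all pairs that fills a neighbor list per object. The function name `build_neighbor_lists` is an assumption for this example, and plain Python lists stand in for the linked lists.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def build_neighbor_lists(objects: Sequence[T], dist: Callable[[T, T], float],
                         d: float) -> List[List[int]]:
    """For each object index, list the indices of all objects within d of it.

    Costs O(n^2) distance calls once; afterwards a lookup is a list read.
    """
    neighbors: List[List[int]] = [[] for _ in objects]
    for i, j in combinations(range(len(objects)), 2):
        if dist(objects[i], objects[j]) <= d:
            neighbors[i].append(j)  # the metric is symmetric,
            neighbors[j].append(i)  # so record the edge both ways
    return neighbors


if __name__ == "__main__":
    values = [1, 2, 4, 8]
    print(build_neighbor_lists(values, lambda a, b: abs(a - b), 2))
    # [[1], [0, 2], [1], []]
```

Using symmetry halves the work compared with testing every ordered pair, but the pre-computation remains quadratic, which is the cost the question hopes to avoid.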
Can we assume that similarity is transitive, i.e. diff(a,c) == diff(a,b) + diff(b,c)? If so, you can try the following: keep the objects in a sorted list; then, to find the objects with similarity s to o, find o in the sorted list, and search to the left and to the right until the diff grows larger than s.
The advantage of this is that the sorting can be done once, and subsequent set building is proportional to the number of members that will be in the set.
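A rough sketch of that search, under the assumption that each object can be represented by a single numeric key and that diff is the absolute difference of keys (which is what makes the additive property hold along the sorted order). The name `similar_in_sorted` is hypothetical.

```python
import bisect
from typing import List, Sequence


def similar_in_sorted(keys: Sequence[float], o: float, s: float) -> List[float]:
    """Return every key within s of o, given keys in ascending order."""
    start = bisect.bisect_left(keys, o)  # position of o, found in O(log n)
    lo = start
    # Walk left until the difference grows larger than s.
    while lo > 0 and o - keys[lo - 1] <= s:
        lo -= 1
    hi = start
    # Walk right until the difference grows larger than s.
    while hi < len(keys) and keys[hi] - o <= s:
        hi += 1
    return list(keys[lo:hi])


if __name__ == "__main__":
    print(similar_in_sorted([1, 3, 4, 7, 8, 15], 7, 3))  # [4, 7, 8]
```

After the binary search, the two walks touch only the members of the result plus one extra element on each side, which matches the cost described above.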
Sounds like a BK-tree. Here is a small example. You basically create a tree and check which branches should be used for the similar-object search and which should not, so you avoid the O(n^2) comparisons.
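As a hedged illustration of the branch check, the sketch below builds a BK-tree over strings with Levenshtein edit distance. Both choices are assumptions made for the example: a BK-tree needs an integer-valued metric, and the question does not say which metric it uses. The names `edit_distance` and `BKTree` are invented here.

```python
from typing import Callable, Dict, List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (item, children), with children keyed by distance to item."""

    def __init__(self, dist: Callable[[str, str], int]) -> None:
        self.dist = dist
        self.root: Optional[Tuple[str, Dict[int, tuple]]] = None

    def add(self, item: str) -> None:
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            k = self.dist(item, node[0])
            if k == 0:
                return  # identical item already stored
            child = node[1].get(k)
            if child is None:
                node[1][k] = (item, {})
                return
            node = child

    def query(self, item: str, d: int) -> List[str]:
        """All stored items within distance d of item."""
        found: List[str] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node_item, children = stack.pop()
            k = self.dist(item, node_item)
            if k <= d:
                found.append(node_item)
            # Triangle inequality: matches can only sit under edges in [k - d, k + d].
            for edge, child in children.items():
                if k - d <= edge <= k + d:
                    stack.append(child)
        return found


if __name__ == "__main__":
    tree = BKTree(edit_distance)
    for w in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
        tree.add(w)
    print(sorted(tree.query("book", 1)))  # ['boo', 'book', 'books', 'boon', 'cook']
```

How much is pruned depends on d and on how the distances are distributed; with a large d most branches still have to be visited.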