Fast similarity detection

Posted 2024-08-15 07:31:31

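I have a large collection of objects and I need to figure out the similarities between them.

To be exact: given two objects, I can compute their dissimilarity as a number, a metric. Higher values mean less similarity, and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).

I need to be able to quickly find, given an object, the set of objects similar to it.

To be exact: for some dissimilarity value d, I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, such that listing the objects in the set takes no more time than it would if they were in an array or linked list (and perhaps they actually are). Typically the set will be much smaller than the total number of objects, so it really is worthwhile to perform this computation. It is good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.

Have you seen this problem before, or something similar to it? What is a good solution?

To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow: O(n²), where n is the number of objects. Is there a general solution with lower complexity?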

瑕疵 2024-08-22 07:31:31

"I need to produce a data structure
that maps any object o to the set
of objects no more dissimilar to o than
d, for some dissimilarity value d."

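It is probably fastest to abandon a similarity computation as soon as its running subtotal exceeds d. That is easy to do if, for example, your similarity is based on cosine or Hausdorff distance.

PS: If that cannot be done, your problem may be related to the k-nearest-neighbours problem (or, more precisely, a nearest-neighbour problem with a threshold neighbourhood). Look for algorithms that find nearby members without computing every distance (possibly by using the triangle inequality). Wikipedia should help you explore suitable algorithms.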
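One way the triangle-inequality idea could look in practice is sketched below. It is only an illustration under assumptions not stated in the thread: `PivotIndex`, `neighbours`, and the `full_computations` counter are hypothetical names, the pivot is chosen arbitrarily, and `dist` is assumed to be a true metric. Each object's distance to the pivot is stored once; at query time, any object x with |dist(query, pivot) - dist(x, pivot)| > d can be rejected without calling the expensive metric.

```python
from typing import Callable, Generic, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class PivotIndex(Generic[T]):
    """Stores each object's distance to one fixed pivot (hypothetical helper).

    `dist` must satisfy the triangle inequality for the pruning to be valid.
    """

    def __init__(self, objects: Sequence[T], dist: Callable[[T, T], float]) -> None:
        if not objects:
            raise ValueError("need at least one object")
        self.dist = dist
        self.pivot = objects[0]  # arbitrary pivot choice, an assumption
        # One distance computation per object, done once up front.
        self.entries: List[Tuple[float, T]] = [
            (dist(self.pivot, x), x) for x in objects
        ]
        self.full_computations = 0  # exact distance calls made by queries

    def neighbours(self, query: T, d: float) -> List[T]:
        """Return all indexed objects x with dist(query, x) <= d."""
        q_to_pivot = self.dist(query, self.pivot)
        result = []
        for x_to_pivot, x in self.entries:
            # Lower bound: dist(query, x) >= |dist(query, pivot) - dist(x, pivot)|.
            if abs(q_to_pivot - x_to_pivot) > d:
                continue  # cannot be within d, skip the expensive call
            self.full_computations += 1
            if self.dist(query, x) <= d:
                result.append(x)
        return result


index = PivotIndex([0, 1, 2, 10, 11, 50], lambda a, b: abs(a - b))
close = index.neighbours(10, 1.5)
```

The scan is still linear in the number of objects, but the cheap comparison replaces most metric evaluations when d is small. Structures such as vantage-point trees build on the same bound with several pivots.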

以酷 2024-08-22 07:31:31

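It is hard to say without knowing more details about the metric. I don't have any ideas for eliminating the O(n^2) aspect, but there may be a way to reduce some of the constants involved. For example, if you had a Euclidean metric dist(p,q) = sqrt( (p_1-q_1)^2 + ... + (p_k-q_k)^2 ) over k coordinates, you could square your threshold d, compare it to the partial sums of the (p_i-q_i)^2 terms, and stop as soon as a partial sum exceeds d^2.

Whether this actually saves you time depends on how expensive the comparison is relative to just calculating the summands, and on how many summand calculations you can expect to avoid by doing this (obviously, the smaller d is, the better).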
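The early-exit test described above could be sketched as follows. The function name `within_distance` and its signature are illustrative assumptions, not something given in the thread; it compares the running sum of squared differences against d squared, so no square root is needed.

```python
from typing import Sequence


def within_distance(p: Sequence[float], q: Sequence[float], d: float) -> bool:
    """Return True if the Euclidean distance between p and q is at most d.

    Accumulates squared coordinate differences and stops as soon as the
    partial sum exceeds d*d.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of coordinates")
    limit = d * d
    partial = 0.0
    for p_i, q_i in zip(p, q):
        diff = p_i - q_i
        partial += diff * diff
        if partial > limit:
            return False  # later terms are non-negative, so the sum only grows
    return True
```

For instance, `within_distance([0, 0], [3, 4], 5)` holds because the squared distance 25 does not exceed 25, while a threshold of 4.9 fails on the second coordinate.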

江南烟雨〆相思醉 2024-08-22 07:31:31

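If your similarity measure is transitive, you don't have to compute the similarity for all pairs of objects, since for objects a, b, c:

similarity(a,c) = similarity(a,b) op similarity(b,c)

where op is a binary operator, e.g. multiplication or addition.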

弃爱 2024-08-22 07:31:31

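I think the solution depends on a lot more detail about the nature of your problem.

Do you need to find the similar objects for the same object many times, or only once? If it's many times, then building a data structure in which you compute the difference once for each pair, and then link each object to its similar objects so that you can retrieve the list quickly without recalculation, might be a very useful performance enhancement.
What is the nature of the calculation? At one extreme, if the difference is, for example, the difference in height between two people, then keeping the list sorted by height would let you find the similar objects very quickly. I assume the real problem is more complicated than that, but following that logic, if the difference is the sum of several linear quantities, you could create a multi-dimensional array, picture the set of similar objects as those within an n-dimensional sphere (i.e. circle, sphere, hypersphere, etc.) centered on the reference object, and again find them directly. Actually, it occurs to me that if the radius calculations are too complicated or take too much run time, a good approximation would be to create an n-dimensional cube (i.e. square, cube, tesseract, etc.) around the reference object, retrieve all objects that lie within that cube as "candidates", and then do the actual computation only on the candidates.

For example, suppose the "difference" is the sum of the absolute values of the differences of three attributes, say a1, a2, and a3. You could create a 3-dimensional array and set the value of each node of the array to the object with those values, if any. Then, if you want to find all objects whose difference from object o is at most d, you could write: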
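For the one-dimensional case mentioned above (height), a minimal sketch using a sorted list and binary search might look like this. The function name `similar_heights` is a hypothetical choice, and the list is assumed to be kept in ascending order by the caller.

```python
from bisect import bisect_left, bisect_right
from typing import List


def similar_heights(sorted_heights: List[float], h: float, d: float) -> List[float]:
    """Return every value v in sorted_heights with |v - h| <= d.

    sorted_heights must already be in ascending order.
    """
    lo = bisect_left(sorted_heights, h - d)   # first index with v >= h - d
    hi = bisect_right(sorted_heights, h + d)  # first index with v > h + d
    return sorted_heights[lo:hi]


heights = sorted([150, 162, 165, 170, 171, 180, 190])
nearby = similar_heights(heights, 168, 3)
```

Locating the range costs two binary searches, and the result is a contiguous slice, which fits the question's requirement that listing the set be no slower than walking an array.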

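// Scan the cube of half-width d centered on o; each loop tests its own index.
// Assumes every index stays within the bounds of the array.
for (x1 = o.a1 - d; x1 <= o.a1 + d; ++x1)
{
  for (x2 = o.a2 - d; x2 <= o.a2 + d; ++x2)
  {
    for (x3 = o.a3 - d; x3 <= o.a3 + d; ++x3)
    {
      // Keep only occupied cells whose summed absolute difference is at most d.
      if (array[x1][x2][x3] != null
        && abs(x1 - o.a1) + abs(x2 - o.a2) + abs(x3 - o.a3) <= d)
      {
        ... found a match ...
      }
    }
  }
}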

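I suspect that the difference rules are more complicated than that, but fine, just add sophistication to the algorithm to match the complexity of the rules. The point is to use the array to limit the set of objects that you have to examine.

Again on the nature of the calculation: if one of the elements making up the difference, or some small subset of them, tends to be more significant than the others, then build a data structure that lets you quickly test whether that element is within range. If it is, do the full comparison. If not, you don't even look at the object.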
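When the attribute values span a wide range, a dense 3-dimensional array may be impractical. A sparse variant of the same idea, keyed by attribute triple in a dictionary, might look like the sketch below. The names `build_grid` and `query` are hypothetical, and the attributes are assumed to be integers with d a non-negative integer.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[int, int, int]


def build_grid(points: List[Point]) -> Dict[Point, List[int]]:
    """Map each (a1, a2, a3) triple to the indices of the objects holding it."""
    grid: Dict[Point, List[int]] = defaultdict(list)
    for index, point in enumerate(points):
        grid[point].append(index)
    return grid


def query(grid: Dict[Point, List[int]], centre: Point, d: int) -> List[int]:
    """Return indices of objects whose summed absolute difference from centre is <= d."""
    matches: List[int] = []
    # Enumerate the cube of offsets, then keep only those inside the L1 ball.
    for offset in product(range(-d, d + 1), repeat=3):
        if sum(abs(c) for c in offset) > d:
            continue
        cell = tuple(c + o for c, o in zip(centre, offset))
        matches.extend(grid.get(cell, ()))  # .get avoids creating empty cells
    return sorted(matches)


points = [(0, 0, 0), (1, 1, 0), (2, 0, 0), (5, 5, 5), (1, 1, 1)]
grid = build_grid(points)
```

The query cost grows with the cube volume (2d + 1)^3 rather than with the number of objects, so this only pays off when d is small compared with the spread of the data.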
