How to find the nearest vector in {0,1,2}^12, over and over again

Posted 2024-10-03 10:38:06


I'm searching a space of vectors of length 12, with entries 0, 1, 2. For example, one such vector is
001122001122. I have about a thousand good vectors, and about a thousand bad vectors. For each bad vector I need to locate the closest good vector. Distance between two vectors is just the number of coordinates which don't match. The good vectors aren't particularly nicely arranged, and the reason they're "good" doesn't seem to be helpful here. My main priority is that the algorithm be fast.

If I do a simple exhaustive search, I have to calculate about 1000*1000 distances. That seems pretty thick-headed.

If I apply Dijkstra's algorithm first using the good vectors, I can calculate the closest vector and minimal distance for every vector in the space, so that each bad vector requires a simple lookup. But the space has 3^12 = 531,441 vectors in it, so the precomputation is half a million distance computations. Not much savings.

Can you help me think of a better way?

Edit: Since people asked earnestly what makes them "good": Each vector represents a description of a hexagonal picture of six equilateral triangles, which is the 2D image of a 3D arrangement of cubes (think generalized Q-bert). The equilateral triangles are halves of faces of cubes (45-45-90), tilted into perspective. Six of the coordinates describe the nature of the triangle (perceived floor, left wall, right wall), and six coordinates describe the nature of the edges (perceived continuity, two kinds of perceived discontinuity). The 1000 good vectors are those that represent hexagons that can be witnessed when seeing cubes-in-perspective. The reason for the search is to apply local corrections to a hex map full of triangles...


5 Answers

只为一人 2024-10-10 10:38:07


My computational geometry is VERY rough, but it seems that you should be able to:

  1. Calculate the Voronoi diagram for your set of good vectors.
  2. Calculate the BSP tree for the cells of the diagram.

The Voronoi diagram will give you a 12-dimensional convex cell for each good vector, containing all the points that are closest to that vector.

The BSP tree will give you a fast way to determine which cell a vector lies within and, therefore, which good vector it is closest to.

EDIT: I just noticed that you are using Hamming distance instead of Euclidean distance. I'm not sure how this could be adapted to fit that constraint. Sorry.

梦回梦里 2024-10-10 10:38:07


Assuming a packed representation for the vectors, one distance computation (comparing one good vector and one bad vector to yield the distance) can be completed in roughly 20 clock cycles or less. Hence a million such distance calculations can be done in 20 million cycles, or (assuming a 2 GHz CPU) 0.01 seconds. Do these numbers help?

PS: 20 cycles is a conservative overestimate.
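
For concreteness, here is a minimal C sketch of one such packed representation. The 2-bits-per-coordinate layout, the helper names, and the popcount trick are my assumptions, not anything this answer specifies; it only illustrates how one distance can come down to a few instructions:

#include <stdint.h>
#include <stdio.h>

/* Pack a length-12 vector with entries in {0,1,2} into 24 bits,
   2 bits per coordinate. */
static uint32_t pack(const int v[12]) {
    uint32_t p = 0;
    for (int i = 0; i < 12; i++)
        p |= (uint32_t)v[i] << (2 * i);
    return p;
}

/* Number of mismatching coordinates: after XOR, a coordinate's 2-bit
   field is nonzero iff the entries differ; fold each field's two bits
   into one flag bit and count the flags (GCC/Clang builtin). */
static int distance(uint32_t a, uint32_t b) {
    uint32_t x = a ^ b;
    uint32_t mismatch = (x | (x >> 1)) & 0x555555u;
    return __builtin_popcount(mismatch);
}

int main(void) {
    int u[12] = {0,0,1,1,2,2,0,0,1,1,2,2};
    int w[12] = {0,1,1,1,2,0,0,0,1,2,2,2};
    printf("%d\n", distance(pack(u), pack(w))); /* prints 3 */
    return 0;
}

The XOR-fold-popcount sequence is a handful of instructions, which is roughly where an estimate like 20 cycles per comparison comes from.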

情丝乱 2024-10-10 10:38:06


Just to keep things in perspective, and to be sure you are not optimizing unnecessarily: the brute-force approach without any optimization takes 12 seconds on my machine.

Code in Mathematica:

(* random test data; entries lie in {0,1,2}, as in the problem *)
bad = Table[RandomInteger[2, 12], {1000}];
good = Table[RandomInteger[2, 12], {1000}];

(* number of coordinates at which a and b disagree *)
distance[a_, b_] := Total[Sign@Abs[a - b]];

(* for each bad vector, the index of the closest good vector *)
bestMatch = Table[
    First@Ordering[
      Table[distance[good[[j]], bad[[i]]], {j, Length@good}], 1],
    {i, Length@bad}] // Timing

As you might expect, the time follows an O(n^2) law:

[plot: measured running time versus the number of vectors, following the quadratic trend]

爱的那么颓废 2024-10-10 10:38:06


This sounds a lot like what spellcheckers have to do. The trick is generally to abuse tries.

The most basic thing you can do is build a trie over the good vectors, then do a flood-fill prioritizing branches with few mismatches. This will be very fast when there is a nearby vector, and degenerate to brute force when the closest vector is very far away. Not bad.
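
As a rough illustration of this single-trie variant, here is a C sketch. It substitutes depth-first branch-and-bound (prune a branch once its mismatch count reaches the best distance found so far) for the priority-queue flood-fill, and the node layout and every name in it are my own assumptions:

#include <stdio.h>
#include <stdlib.h>

/* Ternary trie over length-12 vectors with entries in {0,1,2}. */
typedef struct Node {
    struct Node *child[3];   /* one branch per symbol */
} Node;

static Node *node_new(void) { return calloc(1, sizeof(Node)); }

static void trie_insert(Node *root, const int v[12]) {
    for (int i = 0; i < 12; i++) {
        if (!root->child[v[i]]) root->child[v[i]] = node_new();
        root = root->child[v[i]];
    }
}

/* Depth-first search; any branch whose running mismatch count already
   reaches the best distance found so far is pruned. */
static void search(const Node *n, const int q[12], int depth,
                   int err, int *best) {
    if (err >= *best) return;            /* bound: cannot improve */
    if (depth == 12) { *best = err; return; }
    /* Try the matching symbol first so *best shrinks quickly. */
    for (int d = 0; d < 3; d++) {
        int c = (q[depth] + d) % 3;
        if (n->child[c])
            search(n->child[c], q, depth + 1, err + (c != q[depth]), best);
    }
}

static int nearest_distance(const Node *root, const int q[12]) {
    int best = 13;                       /* larger than any real distance */
    search(root, q, 0, 0, &best);
    return best;
}

int main(void) {
    int good1[12] = {0,0,1,1,2,2,0,0,1,1,2,2};
    int good2[12] = {2,2,2,2,2,2,2,2,2,2,2,2};
    int bad[12]   = {0,0,1,1,2,2,0,0,1,1,2,0};
    Node *root = node_new();
    trie_insert(root, good1);
    trie_insert(root, good2);
    printf("%d\n", nearest_distance(root, bad)); /* prints 1 */
    return 0;
}

Recovering which good vector achieves the minimum only requires recording the path whenever the best distance improves; it is left out to keep the sketch short.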

But I think you can do better. Bad vectors which share the same prefix will do the same initial branching work, so we can try to share that as well. So we also build a trie over the bad vectors and sort of do them all at once.

No guarantees this is correct, since both the algorithm and code are off the top of my head:

var goodTrie = new Trie(goodVectors)
var badTrie = new Trie(badVectors)
var result = new Map<Vector, Vector>()
var pq = new PriorityQueue(x => x.error)
pq.Add(new {good: goodTrie, bad: badTrie, error: 0})
while pq.Count > 0
  var {g, b, e} = pq.Dequeue()
  if b.Count == 0:
      //all leaves of this path have already been removed
      continue
  if b.IsLeaf:
      //the first time a bad leaf is dequeued, its error is minimal
      result[b.Item] = g.Item
      badTrie.Remove(b) //prevent redundant results
  else:
      //We are zipping down both tries in lockstep. Branch to all symbol pairs.
      pq.EnqueueAll(from i in {0,1,2}
                    from j in {0,1,2}
                    where g[i] != null && b[j] != null
                    select new {good: g[i], bad: b[j], error: e + (i == j ? 0 : 1)})

return result

A final optimization might be to re-order the vectors so positions with high agreement among the bad vectors come first and share more work.

酷炫老祖宗 2024-10-10 10:38:06


3^12 isn't a very large search space. If speed is essential and generality of the algorithm is not, you could just map each vector to an int in the range 0..531440 and use it as an index into a precomputed table of "nearest good vectors".

If you gave each entry in that table a 32-bit word (which is more than enough), you'd be looking at about 2 MB for the table, in exchange for pretty much instantaneous "calculation".

Edit: This is not much different from the precomputation the question suggests, but my point is just that depending on the application, there's not necessarily any problem with doing it that way, especially if you do all the precalculation before the application even runs.
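
A hedged C sketch of how such a table might be built and consulted; the base-3 encoding, all names, and the naive fill loop are mine (the answer only calls for some precomputed mapping):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { N = 12, SPACE = 531441 };   /* 3^12 cells in the whole space */

/* Base-3 encoding: vector -> index in 0..531440. */
static uint32_t encode(const int v[N]) {
    uint32_t idx = 0;
    for (int i = 0; i < N; i++)
        idx = idx * 3 + v[i];
    return idx;
}

static int distance(const int a[N], const int b[N]) {
    int d = 0;
    for (int i = 0; i < N; i++)
        d += (a[i] != b[i]);
    return d;
}

/* nearest[encode(v)] = index into good[] of the closest good vector.
   The naive fill below costs 3^12 * ngood distance checks; a
   multi-source BFS over the Hamming graph would be cheaper, but either
   way it runs once, before the application starts. */
static uint32_t *build_table(int good[][N], int ngood) {
    uint32_t *nearest = malloc(SPACE * sizeof *nearest);
    int v[N] = {0};                  /* walks the space in index order */
    for (uint32_t idx = 0; idx < SPACE; idx++) {
        int best = N + 1, arg = 0;
        for (int g = 0; g < ngood; g++) {
            int d = distance(v, good[g]);
            if (d < best) { best = d; arg = g; }
        }
        nearest[idx] = arg;
        /* increment v as a base-3 counter, least significant digit last */
        for (int i = N - 1; i >= 0 && ++v[i] == 3; i--) v[i] = 0;
    }
    return nearest;
}

int main(void) {
    static int good[1000][N];
    srand(1);                        /* placeholder good vectors */
    for (int g = 0; g < 1000; g++)
        for (int i = 0; i < N; i++)
            good[g][i] = rand() % 3;
    uint32_t *nearest = build_table(good, 1000);
    int bad[N] = {0,0,1,1,2,2,0,0,1,1,2,2};
    printf("nearest good index: %u\n", nearest[encode(bad)]);
    free(nearest);
    return 0;
}

Once the table exists, each bad vector costs a single array read: nearest[encode(bad)].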
