Combinatorial optimisation of a distance metric
I have a set of trajectories, made up of points along the trajectory, and with the coordinates associated with each point. I store these in a 3D array (trajectory, point, param). I want to find the set of r trajectories that have the maximum accumulated distance between the possible pairwise combinations of these trajectories. My first attempt, which I think is working, looks like this:
import itertools
import numpy

max_dist = 0
for h in itertools.combinations(xrange(num_traj), r):
    accum = 0.
    # accumulate the distance over every pair of trajectories in h
    for (m, l) in itertools.combinations(h, 2):
        for (i, j) in itertools.izip(range(k), range(k)):
            A = [(my_mat[m, i, z] - my_mat[l, j, z])**2
                 for z in xrange(k)]
            A = numpy.array(numpy.sqrt(A)).sum()
            accum += A
    if max_dist < accum:
        max_dist = accum  # without this update every combination passes the test
        selected_trajectories = h
This takes forever, as num_traj can be around 500-1000, and r can be around 5-20. k is arbitrary, but can typically be up to 50.
Trying to be super-clever, I have put everything into two nested list comprehensions, making heavy use of itertools:
chunk = [[numpy.sqrt((my_mat[m, i, :] - my_mat[l, j, :])**2).sum()
          for ((m, l), i, j) in
          itertools.product(itertools.combinations(h, 2), range(k), range(k))]
         for h in itertools.combinations(range(num_traj), r)]
Apart from being quite unreadable (!!!), it is also taking a long time. Can anyone suggest any ways to improve on this?
Comments (5)
Rather than recalculating the distance between each pair of trajectories on demand, you can start by calculating the distance between all pairs of trajectories. You can store those in a dictionary and look them up as needed.

This way your inner loop

for (i,j) ...

will be replaced with a constant-time lookup.
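A minimal sketch of that precomputation, reusing the question's my_mat, num_traj and r and the question's own distance formula (the names all_pair_distances and pair_dist are illustrative, not from the comment):

import itertools
import numpy as np

def all_pair_distances(my_mat):
    """Precompute the question's accumulated distance for every pair
    of trajectories. my_mat has shape (trajectory, point, param)."""
    num_traj = my_mat.shape[0]
    pair_dist = {}
    for m, l in itertools.combinations(range(num_traj), 2):
        # the same quantity the question's inner loops compute:
        # sqrt of each squared difference, summed over points and params
        pair_dist[(m, l)] = np.sqrt((my_mat[m] - my_mat[l]) ** 2).sum()
    return pair_dist

pair_dist = all_pair_distances(my_mat)
max_dist, selected_trajectories = 0.0, None
for h in itertools.combinations(range(num_traj), r):
    # constant-time lookups instead of recomputing each pair's distance
    accum = sum(pair_dist[pair] for pair in itertools.combinations(h, 2))
    if accum > max_dist:
        max_dist, selected_trajectories = accum, h

This removes the per-combination distance work, but the C(num_traj, r) outer loop remains; that outer loop is the real bottleneck the later comments address.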
You can ditch the square-root calculation in the distance computation... the maximal sum will also have the maximal squared sum, although that only yields a constant speedup.
Here are a few points of interest and suggestions in addition to what everyone else has mentioned. (By the way, mathmike's suggestion of generating a look-up list of all pair distances is one that you should put in place immediately. It gets rid of an O(r^2) factor in your algorithm's complexity.)
First, the lines

for ( i, j ) in itertools.izip ( range(k), range(k) ):

can be replaced with

for i in xrange(k):

because i and j are always the same in every loop. There's no need to use izip at all here.
Second, regarding the line

A = numpy.array( numpy.sqrt (A) ).sum()

Are you sure this is how you want to calculate it? Possibly so, but it just struck me as odd, because if this were more of a Euclidean distance between vectors then the line would be:

A = numpy.sqrt( sum( (my_mat[m, i, z] - my_mat[l, j, z])**2 for z in xrange(k) ) )

or just

A = sum( (my_mat[m, i, z] - my_mat[l, j, z])**2 for z in xrange(k) ) ** 0.5

because I would think that converting A to a numpy array to use numpy's sum function would be slower than just using the built-in Python sum function, but I could be wrong. Also, if it truly is a Euclidean distance that you want, then you will be doing fewer square roots this way.
Third, do you realize how many potential combinations you may be trying to iterate over? For the worst case with num_traj = 1000 and r = 20, that is approximately 6.79E42 combinations by my estimate. That's quite intractable with your current method. Even for the best case of num_traj = 500 and r = 5, that's 1.28E12 combinations, which is quite a few but not impossible. This is the real problem here, because once you take mathmike's advice, the first two points I mentioned aren't very important.
What can you do then? Well, you will need to be a little more clever. It isn't clear to me yet what would be a great method to use for this. I'm guessing that you will need to make the algorithm heuristic in some way. One thought I had was to try a dynamic-programming sort of approach with a heuristic. For each trajectory you could find the sum or mean of the distances for every pairing of it with another trajectory and use this as a fitness measure. Some of the trajectories with the lowest fitness measures could be dropped before moving on to trios. You could then do the same thing with trios: find the sum or mean of the accumulated distances for all trios (among the remaining trajectories) that each trajectory is involved in, and use that as the fitness measure to decide which ones to drop before moving on to foursomes. It doesn't guarantee the optimal solution, but it should be quite good, and I believe it will greatly lower the time complexity.
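A rough sketch of that pruning idea, assuming a precomputed pairwise-distance matrix D as suggested above (the function name and the keep schedule are illustrative assumptions, not code from this comment):

import itertools

def prune_then_search(D, r, keep):
    """D: precomputed pairwise-distance matrix (D[a][b], symmetric).
    keep: how many trajectories survive each scoring round; its length
    must be r - 2. Both names are illustrative."""
    alive = list(range(len(D)))
    # score pairs, then trios, ... up to size r - 1, dropping stragglers
    for size, n_keep in zip(range(2, r), keep):
        fitness = dict.fromkeys(alive, 0.0)
        for sub in itertools.combinations(alive, size):
            score = sum(D[a][b] for a, b in itertools.combinations(sub, 2))
            for t in sub:
                fitness[t] += score  # credit the subset's score to each member
        alive = sorted(alive, key=fitness.get, reverse=True)[:n_keep]
    # exhaustive search over the much smaller surviving set
    return max(itertools.combinations(alive, r),
               key=lambda h: sum(D[a][b]
                                 for a, b in itertools.combinations(h, 2)))

As the comment says, this doesn't guarantee the optimum; shrinking alive early is what keeps the later, larger-subset rounds affordable.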
It's likely to take forever anyhow, as your algorithm takes about O(C(N, r) * r^2), where C(N, r) is N choose r. For smaller r (or N) this might be okay, but if you absolutely need to find the max, as opposed to using an approximation heuristic, you should try branch and bound with different strategies. This might work for smaller r, and it saves you quite a bit on unnecessary recalculations.
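For concreteness, here is one possible branch-and-bound sketch over a precomputed distance matrix D; the simple d_max bound is just one of the "different strategies" one might try, and the whole function is an illustrative assumption rather than code from this comment:

import itertools

def best_subset_bnb(D, num_traj, r):
    """Exact search for the r-subset with maximum accumulated pairwise
    distance; D[a][b] is a precomputed distance matrix."""
    d_max = max(D[a][b] for a, b in itertools.combinations(range(num_traj), 2))
    best_score, best = -1.0, None

    def pairs_left(s):
        # pairs still unformed once s of the r trajectories are fixed
        return r * (r - 1) // 2 - s * (s - 1) // 2

    def extend(start, chosen, score):
        nonlocal best_score, best
        if len(chosen) == r:
            if score > best_score:
                best_score, best = score, tuple(chosen)
            return
        # optimistic bound: every remaining pair scores at most d_max
        if score + pairs_left(len(chosen)) * d_max <= best_score:
            return  # prune the whole branch
        for t in range(start, num_traj - (r - len(chosen)) + 1):
            gain = sum(D[t][c] for c in chosen)
            chosen.append(t)
            extend(t + 1, chosen, score + gain)
            chosen.pop()

    extend(0, [], 0.0)
    return best, best_score

Tighter bounds (for example, using the largest distances incident to each not-yet-chosen trajectory) prune more branches at the cost of extra bookkeeping.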
This sounds like a "weighted clique" problem: find e.g.
r=5 people in a network with maximum compatibility / max sum of C(5,2) pair weights.
Google "weighted clique" algorithm -"clique percolation" → 3k hits.
BUT I would go with Justin Peel's method
because it's understandable and controllable
(take the n2 best pairs, from them the best n3 triples ...
adjust n2 n3 ... to easily trade off runtime / quality of results.)
Added 18 May, a cut at an implementation follows.
@Jose, it would be interesting to see what nbest[] sequence works for you.
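The implementation added to the original answer is not reproduced on this page; as a stand-in, here is a minimal sketch of the pairs-to-triples beam idea with a tunable nbest schedule (all names and the example schedule are assumptions):

import itertools

def beam_best_subset(D, num_traj, r, nbest):
    """Beam search: keep the nbest[0] best pairs, extend to triples and
    keep nbest[1], ... up to size r. len(nbest) must be r - 1.
    D[a][b] is a precomputed pairwise-distance matrix."""
    # every pair, scored by its distance
    beam = [((a, b), D[a][b])
            for a, b in itertools.combinations(range(num_traj), 2)]
    beam = sorted(beam, key=lambda x: x[1], reverse=True)[:nbest[0]]
    for width in nbest[1:]:
        grown = {}
        for subset, score in beam:
            for t in range(num_traj):
                if t in subset:
                    continue
                new = tuple(sorted(subset + (t,)))
                if new not in grown:
                    # t adds one pair with each current member
                    grown[new] = score + sum(D[t][s] for s in subset)
        beam = sorted(grown.items(), key=lambda x: x[1], reverse=True)[:width]
    return beam[0]  # (subset, score) with r trajectories

For example, beam_best_subset(D, num_traj, r=5, nbest=[1000, 300, 100, 30]); widening the nbest schedule trades runtime for result quality, which is exactly the knob this answer describes tuning.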