最相距的 k 个元素(聚类?)
我有一个简单的机器学习问题:
我有 n (~110) 个元素,以及所有成对距离的矩阵。我想选择相距最远的 10 个元素。也就是说,我想要
Maximize:
Choose 10 different elements.
Return min distance over (all pairings within the 10).
我的距离度量是对称的并且尊重三角不等式。
我可以使用什么样的算法?我的第一反应是执行以下操作:
- 将 n 个元素聚类为 20 个 集群。
- 将每个簇替换为 该簇的元素是 距平均元素最远 原来的n.
- 使用蛮力来解决 剩下20个的问题 候选人。幸运的是,20选10是 只有 184,756。
编辑:感谢 etarion 的富有洞察力的评论,将优化问题陈述中的“返回(距离)总和”更改为“返回最小距离”。
I have a simple machine learning question:
I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are most far apart. That is, I want to
Maximize:
Choose 10 different elements.
Return min distance over (all pairings within the 10).
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
- Cluster the n elements into 20
clusters. - Replace each cluster with just the
element of that cluster that is
furthest from the mean element of
the original n. - Use brute force to solve the
problem on the remaining 20
candidates. Luckily, 20 choose 10 is
only 184,756.
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
好问题。
我不确定是否可以以有效的方式准确解决它,并且您基于集群的解决方案似乎是合理的。另一个值得关注的方向是局部搜索方法,例如模拟退火和爬山。
这是一个明显的基线,我将与任何其他解决方案进行比较:
Nice question.
I'm not sure if it can be solved exactly in an efficient manner, and your clustering based solution seems reasonable. Another direction to look at would be local search method such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
以下是如何通过采用凸松弛来解决这个组合优化问题。
令 D 为上三角矩阵,距离位于上三角上。即,我< j, D_i,j 是元素 i 和 j 之间的距离。 (大概,对角线上也会有零。)
那么您的目标是最大化 x'*D*x,其中 x 是二进制值,其中 10 个元素设置为 1,其余元素设置为 0。(设置第 i 个元素)输入 x 到 1 类似于选择第 i 个元素作为 10 个元素之一。)处理
此类组合问题的“标准”凸优化是放宽约束,使得 x 不需要为离散值。这样做会给我们带来以下问题:
最大化 y'*D*y
服从: 0 <= y_i <= 1 对于所有 i,1'*y = 10
这是(道德上)一个二次规划。 (如果我们将 D 替换为 D + D',它将成为一个真正的二次规划,并且您得到的 y 应该没有什么不同。)您可以使用现成的 QP 求解器,或者只需将其插入到您选择的凸优化求解器(例如 cvx)。
您得到的 y 不一定是(也可能不会)二进制向量,但您可以通过多种方式将标量值转换为离散值。 (最简单的可能是让 x 在 y_i 最高的 10 个条目中为 1,但您可能需要做一些更复杂的事情。)在任何情况下,y'*D*y 与您得到的 y 确实给出您为 x'*D*x 的最佳值设定了上限,因此,如果您从 y 构造的 x 的 x'*D*x 非常接近 y'*D*y,您会对您的近似值感到非常满意。
如果有任何不清楚的地方,无论是符号还是其他,请告诉我。
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
Let me know if any of this is unclear, notation or otherwise.