How can I calculate the centroid in k-means++ by using distances?
I am using the k-means++ clusterer from Apache Commons Math in an interactive genetic algorithm to reduce the number of individuals that are evaluated by the user.
Commons Math makes it very easy to use. The user only needs to implement the Clusterable
interface. It has two methods:
double distanceFrom(T p), which is quite clear, and T centroidOf(Collection<T> p), which lets the user pick the centroid of a cluster.
If used on Euclidean points, the centroid is very easy to calculate. But on chromosomes it is quite difficult, because their meaning is not always clear.
My question: Is there an efficient generic way to pick the centroid that does not depend on the problem domain? (E.g. by using the distance.)
EDIT
Ok, here is my code for the centroid calculation.
The idea: the point with the lowest total distance to all other points is the one nearest to the centroid.
public T centroidOf(Collection<T> c) {
    double minDist = Double.MAX_VALUE;
    T minP = null;
    // iterate through c
    final Iterator<T> it = c.iterator();
    while (it.hasNext()) {
        // test every point p1
        final T p1 = it.next();
        double totalDist = 0d;
        for (final T p2 : c) {
            // sum up the distance to all points p2 != p1
            if (p2 != p1) {
                totalDist += p1.distanceFrom(p2);
            }
        }
        // if the current total distance is lower than the min, take it as the new min
        if (totalDist < minDist) {
            minDist = totalDist;
            minP = p1;
        }
    }
    return minP;
}
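As a sanity check, the same medoid search can be run on hypothetical 1-D points without any Commons Math dependency. The `Point1D` class below is a minimal stand-in for the `Clusterable` contract (its `distanceFrom` is just the absolute difference); it is an assumption for illustration, not part of the library:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class MedoidCheck {
    // Minimal stand-in for the Clusterable contract used above.
    static class Point1D {
        final double x;
        Point1D(double x) { this.x = x; }
        double distanceFrom(Point1D p) { return Math.abs(x - p.x); }
    }

    // Same idea as the centroidOf above: pick the point whose
    // total distance to all other points is minimal.
    static Point1D centroidOf(Collection<Point1D> c) {
        double minDist = Double.MAX_VALUE;
        Point1D minP = null;
        for (Point1D p1 : c) {
            double totalDist = 0d;
            for (Point1D p2 : c) {
                if (p2 != p1) {
                    totalDist += p1.distanceFrom(p2);
                }
            }
            if (totalDist < minDist) {
                minDist = totalDist;
                minP = p1;
            }
        }
        return minP;
    }

    public static void main(String[] args) {
        List<Point1D> points = Arrays.asList(
                new Point1D(0), new Point1D(1), new Point1D(10));
        // total distances: 0 -> 11, 1 -> 10, 10 -> 19
        System.out.println(centroidOf(points).x); // prints 1.0
    }
}
```

Note the search is O(n²) per call, since every point's distance to every other point is summed.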
k-means requires an averaging metric (e.g., Euclidean). Without defining such a metric and space, you don't even know whether the average of the points is actually a point inside the space.
You could, however, use k-medoids, which considers only the original points as candidates for the medoids (while k-means finds means/centroids which are not necessarily located on the original points). The algorithm looks for the point that minimizes the pairwise dissimilarities (i.e.,
distanceFrom
).
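To see the difference on a hypothetical 1-D example: the mean can fall between the input points, while the medoid is always one of them. This is a sketch, not library code:

```java
import java.util.Arrays;
import java.util.List;

public class MeanVsMedoid {
    // Sum of distances from p to every point in the list.
    static double totalDist(double p, List<Double> points) {
        return points.stream().mapToDouble(q -> Math.abs(p - q)).sum();
    }

    public static void main(String[] args) {
        List<Double> points = Arrays.asList(0.0, 1.0, 10.0);

        // k-means centroid: the arithmetic mean, which need not be
        // one of the original points.
        double mean = points.stream()
                .mapToDouble(Double::doubleValue).average().getAsDouble();

        // k-medoids: the original point minimizing the summed distance.
        double medoid = points.stream()
                .min((a, b) -> Double.compare(totalDist(a, points),
                                              totalDist(b, points)))
                .get();

        System.out.println(mean);   // 3.666..., not in the input set
        System.out.println(medoid); // 1.0, an actual input point
    }
}
```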