K-Nearest Neighbors Algorithm

Posted on 2024-10-16 01:37:56

Comments (4)

锦上情书 2024-10-23 01:37:56

Which of these two or more objects should be chosen as the 5th nearest neighbor?

It really depends on how you want to implement it.

Most algorithms will do one of three things:

  1. Include all equidistant points, so for this estimate they'll use 6 points, not 5.
  2. Use whichever of the two equidistant points was found "first".
  3. Pick one of the 2 tied points at random (usually with a consistent seed, so results are reproducible).

That being said, most algorithms based on radial searching have an inherent assumption of stationarity, in which case it really shouldn't matter which of the options above you choose. In general, any of them should, theoretically, provide a reasonable default (especially since the tied points are the farthest ones used in the estimate, and should therefore have the lowest effective weights).
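
The three options above can be sketched as follows. This is a minimal illustration and not the code of any particular library; the function name `k_nearest`, the `ties` parameter, and the sample points are assumptions made up for the example. It compares distances with exact equality, which only works for tidy inputs like these; with real-valued data a tolerance would probably be needed.

```python
import math
import random

def k_nearest(points, query, k, ties="first", seed=0):
    """Return indices of the k nearest points, resolving ties at the k-th distance.

    ties="all"    keeps every point tied at the k-th distance (may return more than k)
    ties="first"  keeps tied points in input order until k slots are filled
    ties="random" fills the remaining slots from the tied points with a seeded RNG
    """
    dists = [math.dist(p, query) for p in points]
    # sorted() is stable, so equal distances keep their input order.
    order = sorted(range(len(points)), key=lambda i: dists[i])
    if k >= len(order):
        return order
    cutoff = dists[order[k - 1]]  # distance of the k-th neighbor
    closer = [i for i in order if dists[i] < cutoff]
    tied = [i for i in order if dists[i] == cutoff]  # exact equality: an assumption
    slots = k - len(closer)
    if ties == "all":
        return closer + tied
    if ties == "first":
        return closer + tied[:slots]
    if ties == "random":
        return closer + random.Random(seed).sample(tied, slots)
    raise ValueError(f"unknown tie policy: {ties!r}")

# Hypothetical data: the 5th and 6th points are both at distance 5 from the origin.
points = [(1, 0), (0, 2), (3, 0), (0, 4), (5, 0), (0, 5), (9, 9)]
all_tied = k_nearest(points, (0, 0), 5, ties="all")       # 6 indices
first_found = k_nearest(points, (0, 0), 5, ties="first")  # 5 indices
seeded = k_nearest(points, (0, 0), 5, ties="random", seed=42)
```

With `ties="all"` the estimate uses 6 points instead of 5, as described in option 1; the other two policies always return exactly 5.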

吝吻 2024-10-23 01:37:56

Another interesting option is to use the nearest neighbors like this:

  • Calculate the distances from the sample to the 5 nearest neighbors in each class: you will have 5 distances per class.

  • Then compute the mean distance for each class.

  • Assign the sample to the class with the lowest mean distance.

This approach is effective for datasets whose classes overlap.
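
The steps above could look roughly like this. It is a sketch under assumptions: the function name `classify_by_mean_distance` and the toy data are invented for illustration, Euclidean distance is assumed, and a class with fewer than k members simply contributes all of its points.

```python
import math
from collections import defaultdict

def classify_by_mean_distance(samples, labels, query, k=5):
    """Assign `query` to the class whose k nearest members are closest on average."""
    by_class = defaultdict(list)
    for point, label in zip(samples, labels):
        by_class[label].append(math.dist(point, query))

    mean_dist = {}
    for label, dists in by_class.items():
        nearest = sorted(dists)[:k]  # fewer than k if the class is small
        mean_dist[label] = sum(nearest) / len(nearest)

    best = min(mean_dist, key=mean_dist.get)
    return best, mean_dist

# Hypothetical two-class data on a line; k=2 to keep the example small.
samples = [(1, 0), (2, 0), (3, 0), (4, 0)]
labels = ["a", "a", "b", "b"]
predicted, means = classify_by_mean_distance(samples, labels, (0, 0), k=2)
```

Because every class always contributes a mean distance, a tied vote count cannot occur, although two classes could still have identical means.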

梦巷 2024-10-23 01:37:56

If you have another distance function, you can use it to break the tie. Even a bad one can do the job, and it works better if you have some heuristics. For instance, if you know that one of the features used to compute your main distance is more significant, use only that feature to resolve the tie.

If that's not the case, pick at random. Then run your program several times on the same test set to check whether the random choice matters.
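
One way to picture the secondary-distance idea is to sort neighbors by a pair of keys. The sketch below assumes Euclidean distance as the main distance and the absolute difference on a single feature as the tie-breaker; `k_nearest_with_tiebreak` and `important_feature` are hypothetical names, not part of any library.

```python
import math

def k_nearest_with_tiebreak(points, query, k, important_feature=0):
    """Return indices of the k nearest points.

    Points are ranked by the main distance first; equal main distances are
    ordered by the distance along one feature assumed to be more significant.
    """
    def sort_key(i):
        primary = math.dist(points[i], query)
        secondary = abs(points[i][important_feature] - query[important_feature])
        return (primary, secondary)

    return sorted(range(len(points)), key=sort_key)[:k]

# Hypothetical data: (3, 4) and (4, 3) are both at distance 5 from the origin.
points = [(3, 4), (4, 3), (0, 1)]
by_x = k_nearest_with_tiebreak(points, (0, 0), 2, important_feature=0)
by_y = k_nearest_with_tiebreak(points, (0, 0), 2, important_feature=1)
```

Which of the two tied points makes it into the result depends only on the feature chosen as the tie-breaker, so the outcome is deterministic and needs no random seed.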

红墙和绿瓦 2024-10-23 01:37:56

If you have k=5, you look at the five nearest records and take the most common result among them. It's quite possible to get two pairs (two classes with two votes each), which puts you in a bind, because each pair then has a 50/50 chance.

That makes life challenging. So how do you pick a value for k? There are metrics you can use to analyze the results after the fact, but there is no strict rule for what k must be. When you are just starting out, I would make it easy on yourself and stick with k=3 instead of k=5, then later look into strategies for optimizing k based on the actual accuracy of your predictions.
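
As one possible way to compare values of k by prediction accuracy, the sketch below uses leave-one-out evaluation with a plain majority vote. The answer does not name a specific metric, so this choice is an assumption, as are the helper names `knn_predict` and `leave_one_out_accuracy` and the toy data.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest (point, label) pairs in `train`."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, most_common returns the label encountered first,
    # i.e. the one belonging to the nearer neighbor.
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Predict each record from all the others and report the hit rate."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

# Hypothetical data: two well-separated clusters of four points each.
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
        ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b"), ((11, 11), "b")]
scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
```

On this toy set small values of k classify every point correctly, while k=7 always pulls in more neighbors from the other cluster than from the point's own, which illustrates why k is worth checking against actual accuracy.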
