Which of these two or more objects should be chosen as the 5th nearest neighbor?
It really depends on how you want to implement it.
Most algorithms will do one of three things:
Include all equidistant points, so for this estimate they'll use 6 points, not 5.
Use the "first" point found among the equidistant ones.
Pick one of the tied points at random (usually with a fixed seed, so results are reproducible).
That being said, most algorithms based on radial searching have an inherent assumption of stationarity, in which case, it really shouldn't matter which of the options above you choose. In general, any of them should, theoretically, provide reasonable defaults (especially since they're the furthest points in the approximation, and should have the lowest effective weightings).
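To make the three options concrete, here is a minimal sketch in plain Python. The function name `k_nearest` and its `tie` and `seed` parameters are my own invention for illustration, not any library's API, and I'm assuming Euclidean distance.

```python
import math
import random

def k_nearest(points, query, k, tie="include", seed=0):
    """Return indices of the k nearest points, handling ties at the k-th distance.

    tie="include": keep every point tied at the k-th distance (may return more than k).
    tie="first":   keep tied points in input order until k are selected.
    tie="random":  pick among the tied points with a seeded RNG.
    """
    dists = [math.dist(p, query) for p in points]
    order = sorted(range(len(points)), key=lambda i: dists[i])  # stable sort
    if len(order) <= k:
        return order

    cutoff = dists[order[k - 1]]  # distance of the k-th neighbor
    tied = [i for i in order if math.isclose(dists[i], cutoff)]
    tied_set = set(tied)
    closer = [i for i in order if dists[i] < cutoff and i not in tied_set]
    need = k - len(closer)  # slots left for the tied points

    if tie == "include":
        return closer + tied
    if tie == "first":
        return closer + tied[:need]
    if tie == "random":
        rng = random.Random(seed)  # fixed seed keeps results reproducible
        return closer + rng.sample(tied, need)
    raise ValueError(f"unknown tie strategy: {tie}")

# Hypothetical data: points 4 and 5 are both at distance 5 from the query.
points = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (5, 0), (9, 9)]
print(k_nearest(points, (0, 0), 5, tie="include"))
print(k_nearest(points, (0, 0), 5, tie="first"))
print(k_nearest(points, (0, 0), 5, tie="random", seed=42))
```

With `tie="include"` you get six indices back for k=5, which is exactly the "use 6 points, not 5" case above.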
If you have another distance function, you can use it to break the tie. Even a crude one can do the job, though it works better if you have some heuristic behind it. For instance, if you know that one of the features used to compute your main distance is more significant, use only that feature to resolve the tie.
If that's not the case, pick at random. Then run your program several times on the same test set to check whether the random choice matters.
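Roughly, both ideas could look like the sketch below. The helper names `break_tie_by_feature` and `accuracy_spread` are made up for this example, and `predict` stands in for whatever seeded classifier you already have (assumed to take a sample and a seed).

```python
def break_tie_by_feature(candidates, query, feature):
    """Among equidistant candidates, pick the one closest to the query
    on a single feature that you believe is more significant."""
    return min(candidates, key=lambda p: abs(p[feature] - query[feature]))

def accuracy_spread(predict, test_set, seeds):
    """Run a seeded classifier over the same test set once per seed and
    return the (lowest, highest) accuracy observed."""
    accuracies = []
    for seed in seeds:
        correct = sum(predict(x, seed) == y for x, y in test_set)
        accuracies.append(correct / len(test_set))
    return min(accuracies), max(accuracies)

# (3, 4) and (4, 3) are both at Euclidean distance 5 from the origin.
# If feature 0 matters more, (3, 4) wins because it is closer on that axis.
print(break_tie_by_feature([(3, 4), (4, 3)], (0, 0), feature=0))
```

If `accuracy_spread` returns two values that are close together, the random choice probably doesn't matter much for your data.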
If you have k=5, you look at the five nearest records and take the most common result among those five. You can easily end up with two pairs (a 2-2-1 split), which puts you in a bind, because each pair then has a 50/50 chance of being the right answer.
So that makes life challenging. How do you pick a value for k? There are some metrics you can use to analyze the results after the fact, but there is no strict rule for what k must be. When you're just starting out, I would make it easy on yourself and stick with k=3 instead of k=5, then down the road look into strategies for optimizing the value of k based on the actual accuracy of your predictions.
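One simple way to "look at the actual accuracy" is leave-one-out evaluation over a few candidate values of k. This is only a sketch under my own assumptions: the function names are hypothetical, the data is a toy example, distance is Euclidean, and vote ties go to whichever tied label shows up first among the neighbors.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest (features, label) rows.
    On a tied vote, the label encountered first (nearest first) wins."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each row from all the other rows."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(data)

def best_k(data, candidates=(1, 3, 5, 7)):
    """Score every candidate k and return (best k, all scores)."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data: two well-separated clusters of four points each.
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
        ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b"), ((11, 11), "b")]
k, scores = best_k(data)
print(k, scores)
```

On real data you would want a proper train/test split or k-fold cross-validation, but the idea is the same: try several values of k and keep the one that predicts best.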
Another interesting option is to use the nearest neighbors like this:
Calculate the distances from the sample to the 5 nearest neighbors in each class, so you have 5 distances per class.
Then take the mean distance for each class.
The class with the lowest mean distance is the one you assign to the sample.
This approach is effective for datasets whose classes overlap.
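The steps above can be sketched in a few lines of Python. The function name `classify_by_mean_distance` is hypothetical, distance is assumed to be Euclidean, and if a class has fewer than k members this version simply averages over what it has (that fallback is my own choice, not part of the method as described).

```python
import math
from collections import defaultdict

def classify_by_mean_distance(train, query, k=5):
    """Assign the class whose k nearest members are, on average, closest to the query."""
    dists_by_class = defaultdict(list)
    for features, label in train:
        dists_by_class[label].append(math.dist(features, query))

    mean_dist = {}
    for label, dists in dists_by_class.items():
        nearest = sorted(dists)[:k]  # fewer than k if the class is small
        mean_dist[label] = sum(nearest) / len(nearest)

    # The class with the lowest mean distance wins.
    return min(mean_dist, key=mean_dist.get)

# Toy data, two classes of three points each.
train = [((0, 0), "a"), ((0, 2), "a"), ((2, 0), "a"),
         ((10, 10), "b"), ((10, 12), "b"), ((12, 10), "b")]
print(classify_by_mean_distance(train, (1, 1), k=2))
```

Since every class contributes the same number of neighbors, there is no majority vote to tie, although two classes could still end up with identical mean distances.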