What data do I need to implement k-nearest neighbors?

Published 2024-11-11 12:11:12


I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.

It seems like k-nearest neighbors or k-means is the best way to do this.

I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k-means Wikipedia page), but they don't really make sense to me.

Could someone recommend some pseudocode, or places to look, so I can get a better feel for how to do this?


Comments (5)

唔猫 2024-11-18 12:11:12


K-Nearest Neighbor (aka KNN) is a classification algorithm.

Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this might be the category of a post, who posted it, who upvoted it, etc.

Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating the nearest neighbors to it in the classification system. If you determine the classification by the 3 nearest neighbors, it could then be called a 3-nearest-neighbors algorithm.

How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.

A trivial example:

Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.

If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?

Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.

The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
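The city example above can be sketched as a minimal k-NN classifier in Python. The cities, coordinates, and choice of k are made-up illustrations, and plain Euclidean distance on longitude/latitude is only a rough stand-in for real geographic distance:

```python
import math
from collections import Counter

# Made-up training data: (longitude, latitude) -> country label
cities = [
    ((-74.0, 40.7), "USA"),      # New York
    ((-87.6, 41.9), "USA"),      # Chicago
    ((-118.2, 34.1), "USA"),     # Los Angeles
    ((-79.4, 43.7), "Canada"),   # Toronto
    ((-123.1, 49.3), "Canada"),  # Vancouver
]

def knn_classify(point, training, k=3):
    """Label `point` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# An unknown point near Toronto: 2 of its 3 nearest cities are in the USA
print(knn_classify((-80.0, 43.0), cities))  # -> USA
```

This is exactly the circle-drawing intuition: sorting by distance and taking the first k is the same as growing a circle around the point until it contains k cities.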

佼人 2024-11-18 12:11:12


What you described sounds like a recommender system engine, not a clustering algorithm like k-means, which is in essence an unsupervised approach. I can't form a clear idea of what reddit actually uses, but I found some interesting posts by googling around "recommender + reddit", e.g. "Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed!" Anyway, the k-NN algorithm (described in the top ten data mining algorithms, with pseudocode on Wikipedia) might be used, as might other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
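As a rough sketch of the collaborative-filtering idea (not what reddit or Amazon actually do), one can score posts the user hasn't seen by how similar the users who liked them are to the current user; the users, post ids, and similarity measure here are entirely hypothetical:

```python
# Hypothetical data: user -> set of post ids they liked
likes = {
    "alice": {1, 2, 3},
    "bob":   {2, 3, 4},
    "carol": {3, 4, 5},
}

def jaccard(a, b):
    """Overlap between two sets of liked posts."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, likes, n=2):
    """Score posts liked by similar users but not yet liked by `user`."""
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], other_likes)
        for post in other_likes - likes[user]:
            scores[post] = scores.get(post, 0.0) + similarity
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Post 4 is liked by both similar users, so it outranks post 5
print(recommend("alice", likes))  # -> [4, 5]
```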

请止步禁区 2024-11-18 12:11:12


k-means clustering, in its simplest form, averages values so that each value ends up grouped around one of a set of central averages. Suppose you have the following values:

1,2,3,4,6,7,8,9,10,11,12,21,22,33,40

Now if I do k-means clustering (remember that k-means has an averaging mechanism that pulls each value toward its nearest center), we might get the following:

cluster-1
1,2,3,4,6,7,8,9

cluster-2
10,11,12

cluster-3
21,22

cluster-4
33

cluster-5
40

Remember, I just made up these cluster centers (clusters 1-5).
So the next time you do clustering, each number will end up around whichever of these central means (also known as k-centers) is closest. The data above is one-dimensional.

When you perform k-means clustering on a large, multidimensional data set (multidimensional data here means each point is an array of values, and you may have millions of such arrays of the same dimension), you will need something bigger and scalable. One simple reduction is to first average each array down to a single value, repeat for the other arrays, and then perform k-means clustering on those values.
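The mechanism described above (pick centers, assign each value to the nearest one, re-average) is Lloyd's algorithm; a minimal one-dimensional sketch, with the random initialization and iteration count as assumptions, could look like:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D data: assign each value to its nearest
    center, then recompute each center as the mean of its cluster."""
    centers = random.Random(seed).sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # A cluster that receives no values keeps its old center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

values = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 21, 22, 33, 40]
centers, clusters = kmeans_1d(values, k=5)
print(sorted(round(c, 1) for c in centers))
```

The resulting groups depend on the initial centers, which is why different runs (or seeds) can produce different clusterings of the same data.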

Read one of my questions Here

Hope this helps.

夏九 2024-11-18 12:11:12


To do k-nearest neighbors you mostly need a notion of distance and a way of finding the k nearest neighbours of a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbours at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm: to classify a new point p, find its k nearest neighbours and classify p according to the most popular class amongst those k neighbours.
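A brute-force version of that neighbour search is fine for small data (libraries like ANN replace it with spatial trees for scale); the post feature vectors below are invented purely for illustration:

```python
import heapq
import math

# Hypothetical posts embedded as 2-D feature vectors (e.g. topic weights)
posts = {
    "p1": (0.9, 0.1),
    "p2": (0.8, 0.2),
    "p3": (0.1, 0.9),
}

def k_nearest(query, posts, k=2):
    """Rank every post by Euclidean distance to the query vector
    and return the ids of the k closest ones."""
    return heapq.nsmallest(k, posts, key=lambda p: math.dist(query, posts[p]))

print(k_nearest((1.0, 0.0), posts))  # -> ['p1', 'p2']
```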

I guess in your case, once you decide what 'nearest' means, you could provide somebody with a list of similar posts, then monitor click-through on those suggestions and try to learn from that to predict which of the alternatives would be most popular.

If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.

川水往事 2024-11-18 12:11:12


Here is a very simple example of KNN for the MNIST dataset.
Once you are able to calculate the distance between your documents, the same algorithm would work:

http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/
