A question about some data mining algorithms

Posted 2024-09-30 05:49:22


Recently I have been studying k-nearest neighbour and decision trees, and I am quite curious about the difference between the two. For example, for a task like separating a target function "return 1 if x2 > x1, return 0 otherwise", nearest neighbour seems like a good choice, since a decision tree would involve too many splits. So in what kinds of cases would choosing a decision tree be more appropriate than k-nearest neighbour?

The other question is about k-nearest neighbour itself. I understand that when k = 1 it is just a baseline classifier (it assigns the instance to the class of its nearest neighbour). Could anyone give me an idea of what kind of classification task a 3-nearest-neighbour classifier would definitely outperform a 1-nearest-neighbour classifier on?

Thanks in advance!


Comments (2)

终难愈 2024-10-07 05:49:23


k-NN vs. Decision Tree

I always find a picture is the best way to gain an intuition of an algorithm. The target function you suggest would give rise to a data set a bit like this:

[figure: data set generated by the target function, separated by the line x1 - x2 = 0]

Where the function to separate the data is x1 - x2 = 0. The trouble is that, normally, decision trees only have functions of one variable at the nodes, so the decision functions at the nodes are axis-aligned. I imagine a decision tree learned on this data set would do something like this:

[figure: axis-aligned decision-tree splits approximating the diagonal boundary as a staircase]

Hopefully you get the idea. Obviously you can approximate the optimal decision boundary with enough nodes in a decision tree, but that means you run the risk of overfitting the data.
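To make that concrete, here is a minimal sketch (assuming numpy and scikit-learn are available; the sample sizes and tree depths are arbitrary illustrations) comparing axis-aligned decision trees with a 1-nearest-neighbour classifier on the x2 > x1 task. A shallow tree can only carve a staircase along the diagonal, and buying more accuracy means buying more splits:

```python
# Sketch: axis-aligned trees vs. nearest neighbour on the diagonal task x2 > x1.
# Assumes numpy and scikit-learn; sizes and depths are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))        # two features: x1, x2
y = (X[:, 1] > X[:, 0]).astype(int)          # target: 1 if x2 > x1, else 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (2, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"tree depth {depth:2d}: leaves={tree.get_n_leaves():3d}  "
          f"test acc={tree.score(X_te, y_te):.3f}")

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print(f"1-NN:             test acc={knn.score(X_te, y_te):.3f}")
```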

Actually, I said that decision trees normally use single-variable functions at the nodes, but there is another approach, described in a StackOverflow question about multivariate decision trees (which I failed to answer).

By the way, the best classifier for this kind of data would be a linear classifier, maybe logistic regression, which would find the optimal decision boundary.
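A rough sketch of that point on the same kind of data (again assuming scikit-learn; illustrative only): logistic regression recovers a boundary of the form x2 - x1 = 0 directly from its coefficients.

```python
# Sketch: a linear classifier on the x2 > x1 task finds the diagonal boundary.
# Assumes numpy and scikit-learn; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

clf = LogisticRegression().fit(X, y)
print("coefficients:", clf.coef_[0])      # roughly (-c, +c), i.e. the line x2 - x1 = 0
print("intercept:   ", clf.intercept_[0]) # roughly 0
print("train acc:   ", clf.score(X, y))
```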

The effect of k in k-NN

The best description I can give for k in k-nearest neighbour is that high values of k smooth the decision boundary. It is also not the case that a higher k is always better than a lower one.

To think about k-NN we need a slightly more complicated data set. For k=1, a k-NN model might make decisions a bit like this:

[figure: decision regions of a 1-NN model, with small red and blue islands]

If we increased the value of k, the decisions would be affected by a larger neighbourhood of points and so the decision boundaries would become smoother. In particular, those little red and blue islands would be overwhelmed by the surrounding data points:

[figure: smoother decision boundary with a larger k; the small islands are absorbed]

Whether a higher k is better depends on the level of noise in the dataset. Were those little islands really important, so that the smoother model is too simple and doesn't fit the data very well, or were they just noise that we rightly avoided overfitting to?
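One way to see that trade-off, sketched below under the assumption that scikit-learn is available and with label noise added artificially to stand in for the "islands", is to score several values of k by cross-validation and watch how the smoother models respond to the noise:

```python
# Sketch: how k affects k-NN on noisy data, scored by cross-validation.
# Assumes numpy and scikit-learn; the synthetic data and noise rate are illustrative only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 2))
y = (X[:, 1] > X[:, 0]).astype(int)
flip = rng.random(len(y)) < 0.15           # 15% label noise creates the little "islands"
y[flip] = 1 - y[flip]

for k in (1, 3, 7, 15, 31):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}  mean CV accuracy={scores.mean():.3f}")
```

With noisy labels the small-k models tend to chase the islands and score worse under cross-validation, while a moderate k smooths them away; on clean data the ordering can reverse.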

A practical perspective

Unfortunately, given some large, complex, real-world data set, you probably don't have a very good basis for deciding which algorithm is going to work best (unless you can draw on previous work on the same or similar data). What most people do is carefully split the data into training, parameter-tuning, and test sets and then run as many algorithms as they can think of. You might also find that your particular situation determines some properties that the algorithm must have (fast, incremental, probabilistic, etc.).
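A minimal sketch of that workflow (assuming scikit-learn; the candidate models, parameters, and split sizes below are placeholders, not recommendations): fit each candidate on the training split, pick one on the tuning split, and touch the held-out test split only once at the end.

```python
# Sketch: train / tuning / test split and a simple bake-off of candidate models.
# Assumes numpy and scikit-learn; candidates and split sizes are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3000, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

# 60% training, 20% tuning (validation), 20% held-out test.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = {
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "15-NN": KNeighborsClassifier(n_neighbors=15),
    "tree (depth 5)": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic regression": LogisticRegression(),
}

best_name, best_score = None, -1.0
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    score = model.score(X_val, y_val)
    print(f"{name:<20s} validation acc = {score:.3f}")
    if score > best_score:
        best_name, best_score = name, score

print("selected on validation:", best_name)
print("held-out test accuracy:", round(candidates[best_name].score(X_te, y_te), 3))
```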

随心而道 2024-10-07 05:49:23


This is an answer to the second question.

(I assume that by "definitely outperform" you mean "always outperform".)

I'm not sure it's possible, because, given a data set and a kNN algorithm, for every instance in which the prediction is better with k=3 (versus k=1), it is easy to flip that result either by changing how the model is configured or by varying the data description (in particular, the data density in the solution space).

Here's a simple example. Even though kNN is probably the simplest machine learning algorithm, there are still a few crucial configuration details beyond calculating a distance matrix and then finding the minimum distances in it. One of these configuration parameters is weighting, i.e., how the contribution of each neighboring point to the predicted value is weighted. Some common weighting functions are Gaussian and inverse. For instance, one common weighting function is the 'subtraction function', which, for each neighbor, just subtracts the distance from a constant, as long as the distance is smaller than the constant (otherwise the weight is zero). While this function nicely avoids over-weighting data points very close to the unknown point (the point whose value you are trying to predict), a point's weight approaches zero as its distance from the unknown point approaches the value of the chosen constant. In other words, predictions using k=3 could be much better than with k=1 using this function, but they can also be very nearly the same if two of the three neighboring points are far enough away that their weights approach zero.
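To make the weighting point concrete, here is a small sketch in plain numpy of a distance-weighted kNN prediction using a subtraction-style weight, w = max(constant - distance, 0). The function name, the constant, and the toy data are hypothetical choices for illustration, not part of any standard library.

```python
# Sketch: distance-weighted kNN with a "subtraction" weight, w = max(const - d, 0).
# Pure numpy; the constant, training data, and query point are illustrative only.
import numpy as np

def knn_predict(X, y, query, k=3, const=1.0):
    """Weighted-majority kNN prediction for a single query point."""
    d = np.linalg.norm(X - query, axis=1)      # distances to every training point
    nearest = np.argsort(d)[:k]                # indices of the k nearest points
    w = np.maximum(const - d[nearest], 0.0)    # weight fades to 0 as d approaches const
    if w.sum() == 0.0:                         # all neighbours too far: plain majority vote
        w = np.ones(k)
    classes = np.unique(y[nearest])
    totals = [(w[y[nearest] == c].sum(), c) for c in classes]
    return max(totals)[1]                      # class with the largest total weight

X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3, const=1.0))   # -> 0
```

With const=1.0 here, the third-nearest neighbour is far enough away that its weight is clipped to zero, so the k=3 prediction collapses to what k=1 would give, which is exactly the situation described above.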

Or it might be the data. Suppose the predictions from a k=3 model are the same as the predictions from k=1, for the reason I just mentioned. Now suppose that the data set is enlarged so that the data density is greater, which in turn means that the three neighboring points are more likely than before to contribute approximately equally to the predicted value.

Of course, the same applies to the other primary configuration parameters in a kNN algorithm, e.g., the distance metric, dimension scaling, probability distribution, and so on.

Good question, by the way.
