kNN on a large dataset in R

Posted on 2024-12-17 05:44:04

I'm trying to use kNN in R (I have tried several packages: knnflex, class) to predict the probability of default from 8 variables. The dataset has about 100k rows and 8 columns, but my machine struggles even with a sample of 10k rows. Any suggestions for running kNN on a dataset with more than 50 rows (i.e. something bigger than iris)?

EDIT:

To clarify, there are a couple of issues.

1) The examples in the class and knnflex packages are a bit unclear, and I'm curious whether there is an implementation similar to the randomForest package, where you give it the variable you want to predict and the data you want to train the model on:

RF <- randomForest(x, y, ntree, type,...)

and then turn around and use the model to predict on the test data set:

pred <- predict(RF, testData)
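
For comparison, here is a minimal sketch of how `knn()` from the class package is called. As far as I can tell there is no separate fitted model object: the training rows, the test rows and the training labels all go into one call, and the predictions come straight back. The 100/50 split of iris and the choice of `k = 5` are my own assumptions, purely for illustration.

```r
library(class)

set.seed(1)

# Hypothetical 100/50 train/test split of the built-in iris data.
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]      # predictor columns, training rows
test  <- iris[-idx, 1:4]     # predictor columns, test rows
cl    <- iris$Species[idx]   # known labels for the training rows

# One call does "train" and "predict" together.
# prob = TRUE attaches the vote share of the winning class to the result.
pred <- knn(train, test, cl, k = 5, prob = TRUE)

head(pred)                 # predicted classes for the test rows
head(attr(pred, "prob"))   # proportion of the k votes won by that class
```

Note that the `prob` attribute is the vote share of the predicted class, not necessarily the probability of one fixed class such as "default", so it may need converting before it can be used as a default probability.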

2) I don't really understand why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix whose size is roughly proportional to nrow(trainingData)^2, which also seems to be an upper limit on the size of the predicted data. I created a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5000 rows. Thus I need to either:

a) find a way to use more than 5000 rows in a training set, or

b) find a way to use the model on the full 100k rows.
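
To put rough numbers on the memory problem, and to sketch one possible way around option b), the snippet below first estimates the size of a dense n x n matrix of doubles, then predicts a larger test set in blocks so that only one block is handed to `knn()` at a time. The helper names `dist_matrix_gb` and `knn_chunked`, the synthetic data and the chunk size are all my own assumptions. Whether chunking actually helps depends on how the package in question allocates memory, so treat this as a sketch rather than a fix.

```r
library(class)

# Rough size in GB of a dense n x n matrix of doubles (8 bytes per entry).
dist_matrix_gb <- function(n) 8 * n^2 / 1e9
dist_matrix_gb(c(5000, 10000, 100000))   # 0.2, 0.8 and 80 GB

# Hypothetical helper: predict the test rows in blocks of `chunk_size`,
# so each knn() call only sees the training set plus one block.
knn_chunked <- function(train, test, cl, k = 5, chunk_size = 1000) {
  cl <- as.factor(cl)
  starts <- seq(1, nrow(test), by = chunk_size)
  preds <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(test))
    as.character(knn(train, test[rows, , drop = FALSE], cl, k = k))
  })
  factor(unlist(preds), levels = levels(cl))
}

# Synthetic stand-in for the real data: 8 numeric predictors, 2 classes.
set.seed(42)
n_train <- 2000
n_test  <- 5000
train <- matrix(rnorm(n_train * 8), ncol = 8)
test  <- matrix(rnorm(n_test * 8), ncol = 8)
cl    <- factor(ifelse(train[, 1] + rnorm(n_train) > 0, "default", "ok"))

pred <- knn_chunked(train, test, cl, k = 5, chunk_size = 1000)
table(pred)
```

Each test row is classified independently of the others, so splitting the test set into blocks should not change what kNN computes; it only limits how much is processed per call.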
