knn on large datasets in R
I'm trying to use knn in R (I have tried several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k rows by 8 columns, but my machine seems to struggle even with a sample of 10k rows. Any suggestions for doing knn on a dataset of more than 50 rows (i.e. iris)?
EDIT:
To clarify, there are a couple of issues.
1) The examples in the class and knnflex packages are a bit unclear, and I was curious whether there is an implementation similar to the randomForest package, where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I don't really understand why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix whose size is roughly proportional to nrow(trainingData)^2, which also seems to be an upper limit on the size of the predicted data. I created a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5000 rows. Thus I would need to either:
a) find a way to use more than 5000 rows in a training set, or
b) find a way to use the model on the full 100k rows.
Comments (1)
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself. The training data is the model.
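As a minimal sketch of what that looks like in practice, the call below passes the training rows, the test rows and the training labels to knn from class in a single step. The iris split, the seed, k = 5 and the scaling step are illustrative assumptions, not something taken from the question's data.

```r
library(class)  # ships with standard R distributions as a recommended package

set.seed(42)
idx   <- sample(nrow(iris), 100)          # 100 rows for training, 50 held out
train <- scale(iris[idx, 1:4])            # knn is distance based, so scale the predictors
test  <- scale(iris[-idx, 1:4],           # reuse the training centers and scales
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
cl    <- iris$Species[idx]                # class labels of the training rows

# No separate fitting step: training rows, test rows and labels go in together.
pred <- knn(train, test, cl, k = 5, prob = TRUE)

head(pred)                       # predicted classes for the held-out rows
head(attr(pred, "prob"))         # share of neighbour votes won by the predicted class
table(pred, iris$Species[-idx])  # confusion table against the true labels
```

With prob = TRUE, the "prob" attribute holds the proportion of votes for the winning class, which is the closest thing this function gives to a predicted probability.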
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
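To illustrate the "nothing happens in training" point, here is a rough sketch of a randomForest-style fit/predict pair built on top of knn from class. The names knn_fit and knn_sketch are hypothetical helpers invented for this example; this is not how ipred implements its wrappers.

```r
library(class)

# Hypothetical helper: "fitting" only stores the training data, labels and k.
knn_fit <- function(x, y, k = 5) {
  structure(list(x = as.matrix(x), y = as.factor(y), k = k),
            class = "knn_sketch")
}

# All of the work (distance computation and voting) happens at predict time.
predict.knn_sketch <- function(object, newdata, ...) {
  knn(object$x, as.matrix(newdata), object$y, k = object$k, prob = TRUE)
}

set.seed(1)
idx  <- sample(nrow(iris), 100)
fit  <- knn_fit(iris[idx, 1:4], iris$Species[idx], k = 5)  # returns immediately
pred <- predict(fit, iris[-idx, 1:4])                      # the actual knn call
table(pred, iris$Species[-idx])
```

The "model" object here is just a copy of the training set, so it costs nothing to build and all of the cost shows up when predicting.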
As for limits on the number of cases, that depends on how much physical memory you have. If you're getting memory allocation errors, then you need to either reduce your RAM usage elsewhere (close other applications, etc.), buy more RAM, or buy a new computer.
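A quick back-of-the-envelope calculation shows why memory runs out fast if an implementation stores a full n x n matrix of pairwise distances, as the question suggests. That storage pattern is an assumption taken from the asker's description and has not been checked against either package; dist_matrix_gib is a hypothetical helper.

```r
# Size of an n x n matrix of doubles (8 bytes per entry), in GiB.
dist_matrix_gib <- function(n) n^2 * 8 / 2^30

dist_matrix_gib(5000)    # about 0.19 GiB
dist_matrix_gib(10000)   # about 0.75 GiB
dist_matrix_gib(100000)  # about 74.5 GiB
```

Under that assumption, the matrix grows with the square of the row count, so going from 5000 rows to 100k rows multiplies the memory needed by 400.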
The knn function in class runs fine for me with training and test data sets of 10k rows or more, though I do have 8 GB of RAM. Also, I suspect that knn in class will be faster than the one in knnflex, but I haven't done extensive testing.
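If memory is still tight, one option is to classify the test rows in blocks, since each test observation is scored against the training set independently of the others. The sketch below assumes simulated data with 8 predictors and uses a hypothetical helper, knn_chunked; the chunk size and the "default"/"ok" labels are made up for illustration.

```r
library(class)

# Hypothetical helper: classify test rows in blocks of `chunk_size`.
knn_chunked <- function(train, test, cl, k = 5, chunk_size = 1000) {
  starts <- seq(1, nrow(test), by = chunk_size)
  parts <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(test))
    # Each block is scored against the full training set.
    as.character(knn(train, test[rows, , drop = FALSE], cl, k = k))
  })
  factor(unlist(parts), levels = levels(cl))
}

set.seed(7)
n_train <- 2000
n_test  <- 5000
train <- matrix(rnorm(n_train * 8), ncol = 8)  # simulated predictors
test  <- matrix(rnorm(n_test * 8), ncol = 8)
cl    <- factor(ifelse(train[, 1] + rnorm(n_train, sd = 0.5) > 0, "default", "ok"))

pred <- knn_chunked(train, test, cl, k = 5, chunk_size = 1000)
table(pred)
```

Because knn breaks voting ties at random, chunked and unchunked runs can differ on tied cases even though the neighbours found are the same.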