I have been in this situation before, and my suggestion would be to take a step back and look at the problem again.
Does your model absolutely need all of the data at once, or can it be trained in batches? It is also possible that the model you are using can be trained in batches, but the library you are using does not support that. In that situation, either find a library that does support batch training or, if no such library exists (unlikely), "reinvent the wheel" yourself, i.e., implement the model from scratch with batch support. However, as your question mentions, you need to use a model from Scikit-Learn, TensorFlow, or PyTorch. So if you truly want to stick with those libraries, there are techniques such as the ones Alexey Larionov and I'mahdi mentioned in comments to your question in relation to PyTorch and TensorFlow.
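To make the batching idea concrete, here is a minimal sketch using PyTorch's `DataLoader` (not necessarily the exact technique from those comments); the arrays, model architecture, and hyperparameters are placeholders, not values from your question:

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data; substitute your own arrays, or a custom Dataset
# that reads from disk so the full dataset never has to sit in memory.
X = np.random.rand(100_000, 50).astype(np.float32)
y = np.random.rand(100_000, 1).astype(np.float32)

dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
loader = DataLoader(dataset, batch_size=256, shuffle=True)

# Placeholder model: a small fully connected regressor
model = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:  # only one batch of 256 rows is processed per step
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```

TensorFlow offers the same pattern through `tf.data.Dataset.batch`, and some scikit-learn estimators (e.g., `SGDClassifier`) support incremental fitting via `partial_fit`.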
Is all of your data actually relevant? Once, I found that a whole subset of my data was useless for the problem I was trying to solve; another time, I found that a subset was only marginally helpful. Dimensionality reduction, numerosity reduction, and statistical modeling may be your friends here. Here is a link to the Wikipedia page about data reduction:
https://en.wikipedia.org/wiki/Data_reduction
Not only will data reduction cut the amount of memory you need, it can also improve your model. Bad data in means bad results out.
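To illustrate the dimensionality-reduction route, here is a minimal sketch using scikit-learn's `PCA`; the array shape and the 95% variance threshold are placeholder choices, not values from your problem:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder high-dimensional data; substitute your own feature matrix
X = np.random.rand(10_000, 200)

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # far fewer columns, far less memory
```

If `X` itself is too large to load at once, scikit-learn's `IncrementalPCA` can be fitted chunk by chunk with `partial_fit`.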