Effective dataset size for training a feed-forward neural network
I'm using a feed-forward neural network in Python, via the pybrain implementation. For training, I'll be using the back-propagation algorithm. I know that with neural networks we need just the right amount of data so as not to under- or over-train the network. I can get about 1200 different templates of training data for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training?
Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too many sizes. The results were quite good with this last size, but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
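For reference, here is a minimal sketch of the setup described above, assuming the standard pybrain API (buildNetwork, SupervisedDataSet, BackpropTrainer); the sample values are placeholders:

    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer

    # Feed-forward network: 7 inputs, 3 hidden nodes, 1 output.
    net = buildNetwork(7, 3, 1)

    # Dataset with 7-dimensional inputs and 1-dimensional targets.
    ds = SupervisedDataSet(7, 1)
    ds.addSample((0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7), (1.0,))  # placeholder sample

    trainer = BackpropTrainer(net, ds)
    trainer.train()  # one epoch of back-propagation; returns the epoch's error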
3 Answers
It's completely solution-dependent. There's also a bit of art mixed in with the science. The only way to know whether you're in overfitting territory is to regularly test your network against a set of validation data (that is, data you do not train with). When performance on that set begins to drop, you've probably trained too far: roll back to the last iteration.
"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.
The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.
You should customize your dataset to include and reinforce the data you want the network to learn.
After you have crafted this custom dataset, you have to start experimenting with the number of samples, as it is completely dependent on your problem.
For example: if you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that have no peaks. That is why customizing your training dataset matters, no matter how many samples you have.
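To illustrate that curation step, here is a sketch that rebalances a hypothetical peak-detection dataset; samples, has_peak and the proportions are made-up names for the example:

    import random

    peaks    = [s for s in samples if s.has_peak]      # rare, informative cases
    no_peaks = [s for s in samples if not s.has_peak]  # abundant background

    # Keep only enough peak-free signals for contrast, and reinforce
    # the peak examples by oversampling them.
    curated  = random.sample(no_peaks, min(len(no_peaks), 2 * len(peaks)))
    curated += peaks * 3
    random.shuffle(curated)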
Technically speaking, in the general case, and assuming all examples are correct, more examples are always better. The real question is: what is the marginal improvement (the first derivative of answer quality)?
You can test this by training with 10 examples, checking quality (say 95%), then 20, and so on, to get a table of training-set size versus quality.
You can then clearly see your marginal gains and make your decision accordingly.
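Here is a sketch of that size sweep, assuming a hypothetical evaluate(subset) helper that trains a fresh network on the subset and returns its quality on a fixed held-out test set:

    # full_dataset and evaluate() are placeholders for your own data
    # and training/evaluation routine.
    for size in (10, 20, 50, 100, 200, 500, 1000):
        subset = full_dataset[:size]
        quality = evaluate(subset)
        print('%5d examples -> %.1f%% quality' % (size, 100 * quality))
    # Once consecutive rows stop improving meaningfully, the marginal
    # value of additional data has flattened out.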