Effective dataset size for training a feed-forward neural network

Posted 2024-09-30 20:19:09


I'm using a feed-forward neural network in Python, with the pybrain implementation. For the training, I'll be using the back-propagation algorithm. I know that with neural networks we need to have just the right amount of data in order not to under- or over-train the network. I could get about 1200 different templates of training data for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training?

Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too many sizes. The results were quite good with this last size, but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
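
For reference, a minimal sketch of this kind of setup in pybrain (7 inputs, 3 hidden nodes, 1 output) might look like the following; the random samples are only placeholders for the ~1200 real training templates:

```python
# Hypothetical sketch, not the poster's actual code: a 7-3-1 network trained
# with backpropagation in pybrain. The random data stands in for real templates.
import random

from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

net = buildNetwork(7, 3, 1)            # 7 inputs, 3 hidden units, 1 output
ds = SupervisedDataSet(7, 1)

# Replace this loop with your ~1200 real training templates.
for _ in range(1200):
    sample = [random.random() for _ in range(7)]
    target = [sum(sample) / 7.0]       # dummy target
    ds.addSample(sample, target)

trainer = BackpropTrainer(net, ds)
for epoch in range(100):
    error = trainer.train()            # one backprop pass over the dataset
    print(epoch, error)
```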

Comments (3)

西瓜 2024-10-07 20:19:09

How do I calculate the optimal amount of data for my training?

It's completely solution-dependent. There's also a bit of art with the science. The only way to know if you're into overfitting territory is to be regularly testing your network against a set of validation data (that is data you do not train with). When performance on that set of data begins to drop, you've probably trained too far -- roll back to the last iteration.

The results were quite good with this last size but I would like to find the optimal amount.

"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.

青萝楚歌 2024-10-07 20:19:09

The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.

You should customize your dataset to include and reinforce the data you want the network to learn.

After you have crafted this custom dataset you have to start playing with the amount of samples, as it is completely dependent on your problem.

For example: if you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that do not have peaks. Hence the importance of customizing your training dataset, no matter how many samples you have.
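
To make that concrete, here is a hypothetical helper (the `signals` list and the `has_peak` predicate are placeholders, not from the original post) that balances peak and non-peak examples before they reach the network:

```python
# Hypothetical sketch: build a class-balanced dataset for peak detection.
import random

def build_balanced_dataset(signals, has_peak, peak_ratio=0.5):
    """Return (signal, label) pairs with roughly `peak_ratio` peak examples."""
    peaks = [s for s in signals if has_peak(s)]
    flats = [s for s in signals if not has_peak(s)]
    n_flats = int(len(peaks) * (1 - peak_ratio) / peak_ratio)
    chosen_flats = random.sample(flats, min(n_flats, len(flats)))
    dataset = [(s, 1) for s in peaks] + [(s, 0) for s in chosen_flats]
    random.shuffle(dataset)
    return dataset
```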

只涨不跌 2024-10-07 20:19:09

Technically speaking, in the general case, and assuming all examples are correct, then more examples are always better. The question really is, what is the marginal improvement (first derivative of answer quality)?

You can test this by training it with 10 examples, checking quality (say 95%), then 20, and so on, to get a table like:

10 95%
20 96%
30 96.5%
40 96.55%
50 96.56%

You can then clearly see your marginal gains and make your decision accordingly.
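
A sketch of how such a table could be produced with pybrain, assuming `ds` is the full `SupervisedDataSet` from the question: train a fresh network on growing subsets under a fixed epoch budget and record the error on one held-out test set. (The quality percentages above would require your own accuracy metric; `testOnData` reports an error, where lower is better.)

```python
# Sketch of a learning curve: error on a fixed test set vs. training-set size.
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

train_ds, test_ds = ds.splitWithProportion(0.8)       # fixed held-out test set

for size in (10, 20, 50, 100, 200, 500):
    subset = SupervisedDataSet(7, 1)
    for i in range(min(size, len(train_ds))):
        subset.addSample(train_ds['input'][i], train_ds['target'][i])

    net = buildNetwork(7, 3, 1)                        # fresh network each run
    trainer = BackpropTrainer(net, subset)
    trainer.trainEpochs(50)                            # fixed training budget
    print(size, trainer.testOnData(test_ds))           # error on held-out data
```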
