Neural network test cases

Posted 2024-08-31 05:07:20 · 185 characters · 11 views · 0 comments

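With sophisticated neural networks, can increasing the amount of ~~test cases~~ training data lead to problems such as over-fitting?
Is it always good to increase the amount of ~~test cases~~ training data? Will that always lead to convergence?
If not, what are those cases? An example would be appreciated.

Thanks,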


似狗非友 2024-09-07 05:07:20

When you say "test cases" I'm going to assume you're talking about data instances.

Let's look at several scenarios:

Anisotropy

Suppose you have a training data set with 1000 instances that are all very similar to each other, but the instances in your qualification data set are significantly different from your training data. For example, suppose you are trying to estimate the function y = mx + b.

Suppose that some of your data set provides samples that help you estimate m and the rest helps you estimate b. If you give your neural network 1000 samples that help estimate b but only 5 samples that help estimate m, the network will perform really poorly when it comes to estimating m. You will be over-fitting your neural network, and adding more samples that help you estimate b will not help at all.
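The imbalance described above can be made concrete with a small sketch. It swaps the neural network for ordinary least squares, where the uncertainty of the slope has a closed form, and it assumes that "samples that help estimate b" means inputs clustered around x = 0 while "samples that help estimate m" means inputs spread far apart. The ranges, the sample counts and the name slope_std_error are illustrative choices, not part of the original answer.

```python
import numpy as np

def slope_std_error(x, noise_std=1.0):
    """Standard error of the least-squares slope for inputs x,
    assuming independent noise with standard deviation noise_std."""
    x = np.asarray(x, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)  # spread of the inputs
    return noise_std / np.sqrt(sxx)

# Assumption: inputs near x = 0 mostly pin down the intercept b.
clustered_1k = np.linspace(-0.01, 0.01, 1_000)
clustered_100k = np.linspace(-0.01, 0.01, 100_000)
# Assumption: a handful of widely spread inputs pin down the slope m.
spread_5 = np.linspace(-10.0, 10.0, 5)

se_clustered_1k = slope_std_error(clustered_1k)
se_clustered_100k = slope_std_error(clustered_100k)
se_with_spread = slope_std_error(np.concatenate([clustered_1k, spread_5]))

print(f"1,000 clustered samples:          {se_clustered_1k:.3f}")
print(f"100,000 clustered samples:        {se_clustered_100k:.3f}")
print(f"1,000 clustered + 5 spread:       {se_with_spread:.3f}")
```

Under these assumptions, five well-placed samples reduce the slope uncertainty far more than a hundredfold increase in samples of the kind the data set already has.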

Isotropy

Now suppose that you have a proportional distribution (note that I didn't say equal) of data instances in your data set. You want them to be proportional because you might need more data instances to estimate m than to estimate b. Now your data is relatively homogeneous, and adding more samples gives you more opportunities to make a better estimate of the function. With y = mx + b you can technically have an infinite number of data instances (since the line is infinite in both directions) and more of them will probably help, but there is a point of diminishing returns.

Diminishing Returns

With the y = mx + b example you could have an infinite number of data instances, but if you can estimate the function with 1,000 instances then adding 100,000 more data instances to your data set might not be useful. At some point adding more instances no longer results in a better fit, hence the diminishing returns.
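A quick simulation illustrates the point. This is a minimal sketch that fits the line with np.polyfit rather than a neural network; the true values of m and b, the noise level, the input range and the sample sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_M, TRUE_B = 2.0, -1.0  # hypothetical ground truth, for illustration only

def fit_line(n):
    """Fit y = m*x + b by least squares on n noisy, evenly distributed samples."""
    x = rng.uniform(-10.0, 10.0, size=n)
    y = TRUE_M * x + TRUE_B + rng.normal(0.0, 1.0, size=n)
    m_hat, b_hat = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    return m_hat, b_hat

errors = {}
for n in (10, 100, 1_000, 101_000):
    m_hat, b_hat = fit_line(n)
    errors[n] = (abs(m_hat - TRUE_M), abs(b_hat - TRUE_B))
    print(f"n={n:>7}: |m error|={errors[n][0]:.5f}  |b error|={errors[n][1]:.5f}")
```

The estimation error shrinks roughly with the square root of the sample count, so going from 1,000 to 101,000 instances buys only about a tenfold improvement on an estimate that is already close.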

Now suppose that you're trying to estimate a boolean function like XOR:

A    B   A XOR B
1    1      0
1    0      1
0    1      1
0    0      0

In this case you simply can't add more data, and it wouldn't make sense to try: there are only four valid data instances and that's ALL you have. With this example there is no point in adding more data instances at all.
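As a small illustration, the sketch below enumerates every valid XOR instance and shows that repeating them only produces duplicates. The variable names and the repetition factor of 250 are arbitrary.

```python
from itertools import product

import numpy as np

# Every valid (A, B, A XOR B) row; the input space has exactly 2 * 2 combinations.
xor_table = np.array([(a, b, a ^ b) for a, b in product((0, 1), repeat=2)])

# "Adding more data" can only repeat rows that are already present.
bigger = np.tile(xor_table, (250, 1))
distinct = np.unique(bigger, axis=0)

print(len(bigger), len(distinct))  # prints: 1000 4
```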

Conclusion

In general, whether adding more data instances helps depends directly on your problem: some problems benefit from more data instances and others might suffer. You have to analyze your data set, and you might have to transform it so that your samples are more representative of the real-world data. You have to study the problem you're trying to solve, understand its domain, understand the data samples it has, and plan accordingly. There is no one-size-fits-all solution in machine learning/artificial intelligence.


在巴黎塔顶看东京樱花 2024-09-07 05:07:20

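Over-fitting arises when the network is built with too many neurons, so that during training the network adjusts to the training data "too well". In other words, it is like fitting a polynomial of degree n to m data points, where n is greater than or close to m.
With so many degrees in the function, the fit to the data will be better, but that does not mean the resulting curve is the best one. The same thing happens with neural networks: the relationship between the number of neurons and the error is shaped like a smile (roughly U-shaped).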
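The polynomial analogy can be checked numerically. The sketch below is an illustration under assumed settings: a hypothetical linear ground truth, Gaussian noise, m = 10 training points, and results averaged over 200 random trials. It compares a degree-1 fit with a degree m - 1 fit using numpy.polynomial.Polynomial.fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
M_POINTS = 10     # m: number of training points (illustrative)
NOISE_STD = 0.3   # assumed noise level

def true_fn(x):
    return 2.0 * x - 1.0  # hypothetical ground truth

def one_trial(degree):
    """Fit a polynomial of the given degree; return (train MSE, held-out MSE)."""
    x_train = np.linspace(-1.0, 1.0, M_POINTS)
    y_train = true_fn(x_train) + rng.normal(0.0, NOISE_STD, M_POINTS)
    x_test = rng.uniform(-1.0, 1.0, 200)
    y_test = true_fn(x_test) + rng.normal(0.0, NOISE_STD, 200)
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    return train_mse, test_mse

def average_mse(degree, trials=200):
    """Average the train and held-out MSE over several independent trials."""
    results = np.array([one_trial(degree) for _ in range(trials)])
    return results.mean(axis=0)

train_low, test_low = average_mse(degree=1)
train_high, test_high = average_mse(degree=M_POINTS - 1)

print(f"degree 1: train MSE={train_low:.4f}  held-out MSE={test_low:.4f}")
print(f"degree {M_POINTS - 1}: train MSE={train_high:.2e}  held-out MSE={test_high:.4f}")
```

With n = m - 1 the polynomial passes through every training point, so its training error is essentially zero, yet on average its held-out error is worse than that of the simple line.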

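There is no proof that more data leads to more error, but some work pre-analyzes the data with principal components to capture better relationships.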
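For readers unfamiliar with that kind of pre-analysis, here is a minimal sketch of principal component analysis done with a plain SVD in NumPy. The synthetic data set (three noisy copies of one latent variable) and the choice to keep a single component are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three features that are noisy multiples of one latent variable.
latent = rng.normal(size=(500, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.05 * rng.normal(size=(500, 3))

# PCA: center the data, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_ratio = S**2 / np.sum(S**2)

# Project onto the first principal component before feeding a model.
X_reduced = X_centered @ Vt[:1].T

print(np.round(explained_ratio, 4), X_reduced.shape)
```

Here almost all of the variance sits in the first component, so the three correlated inputs could be replaced by one before training.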
