Optimizing the input data of a neural network

Posted 2024-08-16 18:35:24


I'm trying to build an app to detect images which are advertisements on webpages. Once I detect those, I'll not allow them to be displayed on the client side.

Basically I'm using the back-propagation algorithm to train the neural network, using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.

But in that dataset the number of attributes is very high. In fact, one of the mentors of the project told me that if you train the neural network with that many attributes, it will take a lot of time to get trained. So is there a way to optimize the input dataset? Or do I just have to use that many attributes?
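For reference, this is a minimal sketch of how I load the data, just to confirm the dimensionality I'm dealing with (it assumes the "ad.data" file from the UCI page has been downloaded locally and that pandas is available):

```python
# Minimal sketch: load the UCI Internet Advertisements data and check its shape.
# Assumes the "ad.data" file from the dataset page is in the working directory.
import pandas as pd

# The file has no header row; missing values appear as "?" padded with spaces.
df = pd.read_csv("ad.data", header=None, na_values=["?"], skipinitialspace=True)

X = df.iloc[:, :-1]   # 1558 attributes
y = df.iloc[:, -1]    # "ad." / "nonad." label

print(X.shape)        # expected: (3279, 1558)
print(y.value_counts())
```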


Comments (3)

各自安好 2024-08-23 18:35:24


1558 is actually a modest number of features/attributes. The number of instances (3279) is also small. The problem is not on the dataset side, but on the training-algorithm side.

ANNs are slow to train, so I'd suggest using logistic regression or an SVM. Both of them are very fast to train; in particular, there are many fast algorithms for SVMs.

In this dataset you are actually analyzing text, not images. I think a classifier from the linear family, i.e. logistic regression or an SVM, is better suited to your job.

If you are using this in production and cannot use open-source code, logistic regression is very easy to implement compared to a good ANN or SVM.

If you decide to use logistic regression or an SVM, I can further recommend some articles or source code for you to refer to.
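To make this concrete, here is a rough sketch of the kind of linear baseline I have in mind, using scikit-learn purely for illustration (X and y are assumed to already hold the 1558 numeric features and the binary labels, with missing values imputed):

```python
# Sketch of a linear baseline, assuming X (n_samples x 1558 numeric features)
# and y (binary "ad."/"nonad." labels) are already loaded and cleaned.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Logistic regression: trains in seconds even with 1558 attributes.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print("logistic regression accuracy:", logreg.score(X_test, y_test))

# Linear SVM: also very fast on a dataset of this size.
svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
svm.fit(X_train, y_train)
print("linear SVM accuracy:", svm.score(X_test, y_test))
```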

夜灵血窟げ 2024-08-23 18:35:24


If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then training time is the least of your problems: even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558 × 10 = 15,580 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15,580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account.)

You have to analyze your data to find out how to optimize it. Try to understand your input data: which (tuples of) features are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good starting point for this.) Don't expect the artificial neural network to do that work for you.
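For the PCA step, a rough sketch with scikit-learn (X is assumed to be the 3279 × 1558 feature matrix with missing values already handled; the 95% variance threshold is just an illustrative choice):

```python
# Sketch: compress the 1558 attributes with PCA before training anything.
# Assumes X is the (3279, 1558) feature matrix with missing values handled.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

# Keep just enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("reduced shape:", X_reduced.shape)   # far fewer than 1558 columns
```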

Also: remember Duda and Hart's famous "no free lunch" theorem: no classification algorithm works for every problem, and for any classification algorithm X there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding which algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart, and Stork's book on pattern classification is a great starting point to learn about this, if you haven't read it yet.)

后来的我们 2024-08-23 18:35:24


Apply a separate ANN for each category of features, for example:

457 inputs, 1 output for the url terms (ANN1)
495 inputs, 1 output for origurl (ANN2)
...

Then train all of them and use another main ANN to join the results (a rough sketch follows).
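A rough sketch of this idea (the split into X_url and X_origurl, and the use of scikit-learn's MLPClassifier, are assumptions made here purely for illustration):

```python
# Sketch: one small ANN per feature category, plus a main ANN that joins them.
# Assumes X_url (n x 457 url terms), X_origurl (n x 495 origurl terms) and y exist.
import numpy as np
from sklearn.neural_network import MLPClassifier

ann1 = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
ann2 = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
ann1.fit(X_url, y)
ann2.fit(X_origurl, y)

# Each sub-network contributes one probability; the main ANN joins them.
# (For a fair evaluation these should be out-of-fold predictions to avoid leakage.)
meta_features = np.column_stack([
    ann1.predict_proba(X_url)[:, 1],
    ann2.predict_proba(X_origurl)[:, 1],
    # ... one column per remaining feature category
])

main_ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
main_ann.fit(meta_features, y)
print("training accuracy of the joined model:", main_ann.score(meta_features, y))
```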
