SVM classification - minimum number of input sets for each class
I'm trying to build an app that detects advertisement images on web pages. Once I detect them, I won't allow them to be displayed on the client side.
From the help I got on this Stackoverflow question, I concluded that an SVM is the best approach for my goal.
So I have coded an SVM and SMO myself. The dataset I got from the UCI data repository has 3280 instances (link to dataset), of which around 400 belong to the class representing advertisement images and the rest represent non-advertisement images.
Right now I'm taking the first 2800 input sets and training the SVM on them. But after looking at the accuracy rate, I realised that most of those 2800 input sets come from the non-advertisement class, so I'm getting very good accuracy for that class only.
So what can I do here? Roughly how many input sets should I give the SVM for training, and how many of them from each class?
Thanks. Cheers. (I made this a new question because the context is different from my previous question, "Optimization of Neural Network input data".)
Thanks for the reply.
I want to check whether I'm deriving the C values for the ad and non-ad classes correctly.
Please give me feedback on this.
Or you can see the doc version here.
You can see the graph for y1 equal to y2 here
and for y1 not equal to y2 here.
Comments (2)
There are two ways of going about this. One would be to balance the training data so that it includes an equal number of advertisement and non-advertisement images. This could be done by either oversampling the 400 advertisement images or undersampling the thousands of non-advertisement images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertisement images and create a training set with the 400 ad images and 400 randomly selected non-advertisements.
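As a rough illustration of the undersampling idea, here is a minimal Python sketch. The function name `undersample`, the +1/-1 label encoding, and the toy data are assumptions made for the example (the counts follow the approximate 400 versus roughly 2880 split described in the question); this is not the asker's code.

```python
import random

def undersample(samples, labels, minority_label, seed=0):
    """Return a balanced (samples, labels) pair by randomly dropping majority rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Keep every minority row plus an equally sized random subset of the majority.
    kept = minority + rng.sample(majority, min(len(minority), len(majority)))
    rng.shuffle(kept)  # avoid presenting one class after the other
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy stand-in for the real data: 400 ads (+1) and 2880 non-ads (-1).
labels = [1] * 400 + [-1] * 2880
samples = [[float(i)] for i in range(len(labels))]

X_bal, y_bal = undersample(samples, labels, minority_label=1)
print(len(y_bal), y_bal.count(1), y_bal.count(-1))  # 800 400 400
```

The rows that were left out are still useful: they can go into the test set, as long as accuracy is reported per class rather than as a single overall figure.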
The other solution would be to use a weighted SVM, so that margin errors for the ad images are weighted more heavily than those for non-ads. For the libSVM package this is done with the -wi flag. From your description of the data, you could try weighting the ad images about 7 times more heavily than the non-ads.
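For a hand-written SMO, the weighting amounts to giving each class its own C: libSVM's -wi option sets the C of class i to weight*C. The sketch below is one way to derive those values and the resulting clipping bounds for a pair of multipliers. The helper names `class_costs` and `clip_bounds`, the +1/-1 labels, and the choice of the class ratio as the weight are assumptions for illustration; the bounds are the usual SMO box constraints with C replaced by a per-point cap, not necessarily the derivation in the asker's document.

```python
def class_costs(labels, base_c=1.0, pos_label=1):
    """Return (C_pos, C_neg), scaling the rarer positive class by the class ratio."""
    n_pos = sum(1 for y in labels if y == pos_label)
    n_neg = len(labels) - n_pos
    return base_c * n_neg / n_pos, base_c

def clip_bounds(a1, a2, y1, y2, c1, c2):
    """Feasible interval [low, high] for the updated a2 in one SMO step.

    c1 and c2 are the caps of the two points, i.e. the C value of each
    point's class, so that 0 <= a1 <= c1 and 0 <= a2 <= c2 stay satisfied.
    """
    if y1 != y2:
        # a1 - a2 is conserved along the constraint line
        low = max(0.0, a2 - a1)
        high = min(c2, c1 + a2 - a1)
    else:
        # a1 + a2 is conserved along the constraint line
        low = max(0.0, a1 + a2 - c1)
        high = min(c2, a1 + a2)
    return low, high

labels = [1] * 400 + [-1] * 2880          # approximate counts from the question
c_pos, c_neg = class_costs(labels)        # roughly 7.2 and 1.0
print(c_pos, c_neg)
print(clip_bounds(0.5, 0.3, 1, -1, c_pos, c_neg))   # ad paired with non-ad
print(clip_bounds(0.5, 0.3, -1, -1, c_neg, c_neg))  # two non-ads
```

With equal caps (c1 == c2 == C) both branches reduce to the unweighted SMO bounds, which is a quick sanity check for any derivation of the weighted case.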
The required size of your training set depends on the sparseness of the feature space. As far as I can see, you have not said which image features you have chosen to use. Before you can train, you need to convert each image into a vector of numbers (features) that describes the image, hopefully capturing the aspects that you care about.
Oh, and unless you are reimplementing SVM for sport, I'd recommend just using libsvm.