How does Google recognize adult content with SafeSearch?

Posted on 2024-10-10 07:09:56

I am creating a search engine (for study purposes) and I want to know how Google recognizes adult content and images with SafeSearch (http://en.wikipedia.org/wiki/Safesearch).

The programming language doesn't matter; I only want to know the general approach, in language-agnostic terms.

Comments (3)

遇到 2024-10-17 07:09:56

If the rules for any sort of content filter fell into the hands of people trying to get that content through the filter, the filter would become ineffective.

So I imagine that Google's rules (1) are not publicly available and (2) change frequently.

That said, starting with a small blacklist of adult sites and following outgoing links (and/or finding sites that link to the blacklisted sites) probably finds a huge number of adult sites. But by no means all of them, so you'd want some sort of text processing and image recognition algorithms in addition.
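The blacklist idea can be sketched as a breadth-first walk over a link graph. This is only an illustration, not how Google does it: the function name `expand_blacklist`, the `max_hops` parameter and the `.example` domains are all made up here, and the link graph is assumed to be already available from a crawl.

```python
from collections import deque

def expand_blacklist(seed, outlinks, max_hops=1):
    """Return {site: hop distance} for sites within max_hops links of the seed,
    following links in either direction."""
    # Build reverse edges so sites that link *to* a flagged site are found too.
    inlinks = {}
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks.setdefault(dst, set()).add(src)

    suspects = {site: 0 for site in seed}
    queue = deque(seed)
    while queue:
        site = queue.popleft()
        hops = suspects[site]
        if hops == max_hops:
            continue  # do not expand past the hop limit
        neighbours = set(outlinks.get(site, ())) | inlinks.get(site, set())
        for other in neighbours:
            if other not in suspects:
                suspects[other] = hops + 1
                queue.append(other)
    return suspects

# Hypothetical crawl result: site -> sites it links to.
LINKS = {
    "adult-a.example": ["adult-b.example", "news.example"],
    "adult-c.example": ["adult-a.example"],
    "news.example": ["weather.example"],
}
suspects = expand_blacklist(["adult-a.example"], LINKS, max_hops=1)
```

Note that `news.example` ends up in the result simply because a seed site links to it. The output is a list of candidates to check with other signals, not a verdict.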

NOTE: A popular theory is that adult content providers pay people to ask questions on stackoverflow.com so that Jon Skeet and Marc Gravell will have less time to update the SafeSearch filters. However, it is easily shown that Jon and Marc answer questions at such a high rate that any such strategy would not be economically viable.

楠木可依 2024-10-17 07:09:56

Ben's answer is correct on all points, but I would like to add my own considerations.

About image recognition: given a large set of images, you will find it fairly easy to identify objects such as naked breasts, penises and the like inside them using pattern recognition.

All artificial intelligence algorithms, however, have weak points. You may find that a certain percentage of your images is misclassified, depending on the quality of the classifier used.

So you have to apply criteria beyond image processing. Google's criteria are surely not public, but you might consider ICRA tags (used for voluntarily marking material as adult), text processing, and cross-domain links. If I were the creator of SafeSearch, I would adopt the following pattern: adult sites often exchange links, so you'll find lots of intersections in the link graphs between a group of adult sites.
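One rough way to measure that link-exchange pattern is sketched below. The two helper functions, the graph and the `.example` domains are hypothetical, and a real crawler would hold a far larger graph.

```python
def link_overlap_score(site, outlinks, known_adult):
    """Fraction of a site's outgoing links that point at known adult sites."""
    targets = set(outlinks.get(site, ())) - {site}  # ignore self-links
    if not targets:
        return 0.0
    return len(targets & known_adult) / len(targets)

def reciprocal_partners(site, outlinks):
    """Sites that `site` links to and that link back (a link exchange)."""
    return {
        other
        for other in outlinks.get(site, ())
        if other != site and site in outlinks.get(other, ())
    }

# Hypothetical link graph: site -> sites it links to.
GRAPH = {
    "x.example": ["a.example", "b.example", "blog.example"],
    "a.example": ["x.example", "b.example"],
    "b.example": ["a.example"],
    "blog.example": [],
}
KNOWN_ADULT = {"a.example", "b.example"}

overlap = link_overlap_score("x.example", GRAPH, KNOWN_ADULT)  # 2 of 3 links
partners = reciprocal_partners("x.example", GRAPH)
```

A high overlap or many reciprocal partners inside a known adult group is only one signal among several and would need to be weighed against the others.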

Putting it all together, a good classification approach uses several smaller criteria, scoring them to determine whether an image is an adult image or not.
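A minimal sketch of such a combined score, assuming each criterion has already been reduced to a number between 0 and 1. The signal names, weights and threshold are invented for illustration and would have to be tuned on labelled data.

```python
def adult_score(signals, weights):
    """Weighted average of per-criterion scores, each expected in [0, 1].

    Only criteria present in `signals` contribute, so a missing signal
    (for example, a page without images) does not drag the score down.
    """
    total_weight = sum(weights[name] for name in signals)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * value for name, value in signals.items())
    return weighted / total_weight

# Assumed weights and threshold, not taken from any real system.
WEIGHTS = {"image": 0.4, "text": 0.3, "links": 0.2, "label": 0.1}
THRESHOLD = 0.5

signals = {"image": 0.9, "text": 0.6, "links": 0.5, "label": 0.0}
score = adult_score(signals, WEIGHTS)
is_adult = score >= THRESHOLD
```

A weighted average is the simplest possible combiner. A trained model could learn the weights instead of having them set by hand.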

层林尽染 2024-10-17 07:09:56

Possibly in a similar way to how spam is filtered.

The first step is to create a training set based on known adult sites and extract features from it. These could be keywords, colors used in images, domain name structure, whois details, whatever. Anything that could in some way be specifically different for adult content as compared to non-adult content.

The next step is to apply some sort of statistical model to those features. Bayesian models seem to work well for spam, but may not work as well for adult content.

Support vector machines seem like a good fit, but they are a lot more complex and I'm not really familiar with them myself.
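As a sketch of what feature extraction might look like for two of those feature types (keywords and domain name structure): the keyword list, feature names and URL below are hypothetical, and image colors or whois details would need extra libraries or lookups not shown here.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; a real one would be derived from the training set.
KEYWORDS = {"xxx", "porn", "adult"}

def extract_features(url, text):
    """Turn a page's URL and visible text into a small feature dictionary."""
    host = urlparse(url).hostname or ""
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for w in words if w in KEYWORDS)
    return {
        "keyword_ratio": hits / len(words) if words else 0.0,
        "keyword_in_domain": any(k in host for k in KEYWORDS),
        "domain_digits": sum(ch.isdigit() for ch in host),
        "domain_hyphens": host.count("-"),
    }

features = extract_features(
    "http://free-xxx-4u.example/page",
    "Adult videos and XXX pictures",
)
```

Running the same function over known non-adult pages gives the contrast a classifier needs to learn from.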
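For reference, here is a small multinomial naive Bayes text classifier of the kind used in spam filters, with add-one smoothing. It is a toy sketch: the class name, labels and training sentences are made up, and it says nothing about how well the approach works on real adult content.

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_posteriors(self, text):
        """Unnormalised log P(label | text) for every trained label."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # prior
            for w in words:
                # Add-one (Laplace) smoothing keeps unseen words from
                # zeroing out the whole product.
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_posteriors(text)
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train("adult", "hot xxx videos")
nb.train("adult", "xxx pictures free")
nb.train("safe", "weather forecast today")
nb.train("safe", "free cooking videos")

label_a = nb.classify("xxx videos")
label_b = nb.classify("weather today")
```

Working in log space avoids floating-point underflow when a document contains many words.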
