What is the best way to programmatically detect porn images?
Akismet does an amazing job at detecting spam comments. But comments are not the only form of spam these days. What if I wanted something like Akismet to automatically detect porn images on a social networking site that allows users to upload their pics, avatars, etc.?
There are already a few image-based search engines, as well as face-recognition tools, available, so I'm assuming this wouldn't be rocket science and could be done. However, I have no clue how that stuff works or how I should go about it if I want to develop it from scratch.
How should I get started?
Are there any open-source projects for this going on?
Short answer: use a moderator ;)
Long answer: I don't think there's a project for this, because what is porn? Just legs, full nudity, midgets, etc. It's subjective.
Add a "report offensive" link and store the MD5 (or other hash) of the offending image so that it can be flagged automatically in the future.
How cool would it be if somebody had a large public database of image MD5s, along with descriptive tags, running as a web service? A lot of porn isn't original work (in that the person who has it now probably didn't make it), and the popular images tend to float around different places, so this could really make a difference.
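The hash-blacklist idea above takes only a few lines; the function names and the in-memory set here are illustrative assumptions, since a real site would back this with a database table:

```python
import hashlib

# Hypothetical in-memory blacklist; a real site would keep these hashes in a
# database table, inserted whenever a moderator confirms an image is offensive.
OFFENDING_HASHES = set()

def image_hash(data: bytes) -> str:
    """Digest identifying these exact image bytes."""
    return hashlib.md5(data).hexdigest()

def flag_image(data: bytes) -> None:
    """Record an image that a moderator marked as offensive."""
    OFFENDING_HASHES.add(image_hash(data))

def is_known_offender(data: bytes) -> bool:
    """True if these exact bytes were flagged before."""
    return image_hash(data) in OFFENDING_HASHES
```

Note that any re-encode, resize, or crop changes the MD5, so this only catches exact re-uploads; a perceptual hash (e.g. pHash) would be more robust against trivial edits.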
If you really have time and money:
One way of doing it is to 1) write an image-detection algorithm to determine whether an object is human. This can be done by bitmasking an image to retrieve its "contours" and checking whether the contours fit a human outline. 2) Data-mine a large set of porn images and use data-mining techniques such as the C4.5 algorithm or particle swarm optimization to learn to detect patterns that match porn images.
This will require you to identify what the contours of a naked man's or woman's body must look like in digitized form (this can be achieved in the same way OCR image-recognition algorithms work).
Hope you have fun! :-)
Seems to me like the main obstacle is defining a "porn image". If you can define it easily, you could probably write something that would work. But even humans can't agree on what is porn. How will the application know? User moderation is probably your best bet.
I've seen a web-filtering application that does porn-image filtering; sorry, I can't remember the name. It was pretty prone to false positives, but it worked most of the time.
I think the main trick is detecting "too much skin in the picture" :)
Detecting porn images is still very much a hard AI task and largely theoretical for now.
Harvest collective power and human intelligence by adding a "Report spam/abuse" button/link. Or employ several moderators to do this job.
P.S. Really surprised how many people ask questions assuming software and algorithms are all-mighty, without even considering whether what they want can be done. Are they representatives of that new breed of programmers who have no understanding of hardware, low-level programming, and all that "magic behind"?
P.S. #2. I also remember that periodically a situation arises where people themselves cannot decide whether a picture is porn or art, and it gets taken to court. Even after the court rules, chances are half the people will consider the decision wrong. The last silly situation of the kind was quite recent, when a Wikipedia page got banned in the UK because of a CD cover image featuring some nudity.
Two options I can think of (though neither of them detects porn programmatically):
The BrightCloud web service API is perfect for this. It's a REST API for doing website lookups just like this. It contains a very large and very accurate web-filtering DB, and one of the categories, Adult, has over 10M porn sites identified!
I've heard about tools using a very simple but quite effective algorithm: it calculates the relative number of pixels whose color is close to some predefined "skin" tones. If that fraction is higher than some predefined threshold, the image is considered to be of erotic/pornographic content. Of course, this algorithm gives false positives for close-up face photos and many other things.
Since you are writing about social networking, there will be lots of "normal" photos with a high amount of skin color in them, so you shouldn't use this algorithm to reject every picture with a positive result. But you can use it to give moderators some help, for example by flagging these pictures with higher priority, so a moderator who wants to check new pictures for pornographic content can start with them.
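The skin-ratio heuristic described above might look roughly like this. The RGB rule and the 40% threshold are assumptions borrowed from common skin-detection heuristics, not the actual tool's values:

```python
def is_skin(r: int, g: int, b: int) -> bool:
    """One commonly cited RGB skin rule (an assumption, not the tool's own)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)

def skin_fraction(pixels) -> float:
    """Fraction of (r, g, b) tuples that look like skin."""
    if not pixels:
        return 0.0
    return sum(1 for r, g, b in pixels if is_skin(r, g, b)) / len(pixels)

def flag_for_moderation(pixels, threshold: float = 0.4) -> bool:
    """Bump the image up the moderation queue instead of rejecting it."""
    return skin_fraction(pixels) >= threshold
```

As the answer suggests, a positive result should only raise the image's priority in the moderation queue, never auto-reject it.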
This one looks promising. Basically they detect skin (with calibration by recognizing faces) and determine "skin paths" (i.e. measuring the proportion of skin pixels vs. face skin pixels / skin pixels). This has decent performance.
http://www.prip.tuwien.ac.at/people/julian/skin-detection
Look at the file name and any attributes. There's nowhere near enough information there to detect even 20% of naughty images, but a simple keyword blacklist would at least detect images with descriptive labels or metadata. Twenty minutes of coding for a 20% success rate isn't a bad deal, especially as a prescreen that can catch some easy ones before you pass the rest to a moderator for judging.
The other useful trick is the opposite, of course: maintain a whitelist of image sources to allow without moderation or checking. If most of your images come from known-safe uploaders or sources, you can just accept them blindly.
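Both tricks fit in one small routing function; the keyword list, source names, and return labels below are invented for illustration:

```python
import re

# Illustrative keyword list; a real deployment would curate far more terms.
BLACKLIST = {"xxx", "porn", "nude", "nsfw"}
WHITELISTED_SOURCES = {"staff-uploads", "stock-photos"}  # trusted origins

def prescreen(filename: str, source: str = "") -> str:
    """Return 'allow', 'flag', or 'review' for a newly uploaded image."""
    if source in WHITELISTED_SOURCES:
        return "allow"          # trusted uploader: skip moderation entirely
    words = set(re.split(r"[\W_]+", filename.lower()))
    if words & BLACKLIST:
        return "flag"           # caught by the cheap keyword check
    return "review"             # everything else goes to a moderator
```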
"I know it when I see it."
— United States Supreme Court Justice Potter Stewart, 1964
You can find many whitepapers on the net dealing with this subject.
It is not rocket science. Not anymore. It is very similar to face recognition. I think the easiest way to deal with it is machine learning. Since we are dealing with images, I'd point towards neural networks, because these seem to be preferred for images. You will need training data. You can find tons of it on the internet, but you have to crop the images to the specific parts you want the algorithm to detect. Of course, you'll have to break the problem into the different body parts you want to detect and create training data for each, and this is where things become amusing.
Like someone above said, it cannot be done 100%. There will be cases where such algorithms fail. The actual precision will be determined by your training data, the structure of your neural networks, and how you choose to cluster the training data (penises, vaginas, breasts, etc., and combinations thereof). In any case, I am very confident that this can be achieved with high accuracy for explicit porn imagery.
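The supervised-learning loop the answer has in mind can be shown in toy form. A real system would use a CNN per body-part class; here a tiny logistic-regression classifier on precomputed patch features stands in for it, with all data and names being illustrative:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Gradient descent on logistic loss; X is (n, d), y holds 0/1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # derivative of loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """0/1 labels at the 0.5 probability cutoff."""
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
```

The interesting engineering work is in the feature extraction and the per-body-part datasets, not in this loop; swapping in a convolutional network changes the model, not the overall train/predict structure.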
This is a nudity detector. I haven't tried it. It's the only OSS one I could find.
https://code.google.com/p/nudetech
There is no way you could do this 100% (I'd say maybe 1-5% would be plausible) with today's knowledge. You would get much better results (than that 1-5%) just by checking image names for sex-related words :).
@SO Troll: So true.
This is actually reasonably easy. You can programmatically detect skin tones, and porn images tend to have a lot of skin. This will create false positives, but if that's a problem you can pass the detected images through actual moderation. This not only greatly reduces the work for moderators but also gets you lots of free porn. It's win-win.
This code measures skin tones in the center of the image. I've tested it on 20 relatively tame "porn" images and 20 completely innocent images. It flags 100% of the "porn" and 4 of the 20 clean images. That's a pretty high false-positive rate, but the script aims to be fairly cautious and could be tuned further. It works on light, dark, and Asian skin tones.
Its main weakness for false positives is brown objects like sand and wood, and of course it doesn't know the difference between "naughty" and "nice" flesh (like face shots).
Its weakness for false negatives is images without much exposed flesh (like leather bondage), painted or tattooed skin, B&W images, etc.
source code and sample images
This was written in 2000, not sure if the state of the art in porn detection has advanced at all, but I doubt it.
http://www.dansdata.com/pornsweeper.htm
I would rather allow users to report bad images. Image-recognition development can take too much effort and time, and it won't be as accurate as human eyes. It's much cheaper to outsource that moderation job.
Take a look at: Amazon Mechanical Turk
"The Amazon Mechanical Turk (MTurk) is one of the suite of Amazon Web Services, a crowdsourcing marketplace that enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do."
BOOM! Here is the whitepaper containing the algorithm.
Does anyone know where to get the source code for a Java (or any-language) implementation?
That would rock.
One algorithm, called WISE, has a 98% accuracy rate but a 14% false-positive rate. So what you do is let users flag the 2% of false negatives, ideally with automatic removal once a certain number of users flag an image, and have moderators review the 14% of false positives.
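The flag-plus-review workflow described above could be wired up as a simple routing function. The 0.5 score cutoff and the three-flag limit are made-up values, not anything from the WISE paper:

```python
def triage(score: float, user_flags: int, flag_limit: int = 3) -> str:
    """Route one image given a detector score in [0, 1] and its flag count.

    The 0.5 cutoff and the 3-flag limit are invented; tune both in practice.
    """
    if user_flags >= flag_limit:
        return "remove"      # enough users caught a false negative
    if score >= 0.5:
        return "moderator"   # detector positive: human review (covers the FPs)
    return "publish"         # detector negative: show it, but keep it flaggable
```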
Nude.js is based on the whitepaper by Rigan Ap-apid from De La Salle University.
There is software that estimates the probability that an image is porn, but this is not an exact science, as computers can't recognize what is actually in a picture (a picture is just a big grid of values with no inherent meaning). You can only teach the computer what is porn and what isn't by giving it examples. The disadvantage is that it will only recognize those or similar images.
Given the repetitive nature of porn, you have a good chance if you train the system with few false positives. For example, if you train the system on nude people, it may flag beach pictures with "almost" naked people as porn too.
A similar piece of software is the Facebook face-recognition feature that recently came out. It's just specialized for faces, but the main principle is the same.
Technically, you would implement some kind of feature detector that uses Bayes filtering. A simple detector might look at features like the percentage of flesh-colored pixels, or it might compute the similarity of the current image to a set of saved porn images.
This is of course not limited to porn; porn is actually more of a corner case. I think more common are systems that try to find other things in images ;-)
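A Bayes filter of the kind described might be sketched like this, using the flesh-pixel fraction as a single feature. This is purely illustrative: a tiny Gaussian model rather than a full spam-filter-style token model, with made-up training values:

```python
import math

class GaussianBayesFilter:
    """One-feature Gaussian naive Bayes, e.g. on the flesh-pixel fraction."""

    def fit(self, values, labels):
        self.stats = {}
        for c in set(labels):
            xs = [v for v, l in zip(values, labels) if l == c]
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6  # avoid 0
            self.stats[c] = (mu, var, len(xs) / len(labels))
        return self

    def predict(self, value):
        """Return the class with the highest posterior for this feature value."""
        def log_posterior(c):
            mu, var, prior = self.stats[c]
            return (math.log(prior)
                    - 0.5 * math.log(2 * math.pi * var)
                    - (value - mu) ** 2 / (2 * var))
        return max(self.stats, key=log_posterior)
```

A real detector would combine several such features (skin fraction, texture, similarity to known images) rather than rely on one.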
The answer is really easy: it's pretty safe to say this won't be possible in the next two decades. Before that, we will probably get good translation tools. The last time I checked, the AI guys were struggling to identify the same car in two photographs shot from slightly different angles. Look at how long it took them to get good-enough OCR or speech recognition together. Those are recognition problems that can benefit greatly from dictionaries, and they are still far from having completely reliable solutions despite the multi-million man-months thrown at them.
That being said, you could simply add an "offensive?" link next to user-generated content and have a mod cross-check the incoming complaints.
edit:
I forgot something: if you are going to implement some kind of filter, you will need a reliable one. If your solution were 50% right, 2,000 out of 4,000 users with decent images would get blocked. Expect an outrage.
A graduate student from National Cheng Kung University in Taiwan did research on this subject in 2004. He was able to achieve an 89.79% success rate in detecting nude pictures downloaded from the Internet. Here is the link to his thesis: The Study on Naked People Image Detection Based on Skin Color.
It's in Chinese, so you may need a translator in case you can't read it.