Image filtering for unsafe images
Now, I have a website that crawls images. The images are served based on each user's preference for whether unsafe (18+) images are allowed or not.
Right now we sort out the images ourselves, and it takes a very long time since we get a lot of image submissions per day.
I know Google does this pretty well.
I just want the images of a sexual and pornographic nature to be sorted out. Girls in bikinis are fine.
I had an idea in mind where the program would search an image for the patterns of the images that I don't want to be shown. For example, searching an image for private parts and then, if the pattern is found, marking it as unsafe.
I was wondering whether there is any program or algorithm in PHP that can be used to perform this for us?
2 Answers
Even though SimpleCoder's solution is by far more sophisticated than this, I would still recommend manually moderating the images. Unless you spend thousands of dollars making some extremely advanced algorithm, you will always have false positives and negatives. Just as a little experiment, I went to http://pikture.logikit.net/Demo/index and uploaded 8 images. 6 were clean and 2 were explicit. Of the two explicit images, one was falsely marked as clean. Of the six clean images, four were falsely marked as explicit. Mind you, I purposely tried to fool it by choosing images that I thought a computer would get confused with, and it turns out it was pretty easy. Their program scored a measly 37.5%.
Here are a few recommendations that should at least make life somewhat easier for the moderators and shouldn't be too difficult to implement programmatically:
1) Take all currently rejected images (if possible), hash the files, and store the hashes in a database. Hash all new submissions as they come in and check each hash against the existing hashes. If a match is found, automatically flag the image. When an admin manually rejects an image, add its hash to the database as well. This will at least prevent you from having to re-flag duplicates (see the hashing sketch after this list).
2) Add weight to the $isPornScore of all images from an entire domain if any explicit content is found in any file on that domain. Perhaps more weight should be given for multiple occurrences on one domain. Treat domains that hotlink images from these domains similarly.
3) Check the domain name itself. If it contains explicit language, add to the $isPornScore. The same should be done for the URI of the image and of the page containing the anchor tag (if they differ).
4) Check the text around the image. Even though this isn't 100% accurate, if you have a blatant "Farm sexxx with three women and ..." somewhere on a page, you can at least increase the likelihood score that the images on that page (or domain) are explicit.
5) Use any other techniques or criteria you can and apply an overall "score" to the image. Then, using your own judgment and/or trial and error, automatically flag the image as explicit if the score is higher than a certain amount. Try to reach an acceptable balance between false positives and the cost of letting an explicit image through unflagged. If an image is not automatically flagged as explicit, still require moderator intervention. A rough sketch of this scoring approach follows the hashing example below.
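A minimal sketch of the hashing idea in point 1, in PHP, assuming a PDO connection to MySQL and a table called rejected_hashes with a unique hash column (the table and column names are placeholders, not anything from the original post):

<?php
// Point 1: reject re-submissions of images that have already been rejected.
// Assumes a PDO connection and a `rejected_hashes` table with a unique `hash`
// column; both names are illustrative only.

function isAlreadyRejected(PDO $db, string $imagePath): bool
{
    $hash = sha1_file($imagePath); // hash the file contents, not the filename
    $stmt = $db->prepare('SELECT 1 FROM rejected_hashes WHERE hash = ? LIMIT 1');
    $stmt->execute([$hash]);
    return $stmt->fetchColumn() !== false;
}

function recordRejection(PDO $db, string $imagePath): void
{
    // INSERT IGNORE is MySQL syntax; adjust for other databases.
    $stmt = $db->prepare('INSERT IGNORE INTO rejected_hashes (hash) VALUES (?)');
    $stmt->execute([sha1_file($imagePath)]);
}

// On submission: auto-flag known rejects, otherwise queue for a moderator.
// When a moderator rejects an image, call recordRejection() so the duplicate
// never has to be reviewed again.

Note that a plain file hash only catches byte-identical duplicates; a resized or re-encoded copy of the same image would need a perceptual hash instead.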
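And a rough sketch of the scoring idea in points 2-5, again in PHP. The keyword list, weights and threshold are placeholders you would have to tune against your own data by trial and error:

<?php
// Points 2-5: accumulate an $isPornScore from contextual signals and
// auto-flag above a threshold. Terms, weights and threshold are placeholders.

function explicitTermHits(string $text, array $terms): int
{
    $hits = 0;
    foreach ($terms as $term) {
        $hits += substr_count(strtolower($text), $term);
    }
    return $hits;
}

function scoreImage(string $imageUrl, string $pageUrl, string $surroundingText, int $priorHitsOnDomain): float
{
    $terms = ['porn', 'xxx', 'sexxx'];                 // illustrative list only
    $host  = (string) parse_url($imageUrl, PHP_URL_HOST);
    $score = 0.0;

    $score += 2.0 * $priorHitsOnDomain;                          // 2) explicit content already seen on this domain
    $score += 3.0 * explicitTermHits($host, $terms);             // 3) domain name
    $score += 1.5 * explicitTermHits($imageUrl, $terms);         // 3) image URI
    $score += 1.5 * explicitTermHits($pageUrl, $terms);          // 3) URI of the page with the anchor tag
    $score += 1.0 * explicitTermHits($surroundingText, $terms);  // 4) text around the image

    return $score;
}

// 5) Threshold chosen by trial and error; anything below it still goes to a moderator.
$isPornScore = scoreImage(
    'http://example-sexxx.com/img/123.jpg',      // made-up example values
    'http://example-sexxx.com/gallery.html',
    'Farm sexxx with three women and ...',
    2
);
if ($isPornScore >= 5.0) {
    // flagAsExplicit($submissionId);   // hypothetical helper
}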
I'm assuming you want to filter based on image content, and not context (e.g. what words are around the image).
That's some pretty intense AI. You will need to train an algorithm so it can 'learn' what an unsafe image looks like. Here is a great paper on the subject: http://www.stanford.edu/class/cs229/proj2005/HabisKrsmanovic-ExplicitImageFilter.pdf
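Purely as an illustration of the kind of image-content feature such a classifier might start from, here is a crude skin-tone pixel ratio check using PHP's GD extension. This is not the method from the paper; the RGB rule and the 40% threshold are arbitrary, and on its own it will happily misclassify faces, beaches and portraits, so at best it could contribute one more weight to an overall score.

<?php
// Crude skin-pixel ratio heuristic using the GD extension.
// The RGB rule and the 0.4 threshold are arbitrary placeholders, not the
// paper's method; expect plenty of false positives and negatives.

function skinPixelRatio(string $jpegPath): float
{
    $img = imagecreatefromjpeg($jpegPath);
    if ($img === false) {
        return 0.0;
    }

    $width  = imagesx($img);
    $height = imagesy($img);
    $skin   = 0;
    $total  = 0;

    // Sample every 4th pixel in each direction to keep this cheap.
    for ($x = 0; $x < $width; $x += 4) {
        for ($y = 0; $y < $height; $y += 4) {
            $rgb = imagecolorat($img, $x, $y);
            $r = ($rgb >> 16) & 0xFF;
            $g = ($rgb >> 8) & 0xFF;
            $b = $rgb & 0xFF;

            // A common, very rough skin-colour rule in RGB space.
            if ($r > 95 && $g > 40 && $b > 20 &&
                $r > $g && $r > $b && abs($r - $g) > 15) {
                $skin++;
            }
            $total++;
        }
    }

    imagedestroy($img);
    return $total > 0 ? $skin / $total : 0.0;
}

// Example: treat a high skin ratio as one more signal, never as a verdict.
if (skinPixelRatio('/tmp/upload.jpg') > 0.4) {
    // add weight to $isPornScore and/or queue for manual review
}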