Profanity filter with fuzzy search in Ruby on Rails
I am a Rails newbie.
I am using the profanity_filter Ruby gem to filter foul words in my content application.
If there is a foul word, let's say "foulword", profanity_filter returns "f******d".
But if a user plays smart and types "foulwoord", "foulwordd", "foulllword", etc., it is not detected as a foul word.
Is there a way to make sure it detects these cleverly disguised foul words?
Looking forward to your help!
Thank you!
How many foul words do you need to filter?
One approach would be to use something like Diff::LCS (from the diff-lcs gem) to check how many letters are different between the word being checked and each foul word. If you have a large number of foul words to check, this could be very slow. One thing you could do to make it much faster would be to include a dictionary of "good" words. Keep the "good" dictionary in a Set, and before checking each content word, first test whether it is in the dictionary. If so, you can move on. (If you want to make checking the dictionary very fast, keep it in a search trie.)

Further, if you check a word and find that it is OK, you could add it to the dictionary so you don't need to check the same word again. The danger here is that the dictionary may grow too large. If this is a problem, you could use something similar to a "least recently used" cache which, when the dictionary becomes too big, would discard "good" words which have not been seen recently.
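The edit-distance idea above could be sketched like this. Diff::LCS would work, but to keep the example dependency-free this uses a hand-rolled Levenshtein distance; the `FOUL_WORDS` and `GOOD_WORDS` lists are made-up placeholders for your real data:

```ruby
require 'set'

# Illustrative placeholder word lists -- substitute your own.
FOUL_WORDS = ['foulword'].freeze
GOOD_WORDS = Set.new(%w[hello world content]).freeze

# Classic dynamic-programming Levenshtein edit distance.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,          # deletion
               curr[j - 1] + 1,      # insertion
               prev[j - 1] + cost].min # substitution
    end
    prev = curr
  end
  prev.last
end

# A word is suspect if it is within 1 edit of any foul word,
# unless the fast "good" dictionary check clears it first.
def suspect?(word)
  w = word.downcase
  return false if GOOD_WORDS.include?(w)
  FOUL_WORDS.any? { |f| edit_distance(w, f) <= 1 }
end

puts suspect?('foulwoord') # prints "true" -- one edit away from "foulword"
puts suspect?('hello')     # prints "false" -- cleared by the dictionary
```

Note the threshold of 1 edit would still miss "foulllword" (two extra letters); raising the threshold catches more variants but also raises the false-positive rate, which is the trade-off discussed below.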
Another approach would be to generate variants on each foul word, and store them in a "bad" dictionary. If you generate each word which differs by 1 letter from a foul word, there would be about 200-500 for each foul word. You could also generate words which differ from a foul word only by changing the letter "o" to a zero, etc.
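The variant-generation approach might be sketched as follows; the alphabet and the letter-to-digit map here are illustrative assumptions, and `tr` swaps every occurrence of a letter (a fuller version would also generate partial swaps):

```ruby
require 'set'

ALPHABET = ('a'..'z').to_a.freeze
LEET = { 'o' => '0', 'i' => '1', 'e' => '3', 'a' => '4', 's' => '5' }.freeze

# All words at edit distance 1 from `word` (deletions, substitutions,
# insertions), plus simple letter-to-digit swaps.
def variants(word)
  out = Set.new
  word.length.times do |i|
    out << word[0...i] + word[i + 1..]                            # delete one letter
    ALPHABET.each { |c| out << word[0...i] + c + word[i + 1..] }  # substitute one letter
  end
  (word.length + 1).times do |i|
    ALPHABET.each { |c| out << word[0...i] + c + word[i..] }      # insert one letter
  end
  LEET.each { |ch, digit| out << word.tr(ch, digit) if word.include?(ch) }
  out.delete(word) # the word itself is stored separately
  out
end

bad_dictionary = Set.new(['foulword'])
bad_dictionary.merge(variants('foulword'))
```

Checking incoming words is then a constant-time `bad_dictionary.include?(word)` lookup, at the cost of a few hundred stored variants per foul word.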
No matter what you do, you are never going to catch 100% of "bad" words without ever mistakenly flagging a "good" word. If you can get a filter which catches an acceptably high percentage of "bad" words, with an acceptably low rate of false positives, that will be "success".
If you are doing this for a web site, I suggest that rather than blocking content with "bad" words, you automatically flag it for moderator attention. If allowing obscene content to go up on the site even briefly is unacceptable, you could delay displaying flagged content until after a moderator has looked at it. This will avoid the Scunthorpe problem that @Blorgbeard mentioned in his comment.
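The "flag, don't block" workflow could look something like this in plain Ruby (the `Post` struct and the inline filter lambda are hypothetical stand-ins; in Rails this would likely be a model callback and a `status` column):

```ruby
# Content is always stored, but suspect content starts hidden
# until a moderator approves it.
Post = Struct.new(:body, :status) do
  def visible?
    status == :visible
  end
end

def submit(body, filter)
  status = filter.call(body) ? :pending_review : :visible
  Post.new(body, status)
end

# Stand-in for the real fuzzy check from the examples above.
suspect = ->(text) { text.downcase.include?('foulword') }

ok  = submit('hello world', suspect)         # goes live immediately
bad = submit('some foulword here', suspect)  # held for a moderator
```

This way a false positive costs only a moderation delay, rather than wrongly rejecting a legitimate post.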