Basic profanity filter in Objective-C for iPhone

Posted 2024-09-01 16:47:31

How have like-minded people tackled the basic challenge of filtering profanity? Obviously no filter can handle every scenario, but it would be nice to have a basic one as a first line of defense.

In Obj-C I've got:

NSArray *tokens = [text componentsSeparatedByString:@" "];

Then I loop through each token to see whether any of the keywords (I have about 400 in a list) occurs within it.

Because false positives are also a problem, a token that is a perfect match for a keyword is flagged as profanity. Otherwise, the text is flagged only if more than 3 words contain a keyword without being perfect matches.

Later on I will use a web service that tackles the problem more precisely, but for now I just need something basic. So if you wrote the word "penis", it would respond: yup, naughty naughty, bad word written.
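The rule described above can be sketched roughly as follows. This is only an illustration under assumptions, not the asker's actual code: the function name `TextLooksProfane` is hypothetical, the keywords are assumed to be stored lowercase in an `NSSet`, and each token is counted at most once toward the partial-match total.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns YES if `text` should be flagged.
// Assumes `keywords` holds the blocked words in lowercase.
static BOOL TextLooksProfane(NSString *text, NSSet<NSString *> *keywords) {
    NSArray<NSString *> *tokens =
        [[text lowercaseString] componentsSeparatedByString:@" "];
    NSUInteger partialMatches = 0;

    for (NSString *token in tokens) {
        if (token.length == 0) {
            continue;                      // gap left by repeated spaces
        }
        if ([keywords containsObject:token]) {
            return YES;                    // perfect match: flag immediately
        }
        for (NSString *keyword in keywords) {
            if ([token rangeOfString:keyword].location != NSNotFound) {
                partialMatches++;          // keyword embedded in a longer token
                break;                     // count each token at most once
            }
        }
    }
    return partialMatches > 3;             // "more than 3" partial matches
}
```

With a placeholder list such as `@"darn"` and `@"heck"`, the text "well darn it" is flagged by the exact-match rule, while "darnell said hi" is not, because it contains only one partial match.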

三人与歌 2024-09-08 16:47:31

Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

Before you start writing code like this, Jeff has an interesting article worth considering:

http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredible-intercoursing-bad-idea.html

尹雨沫 2024-09-08 16:47:31

I just have a suggestion for tokenizing the string. Your approach works well if the words are all separated by spaces, but that is rarely the case in practice, since you normally have to deal with newlines, punctuation, etc. Try this if you are interested:

NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet];

[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

NSArray *words = [bigString componentsSeparatedByCharactersInSet:separators];

Source: http://www.tech-recipes.com/rx/3418/cocoa-explode-break-nsstring-into-individual-words/
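One detail worth noting about the snippet above: `componentsSeparatedByCharactersInSet:` returns an empty string between two adjacent separators (for example a comma followed by a space), so the result usually needs filtering before it is compared against a word list. A minimal sketch, where the helper name `WordsInString` and the lowercasing step are assumptions added for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper wrapping the tokenizing snippet above.
static NSArray<NSString *> *WordsInString(NSString *bigString) {
    NSMutableCharacterSet *separators =
        [NSMutableCharacterSet punctuationCharacterSet];
    [separators formUnionWithCharacterSet:
        [NSCharacterSet whitespaceAndNewlineCharacterSet]];

    // Lowercase first so later comparisons are case-insensitive (an assumption).
    NSArray<NSString *> *pieces =
        [[bigString lowercaseString] componentsSeparatedByCharactersInSet:separators];

    // Adjacent separators such as ", " or "\n\n" yield empty strings; drop them.
    NSPredicate *nonEmpty = [NSPredicate predicateWithFormat:@"length > 0"];
    return [pieces filteredArrayUsingPredicate:nonEmpty];
}
```

Because the punctuation set includes hyphens and apostrophes, a word like "good-bye" is split into two tokens, which may or may not be what a filter wants.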

我还不会笑 2024-09-08 16:47:31

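Well, searching this way is certainly not the most efficient way to look for profanity. A more efficient approach is to construct a finite state automaton (FSA) that detects the words and to run the text through that FSA once. You don't really need to split the string to find profanity, and all that splitting adds allocation and copying overhead you don't need. Also, some of your blacklisted words may share common patterns, which you cannot exploit when you search for each word individually.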
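As a rough illustration of that idea, the sketch below uses a trie, which is a simple acyclic automaton in which keywords with a common prefix share states. It is not a full Aho-Corasick matcher and only detects whole words. The names `BuildTrie`, `ContainsBlockedWord`, and `kWordEnd` are hypothetical, and treating any non-letter as a word boundary is an assumption.

```objc
#import <Foundation/Foundation.h>

// Marker key: a keyword ends at the node that contains it.
static NSString *const kWordEnd = @"end";

// Build a trie: each node maps a character (boxed as NSNumber) to a child node.
static NSMutableDictionary *BuildTrie(NSArray<NSString *> *keywords) {
    NSMutableDictionary *root = [NSMutableDictionary dictionary];
    for (NSString *word in keywords) {
        NSString *lower = [word lowercaseString];
        NSMutableDictionary *node = root;
        for (NSUInteger i = 0; i < lower.length; i++) {
            NSNumber *key = @([lower characterAtIndex:i]);
            NSMutableDictionary *next = node[key];
            if (next == nil) {
                next = [NSMutableDictionary dictionary];
                node[key] = next;
            }
            node = next;
        }
        node[kWordEnd] = @YES;
    }
    return root;
}

// Walk the text from start to end without splitting it into tokens.
static BOOL ContainsBlockedWord(NSString *text, NSDictionary *root) {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSString *lower = [text lowercaseString];
    NSUInteger length = lower.length;
    NSUInteger i = 0;

    while (i < length) {
        if (![letters characterIsMember:[lower characterAtIndex:i]]) {
            i++;                           // skip separators between words
            continue;
        }
        NSDictionary *node = root;         // restart the automaton at a word start
        while (i < length &&
               [letters characterIsMember:[lower characterAtIndex:i]]) {
            // node becomes nil once the word leaves the trie and stays nil
            node = node[@([lower characterAtIndex:i])];
            i++;
        }
        if (node[kWordEnd] != nil) {
            return YES;                    // the whole word is a keyword
        }
    }
    return NO;
}
```

The trie is built once from the keyword list and can then be reused for every piece of text, so the cost per check depends on the length of the text rather than on the number of keywords.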

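That said, I think 400 words is a lot. Who exactly is your audience? What if a user has a medical question? Should such questions really be prohibited? I can think of only a few words that would be considered profanity in every context, so you may want to reconsider the filtering.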
