Basic profanity filter in Objective-C for iPhone

Posted 2024-09-01 16:47:31

How have you like-minded individuals tackled the basic challenge of filtering profanity? Obviously one can't possibly handle every scenario, but it would be nice to have a filter at the most basic level as a first line of defense.

In Obj-c I've got

NSArray *tokens = [text componentsSeparatedByString:@" "];

And then I loop through each token to see if any of the keywords (I've got about 400 in a list) are found within each token.

Realising that false positives are also a problem: if a token is a perfect match for a keyword, it is flagged as profanity; otherwise, if more than 3 tokens contain a keyword without being perfect matches, the text is also flagged as profanity.

Later on I will use a web service that tackles the problem more precisely, but I really just need something basic. So if you wrote the word penis, it would go: yup, naughty naughty, bad word written.
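The rule described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `TextLooksProfane` is hypothetical, the keyword list is assumed to hold lowercase entries, and the text is split on single spaces exactly as in the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns YES when `text` should be flagged.
// Assumes `keywords` contains lowercase strings.
static BOOL TextLooksProfane(NSString *text, NSArray *keywords) {
    NSArray *tokens = [[text lowercaseString] componentsSeparatedByString:@" "];
    NSUInteger partialMatches = 0;

    for (NSString *token in tokens) {
        BOOL containsKeyword = NO;
        for (NSString *keyword in keywords) {
            if ([token isEqualToString:keyword]) {
                return YES;  // perfect match: flag immediately
            }
            if ([token rangeOfString:keyword].location != NSNotFound) {
                containsKeyword = YES;  // keyword embedded in a longer token
            }
        }
        if (containsKeyword) {
            partialMatches++;  // count each token at most once
        }
    }
    // Flag only when more than 3 tokens contain a keyword without matching exactly.
    return partialMatches > 3;
}
```

With about 400 keywords this is one substring search per keyword per token, which should be acceptable for short user input but grows quickly with longer text.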

Comments (4)

三人与歌 2024-09-08 16:47:31

Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

Jeff has an interesting article to consider before embarking on such a piece of code:

http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html

尹雨沫 2024-09-08 16:47:31

I just have a suggestion for tokenizing the string. Your way works well if the words are all separated by spaces, but that is rarely the case in most usage scenarios, as you would normally have to deal with newlines, punctuation, etc. Try this if you are interested:

NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet];

[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

NSArray *words = [bigString componentsSeparatedByCharactersInSet:separators];

Source: http://www.tech-recipes.com/rx/3418/cocoa-explode-break-nsstring-into-individual-words/
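One detail to watch for with this approach: `componentsSeparatedByCharactersInSet:` returns an empty string wherever two separators sit next to each other (for example a comma followed by a space). A small wrapper, sketched here with the hypothetical name `WordsInString`, could drop those and lowercase the rest before matching:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits on punctuation and whitespace,
// discards empty pieces, and lowercases each word.
static NSArray *WordsInString(NSString *bigString) {
    NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet];
    [separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    NSArray *pieces = [bigString componentsSeparatedByCharactersInSet:separators];
    NSMutableArray *words = [NSMutableArray arrayWithCapacity:[pieces count]];
    for (NSString *piece in pieces) {
        if ([piece length] > 0) {  // skip the gaps between adjacent separators
            [words addObject:[piece lowercaseString]];
        }
    }
    return words;
}
```

Note that apostrophes count as punctuation here, so a word like "don't" is split into two pieces.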

我还不会笑 2024-09-08 16:47:31

Well, searching in that manner is certainly not the most efficient way to search for profanity... a more efficient approach would be to construct a finite state automaton to detect the words, and run the text once through that FSA. You don't really need to split strings to find profanity, and all that splitting adds extra allocation and copying overhead that you don't need. Also, there may be common patterns in some of the blacklisted words, which you are not exploiting by searching each word individually.

That said, I think 400 words is quite a lot. Who, exactly, is your audience? What if a user has a medical question? Should such questions actually be disallowed? I can only think of a handful of words that would be considered profane in any context, so you might want to rethink the filtering.
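As a rough illustration of the idea, here is a trie-based scan. It is a simplification rather than a full automaton: a proper Aho-Corasick FSA would add failure links so that the text is read strictly once, whereas this sketch restarts the walk at each position. The names `BuildTrie` and `TrieMatchesText` are hypothetical, and the trie is built from nested dictionaries purely for brevity.

```objc
#import <Foundation/Foundation.h>

// Key under which a node records that a keyword ends there.
static NSString * const kEndOfWord = @"end";

// Builds a trie; each node maps a character (boxed as NSNumber) to a child node.
static NSMutableDictionary *BuildTrie(NSArray *keywords) {
    NSMutableDictionary *root = [NSMutableDictionary dictionary];
    for (NSString *keyword in keywords) {
        NSString *lower = [keyword lowercaseString];
        NSMutableDictionary *node = root;
        for (NSUInteger i = 0; i < [lower length]; i++) {
            NSNumber *key = [NSNumber numberWithUnsignedShort:[lower characterAtIndex:i]];
            NSMutableDictionary *child = [node objectForKey:key];
            if (child == nil) {
                child = [NSMutableDictionary dictionary];
                [node setObject:child forKey:key];
            }
            node = child;
        }
        [node setObject:[NSNumber numberWithBool:YES] forKey:kEndOfWord];
    }
    return root;
}

// Returns YES if any keyword occurs anywhere in `text` (substring match).
static BOOL TrieMatchesText(NSDictionary *root, NSString *text) {
    NSString *lower = [text lowercaseString];
    NSUInteger length = [lower length];
    for (NSUInteger start = 0; start < length; start++) {
        NSDictionary *node = root;
        for (NSUInteger i = start; i < length; i++) {
            NSNumber *key = [NSNumber numberWithUnsignedShort:[lower characterAtIndex:i]];
            node = [node objectForKey:key];
            if (node == nil) break;  // no keyword continues with this character
            if ([node objectForKey:kEndOfWord] != nil) return YES;
        }
    }
    return NO;
}
```

The trie is built once and reused for every message, and the scan itself does no string splitting. Because it matches substrings, it will also fire on keywords embedded in longer, innocent words.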

单身狗的梦 2024-09-08 16:47:31

A couple of things:

  • FSA won't necessarily work depending on how intelligent you want the filter to be
  • Regexes are generally extremely slow, depending on how many you want to run
  • 400 words is somewhat low, depending on your needs and languages
  • There are a number of extremely tricky cases to be careful of when filtering, particularly embedding of words such as "ASSume"
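To illustrate the embedding problem from the last point, a naive substring search flags "ASSume", while a check that requires non-letter characters on both sides of the hit does not. The following is only a sketch under that one assumption (letters on either side mean the keyword is embedded); the function name `ContainsWholeWord` is hypothetical and this is not how any particular product works.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES only if `keyword` occurs in `text`
// with no letter directly before or after it.
static BOOL ContainsWholeWord(NSString *text, NSString *keyword) {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSUInteger length = [text length];
    NSRange searchRange = NSMakeRange(0, length);

    while (searchRange.length > 0) {
        NSRange found = [text rangeOfString:keyword
                                    options:NSCaseInsensitiveSearch
                                      range:searchRange];
        if (found.location == NSNotFound) return NO;

        NSUInteger end = NSMaxRange(found);
        BOOL letterBefore = found.location > 0 &&
            [letters characterIsMember:[text characterAtIndex:found.location - 1]];
        BOOL letterAfter = end < length &&
            [letters characterIsMember:[text characterAtIndex:end]];
        if (!letterBefore && !letterAfter) return YES;

        // Embedded hit: resume one character past the start of this match.
        searchRange = NSMakeRange(found.location + 1, length - found.location - 1);
    }
    return NO;
}
```

This handles the "ASSume" case but not deliberate obfuscation such as inserted spaces or substituted characters, which is where the trickier cases start.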

My company, Inversoft, builds a commercial filtering solution, and it is quite intelligent. It doesn't use regex or FSA, but has a custom-built, fast linear processing technology that makes it extremely fast and accurate (4,000+ messages per second). It also has over 600 English words in a number of categories including Slang, Racial Slurs, Drug, Gang, Religious, etc.

If you are looking for an intelligent context-aware solution with support, you should check out Clean Speak from Inversoft. Hooking it up to Obj-C should be simple using the XML WebService.
