Profanity checking for promotional codes
I have a slightly unusual profanity-related question.
Now we're used to dealing with profanity-filtering of user-generated content — any method is imperfect, but products like CleanSpeak and WebPurify do a good-enough job.
The problem we have at the moment, though, is that we've been building an engine to run promotional-code-based competitions that will be used internationally. We could do with checking that none of these codes is profane in Latin American Spanish or Malay (at least in the first instance), to make sure we don't send out a code that's equivalent to FUCK23 or PEN15 or something.
We've tried Googling around and asking people we know, but we can't find an easy way of getting hold of an es-419 or an ms profanity list to filter the codes against. As there are literally millions of codes per locale, we'd rather do an offline check than hit an API for each code (which would be expensive both in terms of bandwidth and usage fees).
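For illustration only, here is a minimal sketch of what such an offline check could look like once a list is available. The `BANNED` set holds harmless placeholder words rather than a real es-419 or ms list, and the digit-to-letter map in `LEET` is an assumption about which look-alike substitutions matter (it is what would catch something like PEN15).

```python
# Hypothetical look-alike map: digits that commonly stand in for letters.
LEET = str.maketrans({"0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "7": "T"})

# Placeholder entries only; a real per-locale list would be loaded from a file.
BANNED = frozenset({"DARN", "HECK"})


def normalize(code: str) -> str:
    """Upper-case the code and fold look-alike digits into letters."""
    return code.upper().translate(LEET)


def is_profane(code: str, banned: frozenset = BANNED) -> bool:
    """Return True if a banned word appears in the raw or the folded code."""
    raw = code.upper()
    folded = normalize(code)
    return any(word in raw or word in folded for word in banned)


def filter_codes(codes, banned: frozenset = BANNED):
    """Keep only the codes that contain no banned word."""
    return [code for code in codes if not is_profane(code, banned)]
```

Calling `filter_codes(["H3CK42", "XK7Q2M", "DARN99"])` keeps only `"XK7Q2M"`. The linear scan over the list is the simplest option; with a long list and millions of codes, a multi-pattern matcher could be swapped in without changing the interface.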
I know this is a bit of a long shot, but does anyone know of a good source for profanity lists in different languages?
#disclaim: We know that no profanity filtering is perfect, that it's essentially futile with user-generated content, and we have read SO #273516: How do you implement a good profanity filter? That's not what we're asking.
Comments (2)
Building or finding lists in other languages is extremely time-consuming and difficult (trust me, we've built many of them at Inversoft). You might be better off tweaking the code generator instead (from what I can tell, the promotional codes are generated by your code rather than by humans).
The best way to tweak a generator is to ensure that the codes can't easily form words based on the general use of consonants and vowels in most European languages. Things get a bit dicey in Polish and others, but it usually works.
Generally, if a code starts with a vowel, the next character should be another vowel or a non-joining consonant (like 'q' without a 'u'). If the code starts with a consonant, the next character should be the same consonant or one that has a low probability of following it. For example, if you start with 's' then adding 'g' is a good choice.
You could also use Wiktionary or other similar sources (like Linux dictionary files) to build a statistical approach to this. By extracting the probability of characters being next to each other, you should be able to generate codes that are very unlikely to be words in any language.
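As a rough sketch of that statistical idea (not Inversoft's actual method), the snippet below counts adjacent letter pairs in a word list and then builds codes by always choosing the next letter from the least common followers of the previous one. The function names, the `keep=8` cutoff, and the restriction to A-Z are all assumptions made for the example; the word list would come from whatever dictionary files you have for each language.

```python
import random
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def bigram_counts(words):
    """Count how often each ordered letter pair occurs in the word list."""
    counts = Counter()
    for word in words:
        w = word.upper()
        for a, b in zip(w, w[1:]):
            if a in ALPHABET and b in ALPHABET:
                counts[a + b] += 1
    return counts


def rare_followers(counts, prev, keep=8):
    """Return the `keep` letters seen least often directly after `prev`."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(ALPHABET, key=lambda c: (counts[prev + c], c))
    return ranked[:keep]


def generate_code(counts, length=6, rng=random):
    """Build a code in which every adjacent pair is a rare bigram."""
    code = [rng.choice(ALPHABET)]
    while len(code) < length:
        code.append(rng.choice(rare_followers(counts, code[-1])))
    return "".join(code)
```

With counts built from a large multilingual dictionary, `generate_code(counts)` should mostly produce strings that look nothing like words. This lowers the odds of a real word appearing but does not guarantee it, so a final check against any list you do have is still worthwhile.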
However, if I misread your question and you aren't generating the codes programmatically, you can ignore my response completely. :)
I have had the same thoughts while trying to generate 6-character codes for a project I am doing.
I decided to reduce the likelihood of obviously profane codes, so I removed the vowels, which appear in as many "bad" words as I could think of, from my initial base-36 generation code. That left me with something more like a base-29 system that does not include a, e, i, o, u, 1 or 0. The one and zero were removed to reduce confusion between those characters and I, L and O in some fonts.
So far I have not seen a "profane" code generated, and a 29-character alphabet still gives about 595 million unique 6-character combinations.
I cannot vouch for other languages, and had not even considered them...
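A minimal sketch of that reduced-alphabet generator, assuming upper-case letters plus digits as the starting base-36 set and Python's `secrets` module as the random source (neither is specified in the answer). Removing the seven characters a, e, i, o, u, 1 and 0 from 36 leaves 29 symbols, so a 6-character code has 29^6 = 594,823,321 possible values.

```python
import secrets
import string

# Characters dropped from the base-36 set: the vowels, plus 1 and 0,
# which are easy to confuse with I, L and O in some fonts.
EXCLUDED = set("AEIOU10")

ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in EXCLUDED
)


def generate_code(length: int = 6) -> str:
    """Return a random code drawn only from the reduced alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def code_space(length: int = 6) -> int:
    """Number of distinct codes of the given length."""
    return len(ALPHABET) ** length
```

Without vowels, most words in Latin-script languages cannot be spelled outright, though digits such as 3 or 4 can still stand in for letters, so this reduces rather than eliminates the risk.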