Spanish profanity blacklist

Posted on 2024-09-28 18:03:26

I've been tasked with implementing a blacklist-based profanity filter for a Rails app. I know there are a ton of issues with blacklist-based filtering, but the decision was made above my head. The challenge: I'm looking for a good list of Spanish profanity to feed into the filter. For English, we're building on a list that exhaustively covers conjugations, plurals, etc., one entry per line of a text file. Does such a list exist in the public domain for Spanish?

Comments (1)

梦冥 2024-10-05 18:03:26

Finding good lists and tuning them is difficult. It also sounds like you are doing a lot of manual work that could be automated (e.g. conjugation). I did a lot of this for my company's profanity filter, CleanSpeak. Much of it can be automated using part-of-speech (POS) identifiers for each word, and in many cases you can either do the POS tagging manually or find a POS source.
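To illustrate what POS-driven expansion might look like, here is a minimal Ruby sketch. It is not how CleanSpeak works; the `SpanishForms` module, its method names, and the ending tables are all hypothetical, and it only covers regular patterns (gender and number for adjectives ending in -o, regular plurals, and a handful of regular -ar verb forms). Irregular words would still need hand-written entries. The words used are neutral placeholders.

```ruby
# Hypothetical helper: expand a base word into inflected forms from its POS tag.
# Only regular Spanish patterns are handled; this is an illustration, not a
# complete morphology.
module SpanishForms
  # A small, illustrative subset of endings for regular -ar verbs.
  AR_ENDINGS = %w[o as a amos áis an ando ado].freeze

  def self.expand(word, pos)
    forms =
      case pos
      when :adj
        if word.end_with?("o")
          # Adjectives ending in -o: masculine/feminine, singular/plural.
          stem = word[0...-1]
          %w[o a os as].map { |ending| stem + ending }
        else
          [word, pluralize(word)]
        end
      when :noun
        [word, pluralize(word)]
      when :verb
        if word.end_with?("ar")
          stem = word[0...-2]
          [word] + AR_ENDINGS.map { |ending| stem + ending }
        else
          [word] # other verb classes are left to manual entries
        end
      else
        [word]
      end
    forms.uniq
  end

  # Simplified regular plural: -z becomes -ces, vowels take -s, the rest -es.
  def self.pluralize(word)
    return word[0...-1] + "ces" if word.end_with?("z")
    word.match?(/[aeiouáéó]\z/) ? word + "s" : word + "es"
  end
end

# Example: a tiny POS-tagged seed list expanded into one-entry-per-line output.
seed = { "tonto" => :adj, "luz" => :noun, "hablar" => :verb }
puts seed.flat_map { |word, pos| SpanishForms.expand(word, pos) }
```

The output of something like this could be written to the same one-word-per-line text file format described in the question, so the filter itself would not need to change.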

You'll also need to consider the quality of the lists and the upkeep and management of the filter. A lot of people think it is simple and then realize that it is extremely difficult to prevent false positives.
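One common source of false positives is naive substring matching, where a blacklisted word is found inside a longer, innocent word. The sketch below is a hypothetical `WordFilter` class (not part of any library), assuming the list is a plain one-word-per-line text file as in the question. It matches whole words only, using `\p{L}` lookarounds so that accented letters and ñ count as part of a word. The word "mala" is just a neutral stand-in for a blacklisted term.

```ruby
# Hypothetical matcher: flags text only when a blacklisted word appears as a
# whole word, not as a fragment of a longer word.
class WordFilter
  def initialize(words)
    raise ArgumentError, "blacklist is empty" if words.empty?

    escaped = words.map { |w| Regexp.escape(w) }
    # (?<!\p{L}) and (?!\p{L}) require that no letter touches the match on
    # either side; \p{L} covers accented letters as well as ASCII ones.
    @pattern = /(?<!\p{L})(?:#{escaped.join("|")})(?!\p{L})/i
  end

  # Assumes one entry per line; blank lines are skipped.
  def self.from_file(path)
    new(File.readlines(path, chomp: true).map(&:strip).reject(&:empty?))
  end

  def flagged?(text)
    @pattern.match?(text)
  end
end

filter = WordFilter.new(%w[mala])
filter.flagged?("una mala idea")   # whole-word hit
filter.flagged?("el malabarista")  # "mala" is only a fragment here
```

Whole-word matching only addresses this one class of false positive. It also makes the exhaustive list of inflected forms more important, since a plural or conjugated form no longer matches through its stem.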

All that said, we found the majority of our lists for other languages difficult to come by online and ended up paying to have many of them built, or purchasing them from other companies. The lists we did find online turned out to be nearly worthless once we had them translated. We also attempted to take our English blacklist and have it translated, which was a complete failure because most English profanities don't have equivalents in other languages. I would suggest purchasing lists or working with students at your local university to generate them. A number of our customers found this method relatively good and not overly expensive.

I would also suggest taking a look at some of the resources out there that describe best practices for managing user-generated content. These will help guide you through any build vs. buy decisions.
