How can I detect Russian spam posts using Perl?

Posted on 2024-08-04 04:18:47

I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and regular expressions, to detect Russian text so I can block it?

Comments (3)

诗酒趁年少 2024-08-11 04:18:47

You can use the following to detect Cyrillic characters (used in Russian). Perl regexes write code points as \x{...}; \uXXXX is not a code point escape in Perl:

[\x{0400}-\x{04FF}]+

If you really just want Russian characters, you can take a look at the aforesaid document, which contains the exact range used for the basic Russian alphabet, [\x{0410}-\x{044F}]. Of course, you would also need to consider the extended Cyrillic characters used in Russian that fall outside that range, such as Ё (U+0401) and ё (U+0451), which are also mentioned in the document.
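A minimal sketch of how the Cyrillic range above might be applied to a forum post. The helper names cyrillic_ratio and looks_russian, and the 0.5 threshold, are illustrative assumptions and not part of the answer. The text has to be decoded from bytes into characters first, or the range will not match. Perl's built-in \p{Cyrillic} property could be swapped in for the explicit range.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: share of letters in $text that fall in U+0400..U+04FF.
# $text must be a decoded character string, not raw bytes.
sub cyrillic_ratio {
    my ($text) = @_;
    my $letters  = () = $text =~ /\p{L}/g;                # count all letters
    my $cyrillic = () = $text =~ /[\x{0400}-\x{04FF}]/g;  # count Cyrillic block
    return $letters ? $cyrillic / $letters : 0;
}

# Hypothetical policy: flag a post when more than $threshold of its
# letters are Cyrillic (0.5 is an arbitrary starting point).
sub looks_russian {
    my ($text, $threshold) = @_;
    $threshold = 0.5 unless defined $threshold;
    return cyrillic_ratio($text) > $threshold ? 1 : 0;
}

# Raw UTF-8 bytes for a Russian word, as they might arrive from a form.
my $raw  = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
my $post = decode('UTF-8', $raw);
print looks_russian($post) ? "blocked\n" : "accepted\n";
```

Using a ratio rather than a single match avoids rejecting an English post that merely quotes one Cyrillic word. The right threshold depends on the forum's traffic.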

拍不死你 2024-08-11 04:18:47

Using the Unicode Cyrillic range, as suggested by JG, is fine if everything is encoded as Unicode. However, this is spam, and for the most part it is not. Additionally, spammers very often mix charsets within a single message, which further undermines this approach.

I find that the best way of detecting Russian spam (or at least the first step in the process) is to grep for the most commonly used charsets:

koi8-r
windows-1251
iso-8859-5

The next step would be to run a language detection algorithm on what remains. If the problem is big enough, use a paid service such as Google Translate (which also "detects" the language) or Xerox. In my opinion, these services provide the best language detection around.
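The charset check could be sketched as follows. This assumes the charset label is available in a Content-Type header or similar metadata string, which the answer does not specify. The helper name declares_russian_charset is hypothetical.

```perl
use strict;
use warnings;

# Charset labels listed in the answer.
my @RUSSIAN_CHARSETS = qw(koi8-r windows-1251 iso-8859-5);

# Hypothetical helper: true if a header or metadata string declares
# one of the charsets above, case-insensitively.
sub declares_russian_charset {
    my ($header) = @_;
    return 0 unless defined $header;
    my $alternation = join '|', map { quotemeta } @RUSSIAN_CHARSETS;
    return $header =~ /charset\s*=\s*["']?(?:$alternation)\b/i ? 1 : 0;
}

print declares_russian_charset('Content-Type: text/html; charset=KOI8-R')
    ? "suspicious\n"
    : "ok\n";
```

A match only says which encoding the sender declared, so it works best as a cheap first filter ahead of the language detection step.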

枯寂 2024-08-11 04:18:47

For anyone landing here, there is a very decent Perl module, Lingua::Guess, that detects the language of a string. On the command line, for example, you could use it like this:

echo "Je suis en train d'essayer ce module, et voyons si ça marche bien." | perl -MLingua::Guess -MEncode=decode_utf8 -nlE 'say "Language is ", Lingua::Guess->new->guess( decode_utf8($_) )->[0]->{name}'

would yield: Language is french

echo "このモジュールを上手く使えるかな。" | perl -MLingua::Guess -MEncode=decode_utf8 -nlE 'say "Language is ", Lingua::Guess->new->guess( decode_utf8($_) )->[0]->{name}'

would yield: Language is japanese

In this example, -MLingua::Guess loads the Perl module Lingua::Guess, and -MEncode=decode_utf8 loads the core module Encode and imports its decode_utf8 function, which decodes the UTF-8 input into Perl's internal character strings. The -n switch loops over the input lines, -l strips each trailing newline, and -E enables say.
