How can I detect Russian spam posts using Perl?
I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and a regex, to detect Russian text so that I can block it?
Comments (3)
You can use the following to detect Cyrillic characters (used in Russian):
If you really want only Russian characters, take a look at the aforesaid document, which gives the exact range used for the basic Russian alphabet:
[\u0410-\u044F]
(In Perl, this range is written [\x{0410}-\x{044F}], because Perl has no \u code point escape.) Of course, you would also need to consider the extended Cyrillic characters that are used exclusively in Russian, which the document also mentions.
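As a rough illustration of the Cyrillic check described in this answer, here is a minimal Perl sketch. The helper names cyrillic_ratio and looks_russian, and the 0.5 threshold, are assumptions made for this example and do not come from the answer. The sketch relies on Perl's built-in \p{Cyrillic} property and expects text that has already been decoded into characters. The basic range U+0410 to U+044F does not include the letters U+0401 and U+0451, so the narrower pattern adds them explicitly.

```perl
use strict;
use warnings;

# Basic Russian alphabet plus the two "yo" letters (U+0401, U+0451),
# which fall outside the U+0410..U+044F range.
my $basic_russian = qr/[\x{0410}-\x{044F}\x{0401}\x{0451}]/;

# Hypothetical helper: fraction of the letters in $text that are Cyrillic.
# $text must be a decoded character string, not raw bytes.
sub cyrillic_ratio {
    my ($text) = @_;
    my $letters  = () = $text =~ /\p{L}/g;
    my $cyrillic = () = $text =~ /(?=\p{L})\p{Cyrillic}/g;
    return $letters ? $cyrillic / $letters : 0;
}

# Hypothetical policy: flag a post when more than $threshold of its
# letters are Cyrillic (default 0.5, an arbitrary choice).
sub looks_russian {
    my ($text, $threshold) = @_;
    $threshold //= 0.5;
    return cyrillic_ratio($text) > $threshold ? 1 : 0;
}

# A Russian greeting written with escapes so this source stays ASCII.
my $russian = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442} \x{43C}\x{438}\x{440}";
my $english = "Hello world";

print looks_russian($russian) ? "blocked\n" : "allowed\n";
print looks_russian($english) ? "blocked\n" : "allowed\n";
```

Using a ratio rather than a single match avoids blocking an English post that merely quotes one Cyrillic word. The right threshold depends on the forum.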
Using the Unicode Cyrillic character set, as suggested by JG, is fine if everything is encoded that way. However, this is spam, and for the most part it is not. In addition, spammers very often mix charsets within a single message, which further undermines this approach.
I find that the best way of detecting Russian spam (or at least the preliminary step in the process) is to grep for the most commonly used charsets:
The next step would be to try some language detection algorithms on what remains. If the problem is big enough, use a paid service such as Google Translate (which also "detects" the language) or Xerox. In my opinion, these services provide the best language detection around.
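The charset check suggested in this answer might look something like the following Perl sketch. The answer's own list of charsets was not preserved, so the names in @RUSSIAN_CHARSETS are an assumption (common legacy Cyrillic encodings), and declares_russian_charset is a hypothetical helper name. It only inspects a declared charset label, such as one in a Content-Type header or a meta tag, and says nothing about the actual bytes.

```perl
use strict;
use warnings;

# Assumed list of legacy Cyrillic charset labels; adjust to what you see in practice.
my @RUSSIAN_CHARSETS = qw(koi8-r koi8-u windows-1251 cp1251 iso-8859-5 cp866 ibm866);

# Build one case-insensitive pattern that matches charset=<label>,
# with or without quotes around the label.
my $charset_re = do {
    my $alt = join '|', map { quotemeta($_) } @RUSSIAN_CHARSETS;
    qr/charset\s*=\s*["']?(?:$alt)\b/i;
};

# Hypothetical helper: true if the raw header or markup declares
# one of the charsets listed above.
sub declares_russian_charset {
    my ($raw) = @_;
    return $raw =~ $charset_re ? 1 : 0;
}

for my $header (
    'Content-Type: text/plain; charset="koi8-r"',
    'Content-Type: text/html; charset=UTF-8',
) {
    printf "%-45s %s\n", $header,
        declares_russian_charset($header) ? 'suspect' : 'ok';
}
```

A match here is only a preliminary signal, as the answer notes. Text that declares UTF-8 still needs a character-level or language-level check.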
For anyone landing here, there is a very decent Perl module, Lingua::Guess, for detecting the language of a string. On the command line, for example, you could use it like this:
would yield:
Language is french
would yield:
Language is japanese
In this example, -M loads the Perl module Lingua::Guess, and the module Encode (a core module) is used with its function decode_utf8 to decode the UTF-8 string into Perl's internal encoding.
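To show why the decoding step matters, here is a small Perl sketch that uses only the core Encode module. It does not call Lingua::Guess, and the sample bytes are an assumption chosen for illustration (the UTF-8 encoding of a four-letter Russian word). It compares the same data before and after decode_utf8.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Raw UTF-8 bytes for the four characters U+0441 U+043F U+0430 U+043C,
# as they might arrive from a form submission.
my $bytes = "\xD1\x81\xD0\xBF\xD0\xB0\xD0\xBC";

# Decode the bytes into a Perl character string.
my $chars = decode_utf8($bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Undecoded, each byte is treated as a separate Latin-1 code point,
# so the Cyrillic property does not match.
my $match_before = $bytes =~ /\p{Cyrillic}/ ? 1 : 0;

# Decoded, the string holds real Cyrillic code points.
my $match_after = $chars =~ /\p{Cyrillic}/ ? 1 : 0;

print "match before decoding: $match_before\n";
print "match after decoding:  $match_after\n";
```

The same applies to any character-level test: decode the input once at the boundary, then run regexes or language detection on the decoded string.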