How do I detect Arabic characters with a Perl regex?
I'm parsing some HTML pages and need to detect any Arabic characters in them.
I've tried various regexes, but no luck.
Does anyone know a working way to do that?
Thanks
Here is the page I'm processing: http://pastie.org/2509936
And my code is:
#!/usr/bin/perl
use LWP::UserAgent;
@MyAgent::ISA = qw(LWP::UserAgent);
# set inheritance
$ua = LWP::UserAgent->new;
$q = 'pastie.org/2509936';
$request = HTTP::Request->new('GET', $q);
$response = $ua->request($request);
if ($response->is_success) {
    if ($response->content =~ /[\p{Script=Arabic}]/g) {
        print "found arabic";
    } else {
        print "not found";
    }
}
Comments (3)
If you're using Perl, you should be able to match on the Unicode script property.
/\p{Arabic}/
If that doesn't work, you'll have to look up the range of Unicode characters for Arabic and test for them with a character class, something like this:
/[\x{0600}-\x{06FF}]/
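As a rough illustration of the two patterns above, here is a minimal sketch that applies both to strings that are already decoded character strings. The helper names has_arabic_script and has_arabic_block are hypothetical, and the samples are written with \x{...} escapes so the file encoding does not matter. U+0750 is included because it belongs to the Arabic script but lies outside the U+0600..U+06FF block, so the two tests are not interchangeable.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the (already decoded) string contains
# at least one character that Perl classifies as Arabic script.
sub has_arabic_script {
    my ($text) = @_;
    return $text =~ /\p{Arabic}/ ? 1 : 0;
}

# Hypothetical helper: true if the string contains a character in the
# basic Arabic block, U+0600 through U+06FF.
sub has_arabic_block {
    my ($text) = @_;
    return $text =~ /[\x{0600}-\x{06FF}]/ ? 1 : 0;
}

my %samples = (
    latin      => "hello",
    arabic     => "\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}",
    supplement => "\x{0750}",   # Arabic Supplement block, Arabic script
);

for my $name (sort keys %samples) {
    printf "%-10s script=%d block=%d\n",
        $name,
        has_arabic_script($samples{$name}),
        has_arabic_block($samples{$name});
}
```

The script property is usually the better fit for "any Arabic character", since the explicit range only covers one block.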
EDIT (as I have obviously wandered into tchrist's area of expertise): skip $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.

The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8. (In fairness, there are no hints on the page itself about what the encoding is, although the server does return the header Content-Type: text/html; charset=utf-8.)

You can see this if you examine the ordinal values of the characters in $response->content. If the largest value you get is less than 256, then you are reading the content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex.

Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it is encoded, and $response->decoded_content will already have decoded it correctly. In that case, an explicit decode call is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.

Just for the record, at least in .NET regexes, you need to use
\p{IsArabic}
.
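Back on the Perl side of the thread, the bytes-versus-characters diagnosis from the first comment can be sketched as follows. This is only an illustration under stated assumptions: $raw_bytes is a locally built stand-in for $response->content so that the sketch runs without a network request, and max_ord is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use List::Util qw(max);

# Assumption: a stand-in for $response->content, i.e. the UTF-8 bytes
# of a small HTML fragment containing Arabic letters.
my $raw_bytes = encode('UTF-8', "<p>\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}</p>");

# Hypothetical helper: the largest ordinal in a non-empty string.
# A result below 256 suggests raw bytes rather than decoded text.
sub max_ord {
    my ($string) = @_;
    return max(map { ord } split //, $string);
}

# Decode the bytes into a character string before matching.
my $text = decode('UTF-8', $raw_bytes);

printf "raw:     max ord %d, matches \\p{Arabic}: %s\n",
    max_ord($raw_bytes), ($raw_bytes =~ /\p{Arabic}/ ? 'yes' : 'no');
printf "decoded: max ord %d, matches \\p{Arabic}: %s\n",
    max_ord($text), ($text =~ /\p{Arabic}/ ? 'yes' : 'no');
```

With a real LWP response, calling $response->decoded_content should make the explicit decode step unnecessary, since it applies the charset from the Content-Type header.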