How do I detect Arabic characters using a Perl regular expression?

Posted on 2024-12-03 16:41:08

I'm parsing some HTML pages and need to detect any Arabic characters in them.
I've tried various regexes, but with no luck.

Does anyone know a working way to do that?

Thanks


Here is the page I'm processing: http://pastie.org/2509936

And my code is:

#!/usr/bin/perl 
use LWP::UserAgent; 
@MyAgent::ISA = qw(LWP::UserAgent); 

# set inheritance 
$ua = LWP::UserAgent->new; 
$q = 'pastie.org/2509936';; 
$request = HTTP::Request->new('GET', $q); 
$response = $ua->request($request); 
if ($response->is_success) { 
    if ($response->content=~/[\p{Script=Arabic}]/g) { 
        print "found arabic"; 
    } else { 
        print "not found"; 
    } 
}


Answers (3)

夜空下最亮的亮点 2024-12-10 16:41:08

If you're using Perl, you should be able to match on the Unicode script property: /\p{Arabic}/

If that doesn't work, you'll have to look up the range of Unicode code points for Arabic and test for them with a character class, something like /[\x{0600}-\x{06FF}]/ (the basic Arabic block).
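
As a rough illustration of the two patterns above, here is a minimal sketch. It assumes the string has already been decoded into Perl characters; the sample strings and the helper names has_arabic_script and has_arabic_block are made up for this example.

```perl
use strict;
use warnings;

# Sample strings built with \x{...} escapes so the source file stays ASCII.
my $arabic = "abc \x{0645}\x{0631}\x{062D}\x{0628}\x{0627}";  # "abc " plus five Arabic letters
my $latin  = "plain ASCII text";

# Hypothetical helper: true if the decoded string has an Arabic-script character.
sub has_arabic_script { return $_[0] =~ /\p{Arabic}/ ? 1 : 0 }

# Hypothetical helper: true if the string has a character in U+0600..U+06FF.
sub has_arabic_block { return $_[0] =~ /[\x{0600}-\x{06FF}]/ ? 1 : 0 }

for my $s ($arabic, $latin) {
    printf "script=%d block=%d\n", has_arabic_script($s), has_arabic_block($s);
}
```

The two tests are not equivalent: Arabic-script characters also live outside U+0600..U+06FF (for example in the Arabic Supplement block, U+0750..U+077F), so the property form is usually the more complete check.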

转瞬即逝 2024-12-10 16:41:08

EDIT (as I have obviously wandered into tchrist's area of expertise). Skip using $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.
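
To make the difference between content and decoded_content concrete, here is a small sketch that avoids the network. It assumes a hand-built HTTP::Response (with a made-up body and a charset=utf-8 header) is a fair stand-in for what $ua->request would return for the page in the question.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Assumption: this hand-built response stands in for the result of $ua->request(...).
my $body = encode('UTF-8', "<p>\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}</p>");
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    $body,
);

# content gives the raw UTF-8 bytes; decoded_content gives Perl characters.
my $raw_matches     = $response->content         =~ /\p{Arabic}/ ? 1 : 0;
my $decoded_matches = $response->decoded_content =~ /\p{Arabic}/ ? 1 : 0;

print "raw bytes match: $raw_matches\n";
print "decoded text match: $decoded_matches\n";
```

With the charset declared in the header, only the decoded text should match, which is the behaviour the question ran into.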


The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page itself about what the encoding is [update: the server does return the header Content-Type: text/html; charset=utf-8, though]).

You can see this if you examine $response->content:

use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";

If you get a value less than 256, then you are reading this content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
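
The two snippets above can be combined into one self-contained check. This is only a sketch: the byte string is produced with encode to imitate what $response->content would hold, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use List::Util qw(max);

# Assumption: $bytes imitates $response->content, i.e. UTF-8 encoded octets.
my $bytes = encode('UTF-8', "\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}");
my $chars = decode('UTF-8', $bytes);   # now a string of Unicode characters

# Largest code point in each string: below 256 means raw bytes.
my $max_bytes = max map { ord } split //, $bytes;
my $max_chars = max map { ord } split //, $chars;

my $bytes_match = $bytes =~ /\p{Arabic}/ ? 1 : 0;
my $chars_match = $chars =~ /\p{Arabic}/ ? 1 : 0;

print "bytes: max ord $max_bytes, matches Arabic: $bytes_match\n";
print "chars: max ord $max_chars, matches Arabic: $chars_match\n";
```

The undecoded string stays below 256 and does not match, while the decoded one contains code points in the Arabic range and does.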

Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it is decoded, and $response->content may already be decoded correctly. In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.

蓝天白云 2024-12-10 16:41:08

Just for the record, at least in .NET regexps, you need to use \p{IsArabic}.
