How to display HTML entities correctly in Perl

Posted on 2024-12-03 20:46:56

I am writing a web crawler in Perl, and I noticed some odd behavior when I display strings that have been passed through HTML::Entities::decode_entities.

I am handling strings that contain Chinese characters as well as strings like Jìngyè.
I use HTML::Entities::decode_entities to decode the Chinese characters, which works well. However, when a string contains no Chinese characters, it is displayed incorrectly (J�ngy�).

I wrote a small piece of code to test the different behavior on two strings.

String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 is "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".

Below is my code:

print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."&#34399;"); # I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";

These are my results:

before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466

decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)

chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)

before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號

decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)

chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)

Can someone please explain why this happens, and how to fix it so that my strings display properly?

Thank you very much.

Sorry, I did not make my question clear. Below is the code I wrote, where the URL is http://maps.google.com/maps/place?cid=10931902633578573013:

use LWP::UserAgent;
use HTML::Entities;

sub getInfoURLs {
    my ($url) = @_;
    unless (defined $url) {
        # Double quotes so that \n is printed as a newline
        print "URL was not defined when extracting info\n";
        return 0;
    }

    my $contain_request = LWP::UserAgent->new->get($url);
    if ($contain_request->is_success) {
        my $contain_content = $contain_request->decoded_content;

        # store address ($address_pattern is defined elsewhere)
        if ($contain_content =~ m/$address_pattern/i) {
            print "before: $1\n";
            my $decoded = HTML::Entities::decode_entities($1."&#34399;");
            print "decoded $decoded\n";
            my $chopped = substr($decoded, 0, -1);
            print "chopped: $chopped\n";
            # unicode conversion
            # store in database
        }
    }
}

Comments (1)

哎呦我呸! 2024-12-10 20:46:56

First, always use use strict; use warnings;!!!

The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.

Perl will output UTF-8 (-ish) when you pass something that's obviously wrong. chr(0x865F) is obviously not a byte, so:

$ perl -we'print "\xE8\x{865F}\n"'
Wide character in print at -e line 1.
è號

But it's not always obvious that something is wrong. chr(0xE8) could be a byte, so:

$ perl -we'print "\xE8\n"'
�
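
To make the difference between those two cases visible, here is a rough sketch that captures what print sends to a handle with no encoding layer and shows it as hex. The helper name bytes_printed is made up for this illustration, and the in-memory handle is only a stand-in for STDOUT.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes that print() sends to a handle
# that has no encoding layer, formatted as space-separated hex.
sub bytes_printed {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # silence "Wide character in print" for this demo
        print {$fh} $text;
    }
    close($fh);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

# Every character fits in one byte, so Perl writes the single byte E8,
# which is not valid UTF-8 on its own.
print bytes_printed("\xE8"), "\n";            # E8

# One character cannot be a byte, so Perl falls back to UTF-8 for the
# whole string.
print bytes_printed("\xE8\x{865F}"), "\n";    # C3 A8 E8 99 9F
```

A terminal that expects UTF-8 shows the lone E8 byte as a replacement character, while C3 A8 is displayed as è.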

The process of converting a value into a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.

Encode's encode is used to provide character encoding. You can also have encode called automatically using the open module.

$ perl -we'use open ":std", ":locale"; print "\xE8\x{865F}\n"'
è號

$ perl -we'use open ":std", ":locale"; print "\xE8\n"'
è
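
This also explains the results in the question. While 號 was appended, the string contained a character that cannot be a byte, so Perl fell back to UTF-8 and the output looked right. Once substr removed it, every remaining character fit in one byte, so ì and è were written as single bytes and the terminal could not display them. A minimal sketch of the explicit fix follows, assuming the terminal expects UTF-8. The string literal is a stand-in for what decode_entities would return for "J&#236;ngy&#232;".

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the result of decode_entities("J&#236;ngy&#232;"):
# decoded text whose characters all happen to fit in one byte.
my $decoded = "J\x{EC}ngy\x{E8}";

# Option 1: encode explicitly, then print the resulting bytes.
my $bytes = encode('UTF-8', $decoded);
print $bytes, "\n";

# Option 2: attach an encoding layer once and print decoded text directly.
# After this point, print only decoded text to STDOUT; printing $bytes
# again would encode it a second time.
binmode(STDOUT, ':encoding(UTF-8)');
print $decoded, "\n";
```

Both print statements should send the same bytes to the terminal. Pick one approach per handle and stick with it; use open ":std", ":locale" does the same as option 2 but takes the encoding from the locale instead of hard-coding UTF-8.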