How to correctly display HTML entities in Perl
I am writing a web crawler in Perl, and I noticed some strange behavior when I display strings that went through HTML::Entities::decode_entities.
I am handling strings that contain Chinese characters as well as strings like Jìngyè.
I use HTML::Entities::decode_entities to decode the Chinese characters, which works well. However, when the string contains no Chinese characters, it is displayed incorrectly (J�ngy�).
I wrote a small piece of code to test the behavior on two strings.
String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 is "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".
Here is my code:
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."號");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
These are my results:
before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466
decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)
chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)
before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號
decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)
chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)
Can someone please explain why this is happening, and how to fix it so that my strings display properly?
Thank you very much.
Sorry, I did not make my question clear. Below is the code I wrote, where the URL is http://maps.google.com/maps/place?cid=10931902633578573013:
use LWP::UserAgent;
use HTML::Entities;

# $address_pattern is defined elsewhere in the crawler (not shown here).
sub getInfoURLs {
    my ($url) = @_;
    unless (defined $url) {
        # Double quotes are needed so that \n is printed as a newline.
        print "URL was not defined when extracting info\n";
        return 0;
    }
    my $contain_request = LWP::UserAgent->new->get($url);
    if ($contain_request->is_success) {
        my $contain_content = $contain_request->decoded_content;
        # store address
        if ($contain_content =~ m/$address_pattern/i) {
            print "before: $1\n";
            my $decoded = HTML::Entities::decode_entities($1 . "號");
            print "decoded $decoded\n";
            my $chopped = substr($decoded, 0, -1);
            print "chopped: $chopped\n";
            # unicode conversion
            # store in database
        }
    }
}
Comments (1)
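First, always use
use strict; use warnings;
The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.
Perl will output UTF-8 (-ish) when you pass something that's obviously wrong. chr(0x865F) is obviously not a byte, so when a string contains it, Perl warns "Wide character in print" (with warnings enabled) and writes the whole string as UTF-8.
But it's not always obvious that something is wrong. chr(0xE8) could be a byte, so a string made only of such characters is written as raw bytes: è goes out as the single byte 0xE8, with no warning, and a terminal expecting UTF-8 shows it as �.
The process of converting a value into a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.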
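To make the idea of character encoding concrete, here is a minimal sketch that shows which bytes the same characters turn into under two encodings. The helper name bytes_hex is hypothetical and exists only for this illustration; Encode is a core Perl module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a decoded text string with the given
# encoding and return the resulting bytes as space-separated hex.
sub bytes_hex {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# U+00E8 (e with grave) fits in one byte in Latin-1 but needs two in UTF-8.
print 'U+00E8 as ISO-8859-1: ', bytes_hex('ISO-8859-1', chr(0xE8)), "\n";
print 'U+00E8 as UTF-8:      ', bytes_hex('UTF-8', chr(0xE8)), "\n";

# U+865F cannot be a single byte in any encoding; UTF-8 uses three.
print 'U+865F as UTF-8:      ', bytes_hex('UTF-8', chr(0x865F)), "\n";
```

A lone 0xE8 byte is not a complete UTF-8 sequence, which is why a UTF-8 terminal renders it as a replacement character when the string is printed without encoding.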
Encode's encode is used to provide character encoding. You can also have encode called automatically using the open module.
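As a rough sketch of both approaches, the code below encodes a string explicitly with encode and, separately, lets an encoding layer on the file handle do the same work. The string is a stand-in for what decode_entities would return, the sub names are hypothetical, and UTF-8 is assumed as the output encoding. An in-memory file is used only so the bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for decoded text with no character above 0xFF
# (U+00EC is i with grave, U+00E8 is e with grave).
my $text = "J\x{EC}ngy\x{E8}";

# Option 1: encode explicitly and print the resulting bytes.
sub as_utf8_bytes {
    my ($str) = @_;
    return encode('UTF-8', $str);
}

# Option 2: attach an encoding layer so that print encodes automatically.
# The output goes to an in-memory buffer and is returned to the caller.
sub print_through_layer {
    my ($str) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
    print {$fh} $str;
    close($fh) or die "close failed: $!";
    return $buffer;
}

print as_utf8_bytes($text), "\n";
print "same bytes either way\n"
    if as_utf8_bytes($text) eq print_through_layer($text);
```

For STDOUT, the same kind of layer can be applied with binmode(STDOUT, ':encoding(UTF-8)') or with the open pragma, for example use open ':std', ':encoding(UTF-8)';. Do not combine a layer with an explicit encode on the same handle, or the text will be encoded twice.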