Why does Perl's LWP give me a different encoding than the original website?
Let's say I have this code:
use strict;
use LWP qw ( get );
my $content = get ( "http://www.msn.co.il" );
print STDERR $content;
The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?
The website declares its encoding with
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">
so why do these characters appear and not the windows-1255 chars?
Another weird thing is that I have two servers: the first server returns CP1255 chars, which I can simply convert to utf8, while the current server gives me these chars and I can't do anything with them...
Is there any configuration file in apache/perl/module that is messing up the encoding, forcing something?
The result on my website on the second server is that the Perl file and the headers are all utf8, so when I write text that isn't English, the content from the example above shows OK (even though it's weird utf chars), but my own static text looks like "×ס'××ר××:"
One more thing that I tested:
Through Perl:
my $content = `curl "http://www.anglo-saxon.co.il"`;
I get utf8 encoding.
Through Bash:
curl "http://www.anglo-saxon.co.il"
and here I get CP1255 (Windows-1255) encoding...
Also, when I run the script in bash, it gives CP1255, and when I run it through the web, it's utf8 again...
I fixed the problem by converting the content from utf8 to what it is supposed to be, and then back to utf8:
use Text::Iconv;
my $converter = Text::Iconv->new("utf8", "CP1255");
$content=$converter->convert($content);
my $converter = Text::Iconv->new("CP1255", "utf8");
$content=$converter->convert($content);
All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.
Anyway, since the server does return the correct encoding, decoding the response according to the charset the server sends works. After that, $content is a Perl character string, ready to do whatever you need. If you want to convert it to some other encoding, then calling Encode::encode on it is appropriate; do not use Encode::decode, as it's already been decoded once.
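A minimal sketch of the approach the first answer describes, assuming LWP::UserAgent is installed. The helper names fetch_chars and chars_to_bytes are hypothetical and not taken from the original answer. decoded_content is the HTTP::Response method that undoes any Content-Encoding and decodes the body using the charset from the Content-Type header, so the result is a character string rather than raw bytes.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Encode qw(encode);

# Hypothetical helper: fetch a URL and return its body as a Perl
# character string, decoded according to the charset the server sends.
sub fetch_chars {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    die "Request failed: ", $response->status_line, "\n"
        unless $response->is_success;
    my $chars = $response->decoded_content;
    die "Could not decode the response body\n" unless defined $chars;
    return $chars;
}

# Hypothetical helper: turn a character string into bytes in the
# encoding you actually want to emit (encode, never decode again).
sub chars_to_bytes {
    my ($encoding, $chars) = @_;
    return encode($encoding, $chars);
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $content = fetch_chars($ARGV[0]);
    binmode(STDOUT, ':raw');                    # we print bytes ourselves
    print chars_to_bytes('UTF-8', $content);
}
```

Run with a URL as the first argument to print the page re-encoded as UTF-8; passing 'cp1255' to chars_to_bytes instead would produce Windows-1255 bytes from the same character string.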
http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.
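To check the claim that the logged bytes are well-formed UTF-8, a short sketch using only the core Encode module can decode them strictly and list the resulting code points. The variable names are illustrative; the byte string is the one quoted in the question.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# The bytes quoted in the question, as they appeared in the error log.
my $bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# FB_CROAK makes decode() die on malformed input instead of silently
# substituting characters; LEAVE_SRC keeps $bytes from being consumed.
my $chars = decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);

# Show each decoded character as a Unicode code point.
my @code_points = map { sprintf('U+%04X', ord($_)) } split(//, $chars);
printf "%d bytes decode to %d characters: %s\n",
    length($bytes), length($chars), join(' ', @code_points);
```

Every code point printed falls in the Hebrew block (U+0590 to U+05FF), which is consistent with the word quoted in the answer. A UTF-16 interpretation would pair the bytes differently and would not yield Hebrew letters.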
First, note that you should import get from LWP::Simple. Second, everything works fine for me, which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
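To illustrate how the encoding layer on a filehandle decides which bytes come out, here is a small sketch that writes the same character string through two in-memory handles. The helper name bytes_written and the sample string are assumptions made for the example; on a real handle the equivalent step would be something like binmode(STDERR, ':encoding(UTF-8)').

```perl
use strict;
use warnings;

# Sample data (an assumption for this sketch): the six Hebrew characters
# from the question, written as code points so the source stays ASCII.
my $chars = "\x{5DC}\x{5D4}\x{5D3}\x{5E4}\x{5E1}\x{5D4}";

# Hypothetical helper: print $text through an in-memory filehandle that
# has the given PerlIO layer, and return the bytes that were written.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $as_utf8   = bytes_written(':encoding(UTF-8)',  $chars);
my $as_cp1255 = bytes_written(':encoding(cp1255)', $chars);

# Dump both results as hex so the difference is visible.
printf "UTF-8 layer:  %s\n",
    join(' ', map { sprintf('%02x', ord($_)) } split(//, $as_utf8));
printf "cp1255 layer: %s\n",
    join(' ', map { sprintf('%02x', ord($_)) } split(//, $as_cp1255));
```

The UTF-8 layer reproduces the twelve-byte sequence from the question, while the cp1255 layer writes one byte per letter. The same string can therefore look different in a shell and in a web server log if the two handles are configured differently.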
The string with the hex values that you gave appears to be UTF-8 encoded. You are getting this because Perl "likes to" use UTF-8 when it deals with strings. LWP::Simple's get() function automatically decodes the content from the server, which includes undoing any Content-Encoding as well as converting to UTF-8.
You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like Encode::encode.
The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
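The "bytes are not characters" advice can be sketched with the core Encode module: decode every input using its own encoding, join the resulting character strings, and encode once on output. The two byte strings below are sample data assumed for the example (the same Hebrew word in UTF-8 and in CP1255), not output captured from the asker's servers.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Sample inputs (assumptions): the same word as bytes in two encodings.
my $utf8_bytes   = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";
my $cp1255_bytes = "\xec\xe4\xe3\xf4\xf1\xe4";

# Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Wrong: gluing byte strings in different encodings into one stream.
my $mixed = $utf8_bytes . $cp1255_bytes;
my $mixed_is_valid_utf8 = is_valid_utf8($mixed);

# Right: decode each input with its own encoding, concatenate the
# character strings, then encode once for the output stream.
my $chars = decode('UTF-8',  $utf8_bytes,   FB_CROAK | LEAVE_SRC)
          . decode('cp1255', $cp1255_bytes, FB_CROAK | LEAVE_SRC);
my $clean = encode('UTF-8', $chars);
my $clean_is_valid_utf8 = is_valid_utf8($clean);

print "mixed stream valid UTF-8: $mixed_is_valid_utf8\n";
print "clean stream valid UTF-8: $clean_is_valid_utf8\n";
```

The mixed stream fails strict UTF-8 decoding because the CP1255 bytes are not valid UTF-8 sequences, which is the kind of stream a browser renders as partly readable and partly garbage. The clean stream carries the word twice, and both copies decode identically.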