Why does Perl's LWP give me a different encoding than the original website?

Posted 2024-08-23 01:28:58, 1298 words, 12 views, 0 comments

Let's say I have this code:

use strict;
use LWP qw ( get );

my $content = get ( "http://www.msn.co.il" );

print STDERR $content;

The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?

The website declares its encoding with

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">

So why do these characters appear instead of windows-1255 characters?

Another weird thing is that I have two servers:

The first server returns CP1255 characters, which I can simply convert to UTF-8, while the current server gives me these characters and I can't do anything with them.

Is there a configuration file in Apache, Perl, or some module that is messing up the encoding, or forcing something?

The result on my website on the second server is that the Perl file and the headers are all UTF-8. So when I write text with non-English characters, the content from the example above displays fine (even though it's weird UTF characters), but my own static text looks like "×ס'××ר××:"

One more thing that I tested:

Through Perl:

my $content = `curl "http://www.anglo-saxon.co.il"`;    

I get UTF-8 encoding.

Through Bash:

curl "http://www.anglo-saxon.co.il"

Here I get CP1255 (Windows-1255) encoding.

Also, when I run the script in Bash it gives CP1255, and when I run it through the web it's UTF-8 again.

I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:

use Text::Iconv;

# Convert from UTF-8 to CP1255 ...
my $converter = Text::Iconv->new("utf8", "CP1255");
$content = $converter->convert($content);

# ... and then back from CP1255 to UTF-8.
$converter = Text::Iconv->new("CP1255", "utf8");
$content = $converter->convert($content);

Comments (4)

已下线请稍等 2024-08-30 01:28:58

All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.

Anyway, since the server does return the correct encoding, this works:

use LWP::UserAgent;

my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $response->decoded_content;

$content is now a Perl character string, ready to do whatever you need. If you want to convert it to some other encoding, then calling Encode::encode on it is appropriate; do not use Encode::decode, as it has already been decoded once.
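To make the difference between raw bytes and decoded characters concrete, here is a minimal offline sketch. It does not contact the site: it builds an HTTP::Response by hand as a stand-in for what LWP::UserAgent's get would return, and the Content-Type header and body used here are assumptions chosen to mirror the bytes quoted in the question.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hypothetical stand-in for a fetched response: the header declares UTF-8
# and the body carries the UTF-8 bytes quoted in the question.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94",
);

my $raw     = $response->content;          # undecoded bytes, as sent
my $content = $response->decoded_content;  # Perl character string

printf "raw: %d bytes, decoded: %d characters\n",
    length($raw), length($content);

# Encode the characters only when some other byte encoding is needed.
my $cp1255 = encode('cp1255', $content);
printf "cp1255 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $cp1255;
```

The decoded string is what you work with inside the program; the encode call belongs at the point where the data leaves it.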

安人多梦 2024-08-30 01:28:58

http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
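One way to check this claim yourself is to decode the quoted bytes strictly as UTF-8 and look at the resulting code points. The sketch below uses only the Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The bytes quoted in the question.
my $bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# FB_CROAK makes decode die on malformed input instead of silently
# substituting replacement characters; LEAVE_SRC keeps $bytes intact.
my $text = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "%d bytes decode to %d characters\n", length($bytes), length($text);
printf "U+%04X\n", ord($_) for split //, $text;

# The same six letters as Windows-1255 would be one byte each.
my $as_cp1255 = encode('cp1255', $text);
printf "cp1255 length: %d bytes\n", length($as_cp1255);
```

Each pair of bytes maps to one code point in the Hebrew block (U+05D0 to U+05EA), which is what UTF-8 looks like for Hebrew text; UTF-16 or Windows-1255 data would not decode cleanly this way.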

I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.

匿名。 2024-08-30 01:28:58

First, note that you should import get from LWP::Simple. Second, everything works fine with:

#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';

which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
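As a rough illustration of what "the encoding of the filehandle" means, the sketch below prints a decoded string through a handle that has an explicit encoding layer. An in-memory handle stands in for STDERR or a log file so the resulting bytes can be inspected; that substitution is an assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A decoded Perl character string (six Hebrew letters).
my $text = decode('UTF-8', "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94");

# In-memory handle standing in for a real output stream. The
# :encoding(UTF-8) layer turns characters into UTF-8 bytes on output.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out or die "Cannot close in-memory handle: $!";

printf "%d characters were written as %d bytes\n",
    length($text), length($buffer);
```

For a real handle the equivalent would be something like binmode(STDERR, ':encoding(UTF-8)') before printing, so that the layer, not the data, decides which bytes reach the log.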

世态炎凉 2024-08-30 01:28:58

The string with the hex values that you gave appears to be UTF-8 encoded. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. The get function from LWP::Simple automatically decodes the content from the server, which includes undoing any Content-Encoding as well as converting to UTF-8.

You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like

use Encode qw(decode encode);
...;
# Decode the UTF-8 bytes to characters, then encode those as CP1255 bytes.
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));

The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
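The sketch below shows the problem and the fix in miniature. The two input strings are made-up stand-ins: one for content fetched as UTF-8 bytes, one for static text saved as CP1255 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: the same two Hebrew letters in two byte encodings.
my $fetched_utf8  = "\xd7\x9c\xd7\x94";   # UTF-8 bytes
my $static_cp1255 = "\xec\xe4";           # CP1255 bytes

# Wrong: gluing raw bytes together puts two encodings into one stream.
my $mixed = $fetched_utf8 . $static_cp1255;
my $mixed_is_valid_utf8 = eval {
    decode('UTF-8', $mixed, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} ? 1 : 0;

# Right: decode each piece from its own encoding, join the characters,
# then encode exactly once for the stream's declared encoding.
my $page_text = decode('UTF-8', $fetched_utf8)
              . decode('cp1255', $static_cp1255);
my $page_utf8 = encode('UTF-8', $page_text);

printf "mixed stream valid UTF-8: %s\n", $mixed_is_valid_utf8 ? 'yes' : 'no';
printf "clean stream: %d characters, %d bytes\n",
    length($page_text), length($page_utf8);
```

Keeping everything as characters inside the program and encoding only at the output boundary is what makes the stream's label and its contents agree.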
