Why does Perl's LWP give me a different encoding than the original website?

Posted 2024-08-23 01:28:58

Let's say I have this code:

use strict;
use LWP qw ( get );

my $content = get ( "http://www.msn.co.il" );

print STDERR $content;

The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94",
which I'm guessing is UTF-16?

The website declares its encoding with

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">

So why do these characters appear instead of the windows-1255 characters?

Another weird thing is that I have two servers:

the first server returns CP1255 characters, which I can simply convert to UTF-8,
while the current server gives me these characters and I can't do anything with them.

Is there any configuration file in Apache, Perl, or a module that is messing up the encoding,
forcing something?

The result on my website on the second server is that the Perl file and the headers are all UTF-8. When I output text that isn't in English characters, the content from the example above shows up fine (even though it is those weird UTF characters), but my own static text looks like "××¡'×××¨××:"

One more thing that I tested:

Through Perl:

my $content = `curl "http://www.anglo-saxon.co.il"`;

I get UTF-8 encoding.

Through Bash:

curl "http://www.anglo-saxon.co.il"

and here I get CP1255 (Windows-1255) encoding.

Also, when I run the script in bash it gives CP1255, and when I run it through the web it's UTF-8 again.

I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:

use Text::Iconv;

my $converter = Text::Iconv->new("utf8", "CP1255");
$content = $converter->convert($content);

# Reuse the variable; declaring it with "my" a second time masks the first declaration.
$converter = Text::Iconv->new("CP1255", "utf8");
$content = $converter->convert($content);

已下线请稍等 2024-08-30 01:28:58

All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.

Anyway, since the server does return the correct encoding, this works:

use LWP::UserAgent;
my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $response->decoded_content;

$content is now a Perl character string, ready for whatever you need. If you want to convert it to some other encoding, then calling Encode::encode on it is appropriate; do not use Encode::decode, as it has already been decoded once.
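To illustrate the point about the header taking precedence, here is a minimal offline sketch. It does not contact msn.co.il: the response is built by hand with HTTP::Response, and the body and header values are assumptions chosen to mimic the situation described (a META tag claiming windows-1255 while the Content-Type header says UTF-8).

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical page body: raw UTF-8 bytes behind a META tag that claims windows-1255.
my $body = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1255">'
         . "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Hand-built response standing in for what LWP::UserAgent->get would return.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body,
);

# decoded_content decodes using the charset from the Content-Type header.
my $content = $response->decoded_content;

printf "raw: %d bytes, decoded: %d characters\n",
    length($response->content), length($content);
```

The decoded string should be shorter than the raw body, because each Hebrew letter occupies two bytes in UTF-8 but counts as one character once decoded.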

安人多梦 2024-08-30 01:28:58

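http://www.msn.co.il is in UTF-8 and indicates so correctly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also valid UTF-8 (להדפסה). I don't see the problem.

I think your second problem comes from mixing different encodings (UTF-8 and Windows-1252). You probably want to encode and decode your strings properly.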

匿名。 2024-08-30 01:28:58

First, note that you should import get from LWP::Simple. Second, everything works fine with:

#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';

which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
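As a rough illustration of how the filehandle's encoding layer decides which bytes come out, the sketch below writes the same character string through two different layers. It uses in-memory filehandles so it needs no network or files; the helper name write_with_layer is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six Hebrew letters from the question, as a decoded character string.
my $chars = decode('UTF-8', "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94");

# Hypothetical helper: print $text through the given PerlIO layer into an
# in-memory buffer and return the bytes that were written.
sub write_with_layer {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

my $utf8_out   = write_with_layer(':encoding(UTF-8)',  $chars);
my $cp1255_out = write_with_layer(':encoding(cp1255)', $chars);

printf "UTF-8 layer wrote %d bytes, cp1255 layer wrote %d bytes\n",
    length($utf8_out), length($cp1255_out);
```

The same idea applies to STDOUT or STDERR: calling binmode on the handle with the layer that matches what the consumer expects keeps the output consistent.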

世态炎凉 2024-08-30 01:28:58

The string with the hex values that you gave appears to be UTF-8 encoded. You are getting this because Perl 'likes to' use UTF-8 when it deals with strings. The get() function from LWP::Simple automatically decodes the content from the server, which includes undoing any Content-Encoding as well as converting to UTF-8.

You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like

use Encode;
...;
# Decode the UTF-8 bytes into characters, then encode those characters as CP1255 bytes.
$cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));

The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
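A small sketch of how such garbage can arise, assuming the consumer treats UTF-8 bytes as Windows-1252 (an assumption; the run of "×" characters in the question is consistent with that, since the byte 0xD7 is "×" in Windows-1252). It decodes the same bytes with the right and the wrong label and compares the results.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes from the question's error log.
my $utf8_bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Correct label: pairs of bytes become single Hebrew letters.
my $good = decode('UTF-8', $utf8_bytes);

# Wrong label: every byte becomes its own character, starting with U+00D7 (the multiplication sign).
my $bad = decode('cp1252', $utf8_bytes);

printf "decoded as UTF-8:  %d characters\n", length($good);
printf "decoded as cp1252: %d characters, first is U+%04X\n",
    length($bad), ord($bad);
```

Bytes are only turned into the intended characters when the decoder is told the encoding they were actually written in.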
