How to find "wide characters" printed by perl?

Posted on 2024-09-12 04:53:48

A Perl script that scrapes static HTML pages from a website and writes them to individual files appears to work, but it also prints many instances of "Wide character in print at ./script.pl line n" to the console, one for each page scraped.

However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?

The relevant code:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new;   
...
foreach (@urls) {
    $mech->get($_); 
    print FILE $mech->content;  #MESSAGE REFERS TO THIS LINE
...

This is on OS X with Perl 5.8.8.
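To locate the characters that trigger the warning, one option is to scan the fetched text for anything above code point 0xFF before printing it. The following is only a minimal sketch: find_wide_chars is a hypothetical helper, and the $page string is a stand-in for whatever $mech->content returns.

```perl
use strict;
use warnings;

# Hypothetical helper: report every character above 0xFF. These are the
# characters that cannot be written as a single byte, so printing them to
# a handle without an encoding layer produces "Wide character in print".
sub find_wide_chars {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([^\x00-\xFF])/g) {
        # pos() is the offset just past the match, so subtract one.
        push @hits, sprintf("U+%04X at offset %d", ord($1), pos($text) - 1);
    }
    return @hits;
}

# Stand-in for $mech->content (an assumption, not the real page):
# a fragment containing a right single quotation mark, U+2019.
my $page = "<p>It\x{2019}s fine</p>";
print "$_\n" for find_wide_chars($page);   # prints "U+2019 at offset 5"
```

Typographic quotes, dashes and non-Latin text are the usual suspects, which would explain why the saved files look fine at a glance.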

Comments (2)

拥醉 2024-09-19 04:53:48

If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).

In future, you could use $mech->response->decoded_content, which should give you a decoded Perl character string regardless of what encoding the web server used. Then you would call binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to UTF-8 bytes on output.
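As a rough illustration of the decode-then-encode flow described here, the sketch below uses only core modules. The CP1252 byte string and the explicit decode() call are assumptions standing in for what $mech->response->decoded_content would do with a real response, and a temporary file takes the place of FILE.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Assumed input: bytes as a server might send them in CP1252.
# decode() turns them into a Perl character string, which is the role
# decoded_content plays for a real HTTP response.
my $raw_bytes = "caf\xE9 \x93quoted\x94";
my $chars     = decode('cp1252', $raw_bytes);

# Temporary file standing in for the FILE handle from the question.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(UTF-8)');   # encode characters to UTF-8 on output
print {$fh} $chars;                 # no "Wide character" warning here
close($fh) or die "close failed: $!";

# Read the file back as raw bytes to see what was actually written.
open(my $in, '<:raw', $path) or die "open failed: $!";
my $written = do { local $/; <$in> };
close($in);

# prints "13 characters became 18 bytes"
printf "%d characters became %d bytes\n", length($chars), length($written);
```

The point of the layer is that the program deals in characters throughout and the conversion to bytes happens in exactly one place, at the file handle.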

↘紸啶 2024-09-19 04:53:48

I assume you're crawling images or something of that sort. Either way, you can get around the problem by adding binmode(FILE); or, if they are web pages and UTF-8, try binmode(FILE, ':utf8'). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.

The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.

To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding.
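To see how the layers mentioned above behave on output, here is a small comparison sketch. write_with_layer is a hypothetical helper, and temporary files stand in for FILE. It writes the same string through each layer, then reports the number of bytes on disk and how many warnings the print raised.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given I/O layer, then
# return the raw bytes found in the file and the number of warnings
# raised while writing.
sub write_with_layer {
    my ($layer, $text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, $layer);
    print {$fh} $text;
    close($fh) or die "close failed: $!";

    open(my $in, '<:raw', $path) or die "open failed: $!";
    my $bytes = do { local $/; <$in> };
    close($in);
    return ($bytes, scalar @warnings);
}

# Assumed sample text containing one wide character (U+2603).
my $text = "snowman \x{2603}";
for my $layer (':raw', ':utf8', ':encoding(UTF-8)') {
    my ($bytes, $warned) = write_with_layer($layer, $text);
    printf "%-18s %2d bytes, %d warning(s)\n", $layer, length($bytes), $warned;
}
```

On a typical Perl all three runs should put the same UTF-8 bytes on disk, with only the ':raw' run raising "Wide character in print". That would also explain why the files in the question look correct despite the warnings.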
