How to find the "wide character" printed by Perl?
A Perl script that scrapes static HTML pages from a website and writes them to individual files appears to work, but it also prints many instances of "Wide character in print at ./script.pl line n" to the console: one for each page scraped.
However, a brief glance at the generated HTML files does not reveal any obvious mistakes in the scraping. How can I find and fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (@urls) {
    $mech->get($_);
    print FILE $mech->content;  # MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
Comments (2)
If you want to fix up the files after the fact, you could pipe them through fix_latin, which will make sure they are all UTF-8 (assuming the input is already some mixture of ASCII, Latin-1, CP1252 or UTF-8).
For the future, you could use
$mech->response->decoded_content
which should give you a decoded character string regardless of what encoding the web server used. Then you would call binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to UTF-8 bytes on output.

I assume you're crawling images or something of that sort. In that case you can get around the problem by adding binmode(FILE); or, if they are web pages and UTF-8, try binmode(FILE, ':utf8'). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.
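
As a rough illustration of the advice above, the sketch below shows one way to locate the characters that trigger the warning and to write them out without it. It uses only core Perl. The helper names find_wide_chars and write_utf8, the file name page.html, and the sample string (a stand-in for the page content) are all made up for this example. It also opens the file with the ':encoding(UTF-8)' layer rather than calling binmode with ':utf8'; that is a substitution on my part, not something either answer prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report every character whose code point is above 255.
# These are the "wide" characters that cannot be written as a single byte,
# which is what the "Wide character in print" warning complains about.
sub find_wide_chars {
    my ($text) = @_;
    my @found;
    while ($text =~ /([^\x{00}-\x{FF}])/g) {
        # pos() is the offset just past the match, so subtract one.
        push @found, sprintf('offset %d: U+%04X', pos($text) - 1, ord $1);
    }
    return @found;
}

# Hypothetical helper: write a character string to a file as UTF-8 bytes.
# The encoding layer on the handle is what prevents the warning.
sub write_utf8 {
    my ($path, $text) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
}

# Stand-in for scraped page content (an assumption, not real scraped data).
# \x{2019} is a right single quotation mark, \x{20AC} is the euro sign.
my $page = "<p>It\x{2019}s 20\x{20AC}</p>";

print "$_\n" for find_wide_chars($page);
write_utf8('page.html', $page);
```

If the list printed by find_wide_chars contains only things like curly quotes, dashes, or currency signs, the pages are most likely fine and only the output layer was missing.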