Website thumbnail screenshots with Perl :: Mechanize
I use WWW::Mechanize::Firefox to control a Firefox instance and dump the rendered page with $mech->content_as_png.
New update (see the end of the initial posting): thanks to user1126070 we have a new solution, which I want to try out later today [right now I am at the office and not at home in front of the machine with the program].
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 5*60 } } );
I am trying out the version that puts the links into @list and uses eval, doing the following:
while (scalar(@list)) {
    my $link = pop(@list);
    print "trying $link\n";
    eval {
        $mech->get($link);
        sleep(5);
        my $png  = $mech->content_as_png();
        my $name = $link;               # use the current link; $_ is not set in this loop
        $name =~ s/^www\.//;
        $name .= ".png";
        open(my $out, '>', $name) or die "cannot write $name: $!";
        binmode $out;                   # PNG is binary data
        print {$out} $png;
        close($out);
    };                                  # the eval block needs a terminating semicolon
    if ($@) {
        print "link: $link failed\n";
        push(@list, $link);             # put it back at the end of the list
        next;
    }
    print "$link is done!\n";
}
BTW: user1126070, what about trimming the images down to thumbnail size? Should I use Imager here? Can you suggest a solution for this? That would be great.
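For reference, here is a minimal sketch of how Imager (from CPAN, built with PNG support) could do the trimming, assuming $png holds the raw bytes returned by content_as_png() and $name is the target file name from the loop above:

use Imager;

# load the PNG bytes returned by content_as_png()
my $img = Imager->new;
$img->read(data => $png, type => 'png')
    or die "cannot read PNG data: ", $img->errstr;

# scale so the image fits into a 240x240 box, i.e. the long
# dimension becomes at most 240 pixels while keeping the aspect ratio
my $thumb = $img->scale(xpixels => 240, ypixels => 240, type => 'min')
    or die "cannot scale image: ", $img->errstr;

$thumb->write(file => $name, type => 'png')
    or die "cannot write $name: ", $thumb->errstr;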
End of update.
Here the problem outline continues, as written at the very beginning of this Q&A.
Problem outline: I have a list of 2,500 websites and need to grab a thumbnail screenshot of each of them. How do I do that? I could try to parse the sites with Perl; Mechanize would be a good thing. Note: I only need the results as thumbnails that are at most 240 pixels in the long dimension. At the moment I have a solution which is slow and does not produce thumbnails: how can I make the script run faster, with less overhead, while spitting out thumbnails?
But I have to be aware that setting it up can pose quite a challenge.
If all works as expected, you can simply use a script like this to dump images of the desired websites, but you should start Firefox and resize it to the desired width manually (height doesn't matter; WWW::Mechanize::Firefox always dumps the whole page).
What I have done so far is a lot: I work with mozrepl. At the moment I am struggling with timeouts: is there a way to specify the Net::Telnet timeout with WWW::Mechanize::Firefox? My internet connection is currently very slow, and sometimes I get this error with $mech->get():
command timed-out at /usr/local/share/perl/5.12.3/MozRepl/Client.pm line 186
See this one:
> $mech->repl->repl->timeout(100000);
Unfortunately it does not work: Can't locate object method "timeout" via package "MozRepl". The documentation says it should be:
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 180 } } );
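For what it is worth, a sketch of where that call might go, simply reusing the line quoted above; that it has to happen right after constructing the object and before the first get() is my assumption, not something taken from the docs:

use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

# raise the underlying Net::Telnet timeout to 3 minutes
# (the setup_client line is the one quoted from the documentation above)
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 180 } } );

$mech->get('http://www.google.com');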
What I have tried already; here it is:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open(my $input, '<', 'urls.txt') or die $!;
while (<$input>) {
    chomp;
    print "$_\n";
    $mech->get($_);
    my $png  = $mech->content_as_png();
    my $name = $_;
    $name =~ s/^www\.//;
    $name .= ".png";
    open(my $out, '>', $name) or die "cannot write $name: $!";
    binmode $out;                       # PNG is binary data
    print {$out} $png;
    close($out);
    sleep(5);
}
close($input);
Well, this does not care about the size. See the command-line output:
linux-vi17:/home/martin/perl # perl mecha_test_1.pl
www.google.com
www.cnn.com
www.msnbc.com
command timed-out at /usr/lib/perl5/site_perl/5.12.3/MozRepl/Client.pm line 186
linux-vi17:/home/martin/perl #
And here is my source: see a snippet example of the sites I have in the URL list.
urls.txt - the list of sources:
www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com and so on...
BTW: With that many URLs we have to expect that some will fail, and we have to handle that: for example, we put the failed ones in an array or hash and retry them X times, as sketched below.
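A sketch of one way to cap the retries (the %tries hash and the limit of three attempts are illustrative choices of mine, not part of the code above):

my %tries;            # attempts per link
my $max_tries = 3;    # give up on a link after this many failures

while (scalar(@list)) {
    my $link = pop(@list);
    next if ++$tries{$link} > $max_tries;   # skip links that failed too often
    eval {
        $mech->get($link);
        # ... dump the PNG exactly as in the loop above ...
    };
    if ($@) {
        print "link: $link failed (attempt $tries{$link})\n";
        push(@list, $link);                 # re-queue it for another try
        next;
    }
    print "$link is done!\n";
}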
UTSL
Well, how about this one here...
sub content_as_png {
    my ($self, $tab, $rect) = @_;
    $tab ||= $self->tab;
    $rect ||= {};

    # Mostly taken from
    # http://wiki.github.com/bard/mozrepl/interactor-screenshot-server
    my $screenshot = $self->repl->declare(<<'JS');
function (tab,rect) {
    var browser = tab.linkedBrowser;
    var browserWindow = Components.classes['@mozilla.org/appshell/window-mediator;1']
        .getService(Components.interfaces.nsIWindowMediator)
        .getMostRecentWindow('navigator:browser');
    var win = browser.contentWindow;
    var body = win.document.body;
    if(!body) {
        return;
    };
    var canvas = browserWindow
        .document
        .createElementNS('http://www.w3.org/1999/xhtml', 'canvas');
    var left = rect.left || 0;
    var top = rect.top || 0;
    var width = rect.width || body.clientWidth;
    var height = rect.height || body.clientHeight;
    canvas.width = width;
    canvas.height = height;
    var ctx = canvas.getContext('2d');
    ctx.clearRect(0, 0, width, height);
    ctx.save();
    ctx.scale(1.0, 1.0);
    ctx.drawWindow(win, left, top, width, height, 'rgb(255,255,255)');
    ctx.restore();
    //return atob(
    return canvas
        .toDataURL('image/png', '')
        .split(',')[1]
    // );
}
JS
    my $scr = $screenshot->($tab, $rect);
    return $scr ? decode_base64($scr) : undef
};
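Judging from the signature quoted above (my ($self, $tab, $rect) = @_, with $tab ||= $self->tab), a coordinates hashref can apparently be passed to restrict the captured area. The following is only a sketch based on that reading of the source, not a call I have verified against the documentation:

# capture only a 1024x768 area starting at the top-left corner;
# undef for the tab argument lets the method fall back to $self->tab,
# exactly as the quoted source does
my $png = $mech->content_as_png(undef, {
    left   => 0,
    top    => 0,
    width  => 1024,
    height => 768,
});

open(my $out, '>', 'capture.png') or die $!;
binmode $out;
print {$out} $png;
close($out);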
I would love to hear from you!
Greetings, zero
Comments (1)
Have you tried this out? Is it working?
Put the links into @list and use eval.