Website thumbnail screenshots with Perl :: Mechanize
I use WWW::Mechanize::Firefox to control a Firefox instance and dump the rendered page with $mech->content_as_png.
New update (see the end of the initial posting): thanks to user1126070 we have a new solution, which I want to try out later today [right now I am at the office and not at home in front of the machine with the program].
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 5*60 } } );
I am trying out the version that puts the links into @list and uses eval, doing the following:
while (scalar(@list)) {
    my $link = pop(@list);
    print "trying $link\n";
    eval {
        $mech->get($link);
        sleep(5);
        my $png  = $mech->content_as_png();
        my $name = $link;               # use the current link; $_ is not set in this loop
        $name =~ s/^www\.//;
        $name .= ".png";
        open(my $out, '>', $name) or die "cannot write $name: $!";
        binmode $out;                   # PNG is binary data
        print {$out} $png;
        close($out);
    };                                  # the eval block needs a terminating semicolon
    if ($@) {
        print "link: $link failed\n";
        push(@list, $link);             # put it back at the end of the list
        next;
    }
    print "$link is done!\n";
}
BTW: user1126070, what about trimming the images down to thumbnail size? Should I use Imager here? Can you suggest a solution for this? That would be great.
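For reference, here is a minimal sketch of how Imager (from CPAN, built with PNG support) could do the trimming, assuming $png holds the raw bytes returned by content_as_png() and $name is the target file name from the loop above:

use Imager;

# load the PNG bytes returned by content_as_png()
my $img = Imager->new;
$img->read(data => $png, type => 'png')
    or die "cannot read PNG data: ", $img->errstr;

# scale so the image fits into a 240x240 box, i.e. the long
# dimension becomes at most 240 pixels while keeping the aspect ratio
my $thumb = $img->scale(xpixels => 240, ypixels => 240, type => 'min')
    or die "cannot scale image: ", $img->errstr;

$thumb->write(file => $name, type => 'png')
    or die "cannot write $name: ", $thumb->errstr;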
End of update.
Here the problem outline continues, as written at the very beginning of this Q&A.
Problem outline: I have a list of 2,500 websites and need to grab a thumbnail screenshot of each of them. How do I do that? I could try to parse the sites with Perl; Mechanize would be a good thing. Note: I only need the results as thumbnails that are at most 240 pixels in the long dimension. At the moment I have a solution which is slow and does not produce thumbnails: how can I make the script run faster, with less overhead, while spitting out thumbnails?
But I have to be aware that setting it up can pose quite a challenge.
If all works as expected, you can simply use a script like this to dump images of the desired websites, but you should start Firefox and resize it to the desired width manually (height doesn't matter; WWW::Mechanize::Firefox always dumps the whole page).
What I have done so far is a lot: I work with mozrepl. At the moment I am struggling with timeouts: is there a way to specify the Net::Telnet timeout with WWW::Mechanize::Firefox? My internet connection is currently very slow, and sometimes I get this error with $mech->get():
command timed-out at /usr/local/share/perl/5.12.3/MozRepl/Client.pm line 186
See this one:
> $mech->repl->repl->timeout(100000);
Unfortunately it does not work: Can't locate object method "timeout" via package "MozRepl". The documentation says it should be:
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 180 } } );
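For what it is worth, a sketch of where that call might go, simply reusing the line quoted above; that it has to happen right after constructing the object and before the first get() is my assumption, not something taken from the docs:

use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

# raise the underlying Net::Telnet timeout to 3 minutes
# (the setup_client line is the one quoted from the documentation above)
$mech->repl->repl->setup_client( { extra_client_args => { timeout => 180 } } );

$mech->get('http://www.google.com');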
What I have tried already; here it is:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open(my $input, '<', 'urls.txt') or die $!;
while (<$input>) {
    chomp;
    print "$_\n";
    $mech->get($_);
    my $png  = $mech->content_as_png();
    my $name = $_;
    $name =~ s/^www\.//;
    $name .= ".png";
    open(my $out, '>', $name) or die "cannot write $name: $!";
    binmode $out;                       # PNG is binary data
    print {$out} $png;
    close($out);
    sleep(5);
}
close($input);
Well, this does not care about the size. See the command-line output:
linux-vi17:/home/martin/perl # perl mecha_test_1.pl
www.google.com
www.cnn.com
www.msnbc.com
command timed-out at /usr/lib/perl5/site_perl/5.12.3/MozRepl/Client.pm line 186
linux-vi17:/home/martin/perl #
And here is my source: see a snippet example of the sites I have in the URL list.
urls.txt - the list of sources:
www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com and so on...
BTW: With that many URLs we have to expect that some will fail, and we have to handle that: for example, we put the failed ones in an array or hash and retry them X times, as sketched below.
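A sketch of one way to cap the retries (the %tries hash and the limit of three attempts are illustrative choices of mine, not part of the code above):

my %tries;            # attempts per link
my $max_tries = 3;    # give up on a link after this many failures

while (scalar(@list)) {
    my $link = pop(@list);
    next if ++$tries{$link} > $max_tries;   # skip links that failed too often
    eval {
        $mech->get($link);
        # ... dump the PNG exactly as in the loop above ...
    };
    if ($@) {
        print "link: $link failed (attempt $tries{$link})\n";
        push(@list, $link);                 # re-queue it for another try
        next;
    }
    print "$link is done!\n";
}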
UTSL
Well, how about this one here...
sub content_as_png {
    my ($self, $tab, $rect) = @_;
    $tab ||= $self->tab;
    $rect ||= {};

    # Mostly taken from
    # http://wiki.github.com/bard/mozrepl/interactor-screenshot-server
    my $screenshot = $self->repl->declare(<<'JS');
function (tab,rect) {
    var browser = tab.linkedBrowser;
    var browserWindow = Components.classes['@mozilla.org/appshell/window-mediator;1']
        .getService(Components.interfaces.nsIWindowMediator)
        .getMostRecentWindow('navigator:browser');
    var win = browser.contentWindow;
    var body = win.document.body;
    if(!body) {
        return;
    };
    var canvas = browserWindow
        .document
        .createElementNS('http://www.w3.org/1999/xhtml', 'canvas');
    var left = rect.left || 0;
    var top = rect.top || 0;
    var width = rect.width || body.clientWidth;
    var height = rect.height || body.clientHeight;
    canvas.width = width;
    canvas.height = height;
    var ctx = canvas.getContext('2d');
    ctx.clearRect(0, 0, width, height);
    ctx.save();
    ctx.scale(1.0, 1.0);
    ctx.drawWindow(win, left, top, width, height, 'rgb(255,255,255)');
    ctx.restore();
    //return atob(
    return canvas
        .toDataURL('image/png', '')
        .split(',')[1]
    // );
}
JS
    my $scr = $screenshot->($tab, $rect);
    return $scr ? decode_base64($scr) : undef
};
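Judging from the signature quoted above (my ($self, $tab, $rect) = @_, with $tab ||= $self->tab), a coordinates hashref can apparently be passed to restrict the captured area. The following is only a sketch based on that reading of the source, not a call I have verified against the documentation:

# capture only a 1024x768 area starting at the top-left corner;
# undef for the tab argument lets the method fall back to $self->tab,
# exactly as the quoted source does
my $png = $mech->content_as_png(undef, {
    left   => 0,
    top    => 0,
    width  => 1024,
    height => 768,
});

open(my $out, '>', 'capture.png') or die $!;
binmode $out;
print {$out} $png;
close($out);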
I would love to hear from you!
Greetings, zero
Comments (1)
Have you tried this out? Is it working?
Put the links into @list and use eval.