My first Perl script: using the "get($url)" method in a loop?

Posted 2024-07-12 10:03:08

So it seemed easy enough. Use a series of nested loops to go through a ton of URLs sorted by year/month/day and download the XML files.
As this is my first script, I started with the loop; something familiar in any language. I ran it just printing the constructed URLs and it worked perfectly.

I then wrote the code to download the content and save it separately, and that worked perfectly as well with a sample URL on multiple test cases.

But when I combined these two bits of code, it broke: the program just got stuck and did nothing at all.

I therefore ran the debugger and as I stepped through it, it became stuck on this one line:

warnings::register::import(/usr/share/perl/5.10/warnings/register.pm:25):25:vec($warnings::Bits{$k}, $warnings::LAST_BIT, 1) = 0;

If I just hit r to return from the subroutine, it works and continues to another point on its way back down the call stack, where something similar happens over and over for some time. The stack trace:

warnings::register::import('warnings::register') called from file `/usr/lib/perl/5.10/Socket.pm' line 7
Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'Socket.pm' called from file `/usr/lib/perl/5.10/IO/Socket.pm' line 12
IO::Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'IO/Socket.pm' called from file `/usr/share/perl5/LWP/Simple.pm' line 158
LWP::Simple::_trivial_http_get('www.aDatabase.com', 80, '/sittings/1987/oct/20.xml') called from file `/usr/share/perl5/LWP/Simple.pm' line 136
LWP::Simple::_get('http://www.aDatabase.com/1987/oct/20.xml') called from file `xmlfetch.pl' line 28

As you can see, it is getting stuck inside this "get($url)" method, and I have no clue why.
Here is my code:

#!/usr/bin/perl

use LWP::Simple;

$urlBase = 'http://www.aDatabase.com/subheading/';
$day=1;
$month=1;
@months=("list of months","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec");
$year=1987;
$nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";
    
while($year<=2006)
    {
    $month=1;
    while($month<=12)
        {
        $day=1;
        while($day<=31)
            {
            $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            $content = get($newUrl);
            if($content ne $nullXML)
                {
                $filename = "$year-$month-$day.xml";
                open(FILE, ">$filename");
                print FILE $content;
                close(FILE);
                }
            $day++;
            }
        $month++;
        }
    $year++;
    }

I am almost positive it is something tiny I just don't know, but Google has not turned up anything.

EDIT: It's official, it just hangs for a long time inside this get method, runs for several loops, then hangs again for a while. But it's still a problem. Why is this happening?

Comments (3)

囍笑 2024-07-19 10:03:08

Since http://www.adatabase.com/1987/oct/20.xml is a 404 (and isn't something that can be generated from your program anyway, as there is no 'subheading' in the path), I'm assuming that isn't the real link you are using, which makes it hard for us to test. As a general rule, please use example.com instead of making up hostnames; that's why it is reserved.

You should really

use strict;
use warnings;

in your code - this will help highlight any scoping issues you may have (I'd be surprised if it was the case, but there is a chance that a part of the LWP code is messing around with your $urlBase or something). I think it should be enough to change the initial variable declarations (and $newUrl, $content and $filename) to put 'my' in front to make your code strict.

If using strict and warnings doesn't get you any closer to a solution, you could warn out the link you are about to use on each loop iteration, so that when it sticks you can try it in a browser and see what happens. Alternatively, a packet sniffer (such as Wireshark) could give you some clues.
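The logging idea above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual script: `build_urls` and `fetch_logged` are hypothetical helper names, the example.com base URL is a placeholder, and the fetcher is passed in as a code reference so the sketch runs offline with a stub.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every URL the nested loops would request.
sub build_urls {
    my ($base, $first_year, $last_year) = @_;
    my @months = qw(jan feb mar apr may jun jul aug sep oct nov dec);
    my @urls;
    for my $year ($first_year .. $last_year) {
        for my $month (@months) {
            for my $day (1 .. 31) {
                push @urls, "$base$year/$month/$day.xml";
            }
        }
    }
    return @urls;
}

# Hypothetical helper: warns (to STDERR) with each URL before fetching it,
# so the last line printed identifies the request that is hanging.
sub fetch_logged {
    my ($fetch, @urls) = @_;
    my %content;
    for my $url (@urls) {
        warn "Fetching $url\n";
        $content{$url} = $fetch->($url);
    }
    return \%content;
}

# A stub fetcher keeps this sketch offline; a real run would pass
# \&LWP::Simple::get (after "use LWP::Simple;") instead.
my @urls = build_urls('http://www.example.com/subheading/', 1987, 1987);
my $got  = fetch_logged(sub { "stub for $_[0]" }, @urls[0 .. 2]);
print scalar(@urls), " URLs built\n";
```

Because every variable is declared with `my`, the sketch compiles cleanly under `use strict`, and the last "Fetching ..." line on STDERR is the URL to try by hand in a browser.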

寄意 2024-07-19 10:03:08

(2006 - 1986) * 12 * 31 is more than 7000 requests. Requesting that many web pages without a pause is not nice.
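To make the arithmetic and the pause concrete, here is a small sketch. `fetch_politely` is a hypothetical helper, the one-second figure is only an assumed pause, and the fetcher is a stub so nothing touches the network.

```perl
use strict;
use warnings;

# 20 years, each probed as 12 months of 31 candidate days.
my $requests = (2006 - 1986) * 12 * 31;
print "Requests: $requests\n";

# Hypothetical helper: calls $fetch on each URL and sleeps $pause seconds
# between consecutive requests (no sleep before the first one).
sub fetch_politely {
    my ($fetch, $pause, @urls) = @_;
    my @results;
    for my $i (0 .. $#urls) {
        sleep $pause if $i > 0 && $pause > 0;
        push @results, $fetch->($urls[$i]);
    }
    return @results;
}

# Stub fetcher and a zero pause keep the demonstration instant and offline.
my @results = fetch_politely(
    sub { length $_[0] }, 0,
    'http://www.example.com/a.xml',
    'http://www.example.com/b.xml',
);
```

With an assumed pause of one second, the 7440 requests would add roughly two hours of waiting, which is usually an acceptable price for not hammering the remote server.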

Slightly more Perl-like version (code-style wise):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple qw(get);    

my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $nullXML = <<'NULLXML';
<?xml version="1.0" encoding="UTF-8"?>
<nil-classes type="array"/>
NULLXML

for my $year (1987..2006) {
    for my $month (0..$#months) {
        for my $day (1..31) {
            my $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            my $content = get($newUrl);
            next unless defined $content;   # get() returns undef on failure
            if ($content ne $nullXML) {
               my $filename = "$year-@{[$month+1]}-$day.xml";
               open my $fh, '>', $filename
                   or die "Can't open '$filename': $!";
               print $fh $content;
               # $fh implicitly closed
            }
        }
    }
}

幸福不弃 2024-07-19 10:03:08

LWP has a getstore function that does most of the fetching and saving work for you. You might also check out LWP::Parallel::UserAgent for a bit more control over how you hit the remote site.
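As a rough sketch of the getstore approach: `getstore` and `is_success` are exported by LWP::Simple, while `save_url` is a hypothetical wrapper added here only for illustration, and the URL and filename in the comment are placeholders.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical wrapper: downloads $url into $file and returns true when the
# HTTP status reported by getstore is a success (2xx) code.
sub save_url {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);   # getstore returns the HTTP status code
    warn "Failed to fetch $url: HTTP $status\n" unless is_success($status);
    return is_success($status);
}

# Example call (placeholder URL and filename):
#   save_url('http://www.example.com/subheading/1987/oct/20.xml', '1987-10-20.xml');
```

Note that getstore writes whatever the server returns, so the check against the empty "nil-classes" document from the question would have to happen after the file is saved rather than before.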