Perl problem with UserAgent looping while fetching a website

Posted on 2024-10-11 16:41:30

I'm able to grab the first image fine, but then the content seems to be looping inside itself. Not sure what I'm doing wrong.

#!/usr/bin/perl
use LWP::Simple;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
for(my $id=1;$id<55;$id++)
{
    my $response = $ua->get("http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID=" . $id );
    my $content = $response->content;    
        for(my $id2=1;$id2<10;$id2++)
        {
                $content =~ /<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/(.*)$id2\.gif" alt=""\/>/;
                $url = "http://www.gamereplays.org/community/style_medals/" . $1 . $id2 . ".gif";
  print "--\n\r";
  print "ID: ".$id."\n\r";
  print "ID2: ".$id2."\n\r";
  print "URL: ".$url."\n\r";
  print "1: ".$1."\n\r";
  print "--\n\r";
  getstore($url, $1 . $id2 . ".gif");
        }
}

Comments (3)

红尘作伴 2024-10-18 16:41:30

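As others have stated, this is really a job for HTML::Parser. Also, you should 'use strict;'. Note that getstore is exported by LWP::Simple, so that module has to stay loaded; importing only getstore keeps it from clashing with LWP::UserAgent.

You could change your regex to the following: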

$content =~ m{http://www\.gamereplays\.org/community/style_medals/([\w\_]+)$id2\.gif}s;

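But you won't get style_medals/comp_graphics_10.gif, which may be what you want. I think something like the following would work better. My apologies for the style changes, but I can't resist modifying for PBP (Perl Best Practices).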

#!/usr/bin/perl

use strict;
use warnings;
use Carp;
use LWP::UserAgent;
use LWP::Simple qw(getstore);    # getstore() is exported by LWP::Simple

my $ua = LWP::UserAgent->new();

# Fetch pages 1 to 54 (the loop stops before 55).  Are we sure there are
# no later pages?  Perhaps consider running until a 404 is found.
for (my $id = 1; $id < 55; $id++) {

    # Get the page data
    my $response = $ua->get(
        'http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID=' . $id
    );

    # Check for failure and abort
    if (!$response->is_success) {
        croak 'Request failed! ' . $response->status_line();
    }

    my $content = $response->content();

    # Run this loop each time we find the url.  The substitution deletes
    # the matched tag, so the next pass finds the following image.
  CONTENT_LOOP:
    while ($content =~ s{<img src="(http://www\.gamereplays\.org/community/style_medals/([^\"]+))" }{}ms) {

        my $url   = $1;  # The entire url, no need to recreate the domain
        my $file  = $2;  # Just the file name portion
        my ($id2) = $file =~ m{ _(\d+)\.gif \Z}xms; # extract id2 for debug

        next CONTENT_LOOP if !defined $id2;         # Handle SOTW.gif file(s)

        # Display stats about each id found
        print "--\n";
        print "ID:  $id\n";
        print "ID2: $id2\n";
        print "URL: $url\n";
        print "1:   $file\n";
        print "--\n";

        # You might want to consider involving the $id in the filename as
        # you could have the same filename on multiple pages
        getstore($url, $file);
    }
}

捶死心动 2024-10-18 16:41:30

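The problem is in your regular expression. (.*) is greedy: it matches as many characters as it can between style_medals/ and $id2.gif. When $id2 is 1 this is fine, but when $id2 is 2 it matches everything up to 2.gif, which includes the full string from 1.gif.

Try making (.*) non-greedy by adding the ? modifier: (.*?). This should fix your problem.

Edit: Ideally you wouldn't use a regular expression to parse HTML at all, and would use something like HTML::Parser instead.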
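
As a minimal sketch of greedy versus non-greedy matching, the snippet below runs both forms of the pattern against a made-up fragment. The file names comp_1.gif and sotw_1.gif and the single-line layout are assumptions for illustration, not taken from the real page.

```perl
use strict;
use warnings;

# Made-up fragment: two medal images on one line, both ending in "1.gif".
my $base    = 'http://www.gamereplays.org/community/style_medals/';
my $content = '<img src="' . $base . 'comp_1.gif" alt=""/>'
            . '<img src="' . $base . 'sotw_1.gif" alt=""/>';
my $id2     = 1;

# Greedy: (.*) runs on to the LAST "1.gif" on the line.
my ($greedy) = $content =~ m{<img src="\Q$base\E(.*)$id2\.gif" alt=""/>};

# Non-greedy: (.*?) stops at the FIRST "1.gif" after the match start.
my ($lazy) = $content =~ m{<img src="\Q$base\E(.*?)$id2\.gif" alt=""/>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Here the greedy capture swallows the whole first tag plus the start of the second, while the non-greedy capture is just the prefix of the first file name.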

转瞬即逝 2024-10-18 16:41:30

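I won't push the HTML parsing module (though LinkExtor can be your friend here...), as I understand the problems that can come with HTML parsers: if the HTML isn't properly valid they often choke, whereas a simple regex can do the trick on anything, no matter how broken, as long as you're looking for the right thing.

As CanSpice stated above, (.*) is greedy. The non-greedy modifier will usually do what you want. However, another option is to let it be greedy but make sure it can't grab anything past the quoted src attribute of the image tag: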

/<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/([^"]*)$id2\.gif"[^>]*>/

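Note: I also modified it so it doesn't care whether there's an alt attribute. However, I'm not familiar with the site you're grabbing things from.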
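
A small sketch of why bounding the capture with [^"]* can matter: a non-greedy match still begins at the leftmost <img> tag, so when an earlier tag sits on the same line it can still be swallowed. The fragment below is made up for illustration (the comp_1.gif and comp_2.gif names and the one-line layout are assumptions).

```perl
use strict;
use warnings;

# Made-up fragment: the wanted image (comp_2.gif) follows another tag
# on the same line.
my $base    = 'http://www.gamereplays.org/community/style_medals/';
my $content = '<img src="' . $base . 'comp_1.gif" alt=""/>'
            . '<img src="' . $base . 'comp_2.gif" alt=""/>';
my $id2     = 2;

# Non-greedy: the match still starts at the first <img>, so the capture
# runs across the first tag until "2.gif" finally appears.
my ($lazy) = $content =~ m{<img src="\Q$base\E(.*?)$id2\.gif" alt=""/>};

# Bounded: [^"]* cannot cross the closing quote of src, so the first tag
# fails to match and the regex engine moves on to the second one.
my ($bounded) = $content =~ m{<img src="\Q$base\E([^"]*)$id2\.gif"[^>]*>};

print "lazy:    $lazy\n";
print "bounded: $bounded\n";
```

On this input only the bounded pattern yields a clean file-name prefix; the non-greedy capture contains a quote character, which is an easy thing to check for when debugging.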

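If it's generated code, it should be fine unless they change something on a grand scale. To guard against that even without a proper HTML parser, you may want to write a mini-parser just for the image tags yourself. Extract the image tags into the keys of a hash (grab them with a regex like /<\s*(img\s+[^>]*)\s*>/); using a hash avoids dupes. Then, for each key in the hash, read everything inside quotes into separate storage and replace the quoted values so that no whitespace remains inside quotes. Next, split the tag into attributes on whitespace, with element 0 being the tag name and the rest being attributes. Split each attribute on the = sign and put back the value you stored a moment ago, or treat it as something like '0E0' when it has no value (thus keeping it true but effectively valueless).

If it's handwritten code, however, you may be up against some nightmares, because many people aren't consistent with their use of quotes on attributes, if they use them at all.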
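
The mini-parser described above could be sketched roughly as follows. The helper name parse_img_tags, the placeholder scheme and the sample HTML are all assumptions made for illustration; it follows the steps in the answer rather than any existing library, and it will still trip over a ">" inside a quoted value.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hash reference per distinct <img> tag,
# shaped like { tag => 'img', attr => { src => ..., alt => ... } }.
sub parse_img_tags {
    my ($html) = @_;

    # Step 1: collect tag bodies as hash keys so duplicates collapse.
    my %seen;
    while ($html =~ m{<\s*(img\s[^>]*?)\s*/?>}gi) {
        $seen{$1} = 1;
    }

    my @tags;
    for my $body (sort keys %seen) {
        # Step 2: move quoted values aside and leave a placeholder behind,
        # so whitespace or "=" inside quotes cannot confuse the splits below.
        my @stash;
        (my $flat = $body) =~ s{(["'])(.*?)\1}{ push @stash, $2; "\x00" . $#stash }gse;
        $flat =~ s/\s*=\s*/=/g;    # tolerate spaces around "="

        # Step 3: split on whitespace; element 0 is the tag name.
        my ($name, @pairs) = split ' ', $flat;

        # Step 4: split each attribute on "=" and restore stashed values.
        my %attr;
        for my $pair (@pairs) {
            my ($key, $value) = split /=/, $pair, 2;
            if (!defined $value) {
                $value = '0E0';    # valueless attribute: true, numerically zero
            }
            elsif ($value =~ /\A\x00(\d+)\z/) {
                $value = $stash[$1];
            }
            $attr{lc $key} = $value;
        }
        push @tags, { tag => lc $name, attr => \%attr };
    }
    return @tags;
}

# Made-up input: a duplicated tag plus one with unquoted and valueless attributes.
my $html = '<img src="http://www.gamereplays.org/community/style_medals/comp_1.gif" alt="Gold medal"/>'
         . '<IMG border=0 ismap src="http://www.gamereplays.org/community/style_medals/comp_2.gif">'
         . '<img src="http://www.gamereplays.org/community/style_medals/comp_1.gif" alt="Gold medal"/>';

for my $img (parse_img_tags($html)) {
    print "$img->{attr}{src}\n";
}
```

Stashing the quoted values first is what lets the later whitespace split stay simple: a value such as "Gold medal" never reaches the split, and unquoted or valueless attributes fall through the same code path.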
