Perl: problem with UserAgent looping while fetching a website
I'm able to grab the first image fine, but then the content seems to be looping inside itself. Not sure what I'm doing wrong.
#!/usr/bin/perl
use LWP::Simple;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
for(my $id=1;$id<55;$id++)
{
    my $response = $ua->get("http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID=" . $id );
    my $content = $response->content;
    for(my $id2=1;$id2<10;$id2++)
    {
        $content =~ /<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/(.*)$id2\.gif" alt=""\/>/;
        $url = "http://www.gamereplays.org/community/style_medals/" . $1 . $id2 . ".gif";
        print "--\n\r";
        print "ID: ".$id."\n\r";
        print "ID2: ".$id2."\n\r";
        print "URL: ".$url."\n\r";
        print "1: ".$1."\n\r";
        print "--\n\r";
        getstore($url, $1 . $id2 . ".gif");
    }
}
Comments (3)
As others have stated, this is really a job for HTML::Parser. Also, you should 'use strict;'. Note that getstore comes from LWP::Simple, so 'use LWP::Simple' can only be removed if the images are also downloaded through the LWP::UserAgent object.
You could change your regex to the following:
But you won't get style_medals/comp_graphics_10.gif, which may be what you want. I think something like the following would work better. My apologies for the style changes, but I can't resist modifying for PBP (Perl Best Practices).
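A minimal sketch of one way such a rewrite could look, still regex-based rather than built on HTML::Parser. The helper name medal_image_urls and the sample markup are assumptions made for illustration; they are not taken from the site or from the original answer. Instead of guessing the trailing number in an inner loop, it collects every distinct .gif under style_medals/ in one pass, so names such as comp_graphics_10.gif are picked up too.

```perl
use strict;
use warnings;

# Hypothetical helper: return every distinct medal image URL found in a page.
sub medal_image_urls {
    my ($content) = @_;
    my (%seen, @urls);

    # /g walks through all matches; [^"]+ keeps the capture inside the src quotes.
    while ($content =~ m{<img\s+src="(http://www\.gamereplays\.org/community/style_medals/[^"]+\.gif)"}g) {
        push @urls, $1 unless $seen{$1}++;
    }
    return @urls;
}

# Sample markup standing in for a fetched page (an assumption, not real site output).
my $sample = join "\n",
    '<img src="http://www.gamereplays.org/community/style_medals/comp_graphics_1.gif" alt=""/>',
    '<img src="http://www.gamereplays.org/community/style_medals/comp_graphics_10.gif" alt=""/>',
    '<img src="http://www.gamereplays.org/community/style_medals/comp_graphics_1.gif" alt=""/>';

print "$_\n" for medal_image_urls($sample);
```

In the real script, the helper would presumably be fed $response->decoded_content after checking $response->is_success, and each URL could be saved with $ua->mirror($url, $filename), which keeps everything on the LWP::UserAgent object.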
The problem comes in your regular expression. (.*) is greedy, meaning it will match as many characters as it can between style_medals/ and $id2.gif. When $id2 is 1, this is fine, but when $id2 is 2, it'll match everything up until 2.gif, which includes the full string from 1.gif.
Try making (.*) non-greedy by adding the ? non-greedy modifier: (.*?). This should fix your problem.
Edit: Ideally you wouldn't be using a regular expression to parse HTML, instead using something like, say, HTML::Parser.
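To make the greedy versus non-greedy difference concrete, here is a small sketch on a made-up line of markup. The file names alpha_1.gif and beta_1.gif are assumptions, and both tags are placed on one line because . does not match a newline by default.

```perl
use strict;
use warnings;

# Hypothetical sample: two medal images with the same trailing digit on one line.
my $html = '<img src="http://www.gamereplays.org/community/style_medals/alpha_1.gif" alt=""/>'
         . '<img src="http://www.gamereplays.org/community/style_medals/beta_1.gif" alt=""/>';

my $id2    = 1;
my $prefix = qr{<img src="http://www\.gamereplays\.org/community/style_medals/};

# Greedy: (.*) runs to the end of the line, then backtracks to the LAST "1.gif".
my ($greedy) = $html =~ /$prefix(.*)$id2\.gif" alt=""\/>/;

# Non-greedy: (.*?) stops at the FIRST "1.gif" after the prefix.
my ($lazy) = $html =~ /$prefix(.*?)$id2\.gif" alt=""\/>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Keep in mind that a non-greedy quantifier only shortens the capture from the leftmost position where the pattern can start. If an earlier tag with a different number sits on the same line, the match still begins at that earlier tag, so (.*?) alone does not help in that situation.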
I won't push the HTML parsing module (though HTML::LinkExtor can be your friend here...), as I understand the problems that can come with HTML parsers: if the HTML isn't properly valid, they often choke, whereas a simple regex can do the trick on anything, no matter how broken, as long as you're looking for the right thing.
As has been stated above by CanSpice, (.*) is greedy. The non-greedy modifier will usually do what you want. However, another option is to let it be greedy, but make sure it doesn't grab anything past the quoted src attribute of the image tag:
Note: I also modified it to not care if there's an alt attribute. However, I'm not familiar with the site you're grabbing things from.
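A minimal sketch of such a pattern, assuming the same URL prefix as in the question and a made-up pair of tags on a single line (comp_1.gif and comp_2.gif are illustrative names). The capture uses [^"]*, which is greedy but cannot run past the closing quote of the src attribute, and nothing after that quote is required, so the alt attribute may be present or absent.

```perl
use strict;
use warnings;

# Hypothetical sample: medal 1 and medal 2 on the same line, the second without alt="".
my $html = '<img src="http://www.gamereplays.org/community/style_medals/comp_1.gif" alt=""/>'
         . '<img src="http://www.gamereplays.org/community/style_medals/comp_2.gif">';

my $id2 = 2;

# [^"]* stays inside one src attribute, so the first tag (ending in 1.gif)
# simply fails to match and the regex engine moves on to the second tag.
my ($name) = $html =~ m{<img src="http://www\.gamereplays\.org/community/style_medals/([^"]*)$id2\.gif"};

# List assignment leaves $name undefined on failure, unlike a stale $1.
print defined $name ? "matched: $name\n" : "no match\n";
```

Checking whether the match succeeded matters here: after a failed match, $1 still holds the value from the last successful one, which is one way the original loop can appear to repeat the same image.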
If it's generated code, it should be fine unless they change something on a grand scale. To guard against that contingency without a proper HTML parser, you may want to write a mini-parser just for the image tags yourself. Extract the image tags into the keys of a hash (grab them with a regex like /<\s*(img\s+[^>]*)\s*>/); using a hash avoids dupes. Then, for each key in the hash, read everything inside quotes into separate storage and replace the quoted values with placeholders, so that no whitespace remains inside quotes. Split the result into attributes on whitespace: element 0 is the tag name, and the rest are attributes, which you split into name and value on the =, putting back the values you stored a moment ago. Attributes that have no value can be treated as something like '0E0', which keeps them true but effectively valueless.
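The steps above could be sketched roughly as follows. The helper name parse_img_tags, the NUL-delimited placeholders, and the sample markup are all assumptions made for illustration. It is a simplification: a > inside a quoted attribute value, for example, would still break the tag-extraction regex.

```perl
use strict;
use warnings;

# Hypothetical helper: parse every <img ...> tag in $html into a hash of attributes.
sub parse_img_tags {
    my ($html) = @_;

    # Collect the tag bodies as hash keys so duplicate tags collapse.
    my %tags;
    $tags{$1} = 1 while $html =~ m{<\s*(img\s+[^>]*?)\s*/?\s*>}gi;

    my @parsed;
    for my $tag (sort keys %tags) {
        # Move quoted values into @stash, leaving a whitespace-free placeholder.
        my @stash;
        (my $flat = $tag) =~ s{(["'])(.*?)\1}{ push @stash, $2; "\0" . $#stash . "\0" }gse;
        $flat =~ s/\s*=\s*/=/g;    # tolerate spaces around "="

        # Element 0 is the tag name; the rest are attributes.
        my ($name, @pairs) = split ' ', $flat;
        my %attr;
        for my $pair (@pairs) {
            my ($key, $value) = split /=/, $pair, 2;
            $value = '0E0' unless defined $value;    # valueless: true, but numerically zero
            $value =~ s/\0(\d+)\0/$stash[$1]/g;      # put the stored quoted text back
            $attr{ lc $key } = $value;
        }
        push @parsed, \%attr;
    }
    return @parsed;
}

# Made-up markup: a duplicated tag, inconsistent quoting, and a valueless attribute.
my $html = '<img src="medal one.gif" alt=""/> <IMG ismap src=plain.gif > <img src="medal one.gif" alt=""/>';
for my $attr (parse_img_tags($html)) {
    print join(', ', map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
}
```

Stashing the quoted values before splitting is what lets a value such as "medal one.gif" survive the whitespace split, and accepting unquoted values covers some of the inconsistent quoting found in handwritten markup.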
If it's handwritten code, however, you may be up against some nightmares because many people aren't consistent with their use of quotes on attributes, if they use them at all.