Running the Perl hack from "Yahoo! Directory Mindshare in Google"
Has anyone run the Perl script given at http://oreilly.com/pub/h/974#code ?
It is a well-known script for getting URLs from the Yahoo! directory, and many people have used it successfully.
I was trying to get URLs with it. I created my own Google API key and put it in the code.
Apart from that, I did not change anything.
The script produces neither an error nor any URLs.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML_ _".
                  "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);

sub mindshare { # for each link we find...
    my ($tag, %attr) = @_;
    print "$tag\n";

    # continue on only if the tag was a link,
    # and the URL matches Yahoo!'s redirectory.
    return if $tag ne 'a';
    return unless $attr{href} =~ /srd.yahoo/;
    return unless $attr{href} =~ /\*http/;

    # now get our real URL.
    $attr{href} =~ /\*(http.*)/; my $url = $1;
    print "hi";

    # and process each URL through Google.
    my $results = $google_search->doGoogleSearch(
        $google_key, "link:$url", 0, 1,
        "true", "", "false", "", "", ""
    ); # wheee, that was easy, guvner.
    $urls{$url} = $results->{estimatedTotalResultsCount};
    print "1\n";
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }
The program goes into the loop and, on the first iteration, comes out to the line "my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;".
I don't know any Perl, but this task should have been trivial.
Surely I am missing something very obvious, because this script has been used successfully by many people.
Thanks in advance.
Comments (1)
Are you supplying a directory to the script? If you are not, and the line in your script that sets the default $yahoo_dir
is not a formatting artefact, then you're trying to scrape a non-existent page.
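
One way to check the point above before any network request is made is to look at the path string itself. The following is only a minimal sketch: check_dir_path is a hypothetical helper, not part of the original hack, and it assumes that a directory path containing whitespace is a sign of a copy-and-paste or formatting problem. It does not know what the correct path is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns 1 if the
# directory path contains no whitespace, 0 otherwise.
sub check_dir_path {
    my ($path) = @_;
    return $path =~ /\s/ ? 0 : 1;
}

# The default exactly as posted in the question, including the gap in "XML_ _".
my $as_posted = "/Computers_and_Internet/Data_Formats/XML_ _"
              . "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# Use the first command-line argument if given, as the original script does.
my $yahoo_dir = shift || $as_posted;

printf "%s: %s\n",
    check_dir_path($yahoo_dir) ? "looks plausible" : "contains whitespace",
    $yahoo_dir;
```

If the default turns out to be the problem, passing a directory path as the first command-line argument bypasses it, since the script reads that argument with shift before falling back to the default.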