Running the Perl hack "Yahoo! Directory Mindshare in Google"

Posted on 2025-01-03 14:20:30

Has anyone run the Perl script given at http://oreilly.com/pub/h/974#code?

It is a well-known script for getting URLs from the Yahoo! directory, and many people have used it successfully.

I am trying to get URLs with it. I created my own Google API key and put it in the code.
Apart from that, I made no changes.

The script produces neither an error nor any URLs.

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML_  _".
              "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);

sub mindshare { # for each link we find...
  my ($tag, %attr) = @_;

  print "$tag\n";

  # continue on only if the tag was a link,
  # and the URL matches Yahoo!'s redirectory.
  return if $tag ne 'a';
  return unless $attr{href} =~ /srd.yahoo/;
  return unless $attr{href} =~ /\*http/;

  # now get our real URL.
  $attr{href} =~ /\*(http.*)/; my $url = $1;

  print "hi";

  # and process each URL through Google.
  my $results = $google_search->doGoogleSearch(
                      $google_key,"link:$url", 0, 1,
                      "true", "", "false", "", "", ""
                ); # wheee, that was easy, guvner.
  $urls{$url} = $results->{estimatedTotalResultsCount};

  print "1\n";
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }

The program enters the loop and, on the first iteration, falls straight through to "my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;".

I don't understand Perl at all, but this task should have been trivial.

Surely I am missing something very obvious, because many people have used this script successfully.

Thanks in advance.
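
One way to narrow this down without touching the network or the Google API is to run the link filter from mindshare() on a few hand-written links. The following is only a sketch: extract_target is a hypothetical helper name, the sample hrefs are made up for illustration, and the dot in the pattern is escaped here, which the original script does not do. If every link on the fetched page ends up "skipped", %urls stays empty and the script prints nothing.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the filter inside mindshare(): return the
# target URL embedded in a Yahoo! redirect link, or undef when the link
# would be skipped. Arguments match what HTML::LinkExtor passes to its
# callback: the tag name followed by attribute name/value pairs.
sub extract_target {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a';
    my $href = $attr{href};
    return unless defined $href;
    return unless $href =~ /srd\.yahoo/;      # dot escaped, unlike the original
    return unless $href =~ /\*(http.*)/;      # the real URL follows the '*'
    return $1;
}

# Made-up sample links, for illustration only.
my @samples = (
    [ a   => ( href => 'http://srd.yahoo.com/drst/123/*http://example.com/' ) ],
    [ a   => ( href => 'http://example.org/plain-link' ) ],
    [ img => ( src  => 'http://example.org/logo.png' ) ],
);

for my $sample (@samples) {
    my ($tag, %attr) = @$sample;
    my $url = extract_target($tag, %attr);
    printf "%-4s %s\n", $tag, defined $url ? "kept: $url" : "skipped";
}
```

Printing $attr{href} for each 'a' tag in the real callback would show in the same way whether the page still contains any srd.yahoo redirect links.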

Comments (1)

双手揣兜 2025-01-10 14:20:30

Are you supplying a directory to the script? Because if you are not, and this line in your script

"/Computers_and_Internet/Data_Formats/XML_  _".
              "eXtensible_Markup_Language_/RSS/News_Aggregators/"

is not a formatting artefact, then you're trying to scrape a non-existent page.
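
A minimal sketch of how the script could guard against this, assuming that a directory path containing whitespace should be treated as suspect (that rule is my assumption, not something the original hack does). build_dir_url and fetch_dir are hypothetical helper names. LWP::Simple's get returns undef on failure, so the result is checked with defined rather than relying on $!.

```perl
use strict;
use warnings;

# Hypothetical helper: join the directory path onto the host, and return
# nothing when the path contains whitespace (assumed to be a pasting or
# formatting artefact rather than a real directory name).
sub build_dir_url {
    my ($dir) = @_;
    return if $dir =~ /\s/;
    return "http://dir.yahoo.com" . $dir;
}

# Hypothetical helper: fetch the directory page and fail loudly.
sub fetch_dir {
    my ($dir) = @_;
    my $url = build_dir_url($dir)
        or die "Suspicious directory path (contains whitespace): '$dir'\n";
    require LWP::Simple;                 # loaded only when a fetch is made
    my $data = LWP::Simple::get($url);   # undef if the request fails
    die "Could not fetch $url\n" unless defined $data;
    return $data;
}
```

In the script, my $data = fetch_dir($yahoo_dir); would replace the get(...) or die $! line. Even when the fetch succeeds, the site may serve some other page that has no redirect links, so it is still worth printing the links the callback sees.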
