You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
That being said, there exists an official API to query web search programmatically.
The JSON/Atom Custom Search API lets you develop websites and programs to retrieve and display search results from your Google Custom Search programmatically. With this API, you can use RESTful requests to get search results in either JSON or Atom format.
If you need to search over a wider variety of hosts, you'll need to use the older, deprecated Web Search API, but that will limit the number of queries you can make per day.
Barring that, you'll need to do a lot of HTML scraping and parsing.
Here is what a simple script could look like (and yes, it violates the TOS, so it's just a proof of concept and you shouldn't use it):
use 5.010;
use strict;
use warnings;
use WWW::Mechanize;
use URI::Escape qw(uri_escape_utf8);

my $mech = WWW::Mechanize->new;
# highest result offset to fetch per search, taken from the command line
my $option = shift(@ARGV) // 0;
# you may customize your Google search by editing this URL (always end it with "q=" though)
my $google = 'http://www.google.co.uk/search?q=';
my @dork = ( "this is my search one", "this is my search two" );

# start the main loop, one iteration for every Google search
for my $query (@dork) {
    # reset the result offset so every search starts at the first page
    my $max = 0;
    # loop until the chosen maximum number of results is reached
    while ( $max <= $option ) {
        # escape the query so spaces and special characters survive in the URL
        $mech->get( $google . uri_escape_utf8($query) . "&start=" . $max );
        # print every link that is neither relative nor pointing back at Google
        for my $link ( $mech->links() ) {
            my $google_url = $link->url;
            if ( $google_url !~ /^\// && $google_url !~ /google/ ) {
                say $google_url;
            }
        }
        $max += 10;
    }
}
By the way, I wrote this a while back, so it's not exactly up to par, but it does the job, and I am too lazy to boot Linux to find the newer version of it...
Before proceeding, please be aware of the Google Terms of Service.
To query the official API mentioned above, you can use XML::Atom::Client, LWP with JSON::Any, or many other libraries to perform the REST calls.
(You may still find references to the older Google Web Search API, but it is deprecated and limited.)
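For what it's worth, here is a rough sketch of what such a REST call could look like against the JSON flavour of the Custom Search API. Instead of LWP and JSON::Any it uses HTTP::Tiny and JSON::PP, simply because both ship with recent Perls (HTTPS additionally needs IO::Socket::SSL installed). The sub names and the GOOGLE_API_KEY / GOOGLE_CX environment variables are made up for this example; you would need your own API key and search engine ID. Treat it as an illustration, not a reference client.

```perl
use 5.010;
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Endpoint of the Custom Search JSON API.
my $ENDPOINT = 'https://www.googleapis.com/customsearch/v1';

# Build the request URL. $key and $cx stand for your own API key and
# search engine ID; $start is the 1-based index of the first result.
sub build_search_url {
    my ( $key, $cx, $query, $start ) = @_;
    # An array ref keeps the parameters in the order given here.
    my $params = HTTP::Tiny->new->www_form_urlencode(
        [ key => $key, cx => $cx, q => $query, start => $start // 1 ]
    );
    return "$ENDPOINT?$params";
}

# Turn a JSON response body into a list of { title, link } hashes.
# A response without an "items" array yields an empty list.
sub extract_results {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return map { +{ title => $_->{title}, link => $_->{link} } }
        @{ $data->{items} || [] };
}

# Perform one request and return the parsed results.
sub search {
    my ( $key, $cx, $query, $start ) = @_;
    my $res = HTTP::Tiny->new->get( build_search_url( $key, $cx, $query, $start ) );
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    return extract_results( $res->{content} );
}

# Only touch the network when credentials are supplied through the
# (hypothetical) GOOGLE_API_KEY and GOOGLE_CX environment variables.
if ( $ENV{GOOGLE_API_KEY} && $ENV{GOOGLE_CX} ) {
    my @results = search( @ENV{qw(GOOGLE_API_KEY GOOGLE_CX)}, 'this is my search one' );
    say "$_->{title}\n  $_->{link}" for @results;
}
```

To page through results you would call search() again with a larger start value (11, 21, ...), much like the &start= parameter in the scraping script, except that this way you stay within the terms of the API.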
Take a look at the Google Custom Search API:
http://code.google.com/apis/customsearch/