preg_match(_all) fails to collect some data from Google

Posted on 2024-10-31 12:06:38

I'm creating a tool for my websites to see what position they hold in Google for different keywords.

Now, I want to collect this part of Google's source code:

<a href="http://www.test.com/" class=l onmousedown="return clk(this.href,'','','','1','','0CBoQFjAA')">Linktitle in Google!</a>

The problem is that neither preg_match nor preg_match_all matches "onmousedown", "this.href", or the ,'1' part of the link. And that is exactly the part I need...

Does anyone have an idea why this is and, more importantly, how to solve it?

The code I use is obvious. I even tried "/onmousedown/" and "/\'1\'/", but it didn't help.

Thank you very much!

Comments (4)

活泼老夫 2024-11-07 12:06:38

Besides the ethical and possible legal implications of scraping Google, you should not be using regular expressions to extract portions of HTML. Regular expressions were not designed to parse HTML and are not equipped to handle its grammar.

Try using an HTML parser, such as DOMDocument. It was designed to parse HTML/XML.
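
To illustrate the DOMDocument suggestion, here is a minimal sketch. It assumes the fetched page really contains anchors shaped like the sample in the question (class=l plus an onmousedown handler whose fifth clk() argument is the position). The helper name extractResultLinks() is made up for this example, and a small regex is still used on the attribute value itself, since that value is JavaScript rather than HTML.

```php
<?php
// Sketch only: pulls href, title and position out of result anchors.
// Assumptions: class="l" marks a result link and the fifth clk() argument
// is the position, as in the question's sample markup.
function extractResultLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//a[@class="l"]') as $anchor) {
        $position = null;
        // Skip the three empty arguments after this.href, then capture the number.
        $handler = $anchor->getAttribute('onmousedown');
        if (preg_match("/clk\(this\.href,(?:'[^']*',){3}'(\d+)'/", $handler, $m)) {
            $position = (int) $m[1];
        }
        $results[] = [
            'href'     => $anchor->getAttribute('href'),
            'title'    => trim($anchor->textContent),
            'position' => $position, // null when the handler is missing
        ];
    }
    return $results;
}

$sample = '<a href="http://www.test.com/" class=l onmousedown="return clk(this.href,\'\',\'\',\'\',\'1\',\'\',\'0CBoQFjAA\')">Linktitle in Google!</a>';
print_r(extractResultLinks($sample));
```

If position comes back as null for every link, it is worth dumping the raw response first: the attribute may simply not be present in the HTML your script receives.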

盗梦空间 2024-11-07 12:06:38

Use this code to parse the anchor tag of a Google search result:

function parseAnchor($strAnchor)
{
    // Example input:
    //$strAnchor = "<a onmousedown=\"return clk(this.href,'','','','2','','0CBwQFjAB')\" class=\"l\" href=\"http://php.net/manual/en/function.strpos.php\"><em>PHP</em>: strpos - Manual</a>";

    // Split on spaces. This relies on the exact attribute order of the example
    // (onmousedown, class, href) and breaks as soon as the markup changes.
    $str_parts = explode(" ", $strAnchor);

    // $str_parts[4] holds href="..."; take the text between the first and last double quote
    $start_index = strpos($str_parts[4], "\"");
    $length = strrpos($str_parts[4], "\"") - $start_index;
    $link = substr($str_parts[4], $start_index + 1, $length - 1);
    print $link; // prints the link

    // Now get the position: the fifth argument of clk(...)
    $onmousedown_parts = explode(",", $str_parts[2]);
    $position = trim($onmousedown_parts[4], "'");
    print "<br>$position"; // prints the position
}

Try this to parse the HTML page:

http://simplehtmldom.sourceforge.net/

虫児飞 2024-11-07 12:06:38

Use the PHP Class: Google Keyword Position for that

http://www.phpclasses.org/package/5554-PHP-Determine-the-position-of-a-keyword-in-Google.html

Example:

File: google_position.php

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
    <title>Google Keyword Position</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body >
  <form name="url_kw" action="search.php" method="get">
      <!-- User input is escaped with htmlspecialchars() before it is echoed back into the page -->
      <label for="url">URL:</label>
      <input type="text" name="url" id="url" size="55" value="<?= htmlspecialchars($_GET['url'] ?? 'http://', ENT_QUOTES) ?>" />
      <br />
      <label for="keyword">Keyword:</label>
      <input type="text" name="keyword" id="keyword" size="35" value="<?= htmlspecialchars($_GET['keyword'] ?? '', ENT_QUOTES) ?>" />
      <br />
      <input type="submit" name="submit_button" value="SEARCH" onclick="this.value='Searching...';" />
      <!-- HTTP_REFERER may be missing; fall back to this page and encode the value as a JS string -->
      <input type="button" value="CANCEL" onclick="javascript: window.location=<?= htmlspecialchars(json_encode($_SERVER['HTTP_REFERER'] ?? 'google_position.php'), ENT_QUOTES) ?>;" />
      <br />
  </form>
</body>
</html>

File: search.php

<?php
include('KeywordPosition.php');

// Both parameters come from the form in google_position.php
$url = $_GET['url'] ?? '';
$keyword = $_GET['keyword'] ?? '';

$position = new KeywordPosition($url, $keyword, 10); // you can change the 10 to 100 to get more results :)
$index = $position->GetPosition();
if ($index == -1) {
    echo 'Not in search results';
} else {
    echo 'You are at ' . $index;
}
?>

Live example @ http://x.co/Z493

骄傲 2024-11-07 12:06:38

According to Google, you are not allowed to scrape their website.

Their robots.txt is here: http://www.google.com/robots.txt.

That said, it's a bit hypocritical coming from a company whose whole business model is to scrape other people's websites.

Consider yourself warned.

The regex is simple:

<a [^<]*class=l.*?</a>

Now, for the folks who claim that HTML cannot be parsed with regex... yes, you are right, you cannot parse HTML with regex. But let's not be ridiculous here.

Extracting a specific block of text from an HTML page with a known format is definitely possible (and easy) to do in regex. That's what regex is for.

This is not "parsing HTML", and in a case such as this one, where the format is known and the application non-critical, regex does just fine.
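
As a rough sketch of how that pattern could be used from PHP: the version below wraps it in "~" delimiters (so the "/" in "</a>" needs no escaping) and then pulls the fifth clk() argument out of each matched anchor. The function name matchResultAnchors() is invented for this example, and the second pattern assumes the onmousedown format shown in the question.

```php
<?php
// Sketch only: regex-based extraction for markup with a known, fixed format.
function matchResultAnchors(string $html): array
{
    $found = [];
    // The pattern from this answer; the "s" flag lets "." span line breaks.
    if (preg_match_all('~<a [^<]*class=l.*?</a>~s', $html, $matches)) {
        foreach ($matches[0] as $anchor) {
            // Skip the three empty arguments after this.href, then capture the number.
            $hasPosition = preg_match(
                "~onmousedown=\"return clk\(this\.href,(?:'[^']*',){3}'(\d+)'~",
                $anchor,
                $m
            );
            $found[] = [
                'anchor'   => $anchor,
                'position' => $hasPosition ? (int) $m[1] : null, // null: attribute not in the HTML
            ];
        }
    }
    return $found;
}

$sample = '<a href="http://www.test.com/" class=l onmousedown="return clk(this.href,\'\',\'\',\'\',\'1\',\'\',\'0CBoQFjAA\')">Linktitle in Google!</a>';
print_r(matchResultAnchors($sample));
```

Note that the pattern only matches the unquoted class=l form; markup with class="l" would need the pattern adjusted.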


I just checked and there is an API from Google which allows you to make up to 100 queries for free on a custom search engine.
http://www.google.com/cse/
https://code.google.com/apis/console/?api=customsearch&pli=1#welcome

It requires a Google account and an API key, which you can get at the links above.
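
For the API route, a minimal sketch might look like the following. It assumes the Custom Search JSON endpoint at https://www.googleapis.com/customsearch/v1 with key, cx, q and start parameters, and a JSON response whose items entries carry a link field. The function names and the placeholder credentials are invented for this example, and returning -1 for "not found" simply mirrors the KeywordPosition example in the other answer.

```php
<?php
// Sketch only: builds a Custom Search request URL and looks up a host's position.
// YOUR_API_KEY and YOUR_ENGINE_ID below are placeholders, not real values.
function buildSearchUrl(string $apiKey, string $engineId, string $keyword, int $start = 1): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key'   => $apiKey,
        'cx'    => $engineId,
        'q'     => $keyword,
        'start' => $start, // 1-based index of the first result on this page
    ]);
}

// $response is the decoded JSON array; $offset is the number of results on earlier pages.
function findHostPosition(array $response, string $host, int $offset = 0): int
{
    foreach ($response['items'] ?? [] as $index => $item) {
        $itemHost = parse_url($item['link'] ?? '', PHP_URL_HOST);
        if (is_string($itemHost) && strcasecmp($itemHost, $host) === 0) {
            return $offset + $index + 1;
        }
    }
    return -1; // not in this page of results
}

// Usage (performs a network request, so it is left commented out):
// $json = file_get_contents(buildSearchUrl('YOUR_API_KEY', 'YOUR_ENGINE_ID', 'php strpos'));
// echo findHostPosition(json_decode($json, true) ?: [], 'php.net');
```

Keeping the lookup separate from the request makes it easy to test against a saved response without spending any of the free quota.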

Warning: wading through the legalese is going to be considerably harder than writing your scraper.
