我可以使用什么正则表达式从 Google 搜索中提取 URL?

发布于 2024-08-18 21:24:54 字数 841 浏览 5 评论 0原文

我将 Delphi 与 JCLRegEx 结合使用,并希望从 google 搜索中捕获所有结果 URL。我查看了 HackingSearch.com,他们有一个看起来正确的示例正则表达式,但当我尝试时我无法得到任何结果。

我使用它类似于:

Var re:JVCLRegEx;
    I:Integer; 
Begin
  re := TJclRegEx.Create;

  With re do try
    Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);  
    If match(memo1.lines.text) then begin
      For I := 0 to captureCount -1 do
        memo2.lines.add(captures[1]);
    end;
  finally free;
  end;
  freeandnil(re);
end;

正则表达式可在 hackingsearch.com

我正在使用 Delphi Jedi 版本,因为每次我安装 TPerlRegEx 时都会与两者发生冲突......

I'm using Delphi with the JCLRegEx and want to capture all the result URL's from a google search. I looked at HackingSearch.com and they have an example RegEx that looks right, but I cannot get any results when I try it.

I'm using it similar to:

Var re:JVCLRegEx;
    I:Integer; 
Begin
  re := TJclRegEx.Create;

  With re do try
    Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);  
    If match(memo1.lines.text) then begin
      For I := 0 to captureCount -1 do
        memo2.lines.add(captures[1]);
    end;
  finally free;
  end;
  freeandnil(re);
end;

Regex is available at hackingsearch.com

I'm using the Delphi Jedi version, since everytime I install TPerlRegEx I get a conflict with the two...

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

笑,眼淚并存 2024-08-25 21:24:54

题外话:您可以尝试 Google AJAX 搜索 API:http://code.google.com/apis /ajaxsearch/文档/

Offtopic: You can try Google AJAX Search API: http://code.google.com/apis/ajaxsearch/documentation/

青春如此纠结 2024-08-25 21:24:54

以下是 Google 搜索结果中术语 python tuple 的相关部分。 (我通过在这里和那里添加新行来修改它以适应屏幕,但我在从 Google 源获取的原始字符串上测试了您的正则表达式,如 Firebug 所示)。您的正则表达式没有给出该字符串的匹配项。

<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')" 
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
  </h3>
  <span style="display: inline-block;">
    <button class="w10">
    </button>
    <button class="w20">
    </button>
  </span>
  <span class="m"> <span dir="ltr">- 2 visits</span> <span dir="ltr">- Jan 21</span></span>
  <div class="s">
  The data structures available in <em>python</em> are lists, <em>tuples</em>
   and dictionaries. Sets are available in the sets library (but are built-in in <em>
  Python</em> 2.5 and <b>...</b><br>
  <cite>
    www.korokithakis.net/tutorials/<b>
    python</b>
     - 
  </cite>
  <span class="gl">
    <a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&sig2=4qxw5AldSTW70S01iulYeA')" 
      href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&cd=2&hl=en&ct=clnk&client=firefox-a">
      Cached
    </a>
     - <button title="Comment" class="wci">
    </button>
    <button class="w4" title="Promote">
    </button>
    <button class="w5" title="Remove">
    </button>
  </span>
  </div>
  <div class="wce">
  </div>
  <!--n-->
  <!--m-->
</li>

FWIW,我想众多原因之一是这个结果中根本没有 。我从 Firebug 复制了完整的 html 源代码并尝试将其与您的正则表达式匹配 - 根本没有任何匹配。

Google 可能会不时改变显示结果的方式 - 在给定时间,它可能会根据您的登录状态、网络历史记录等因素而变化。您提出的特定正则表达式可能目前适合您,但从长远来看,它会变得难以维护。人们建议使用 html 解析器而不是给出正则表达式,因为他们知道解决方案不稳定。

Below is a relevant section from Google search results for the term python tuple. (I modified it to fit the screen here by adding new lines here and there, but I tested your regex on the raw string obtained from Google's source as revealed by Firebug). Your regex gave no matches for this string.

<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')" 
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
  </h3>
  <span style="display: inline-block;">
    <button class="w10">
    </button>
    <button class="w20">
    </button>
  </span>
  <span class="m"> <span dir="ltr">- 2 visits</span> <span dir="ltr">- Jan 21</span></span>
  <div class="s">
  The data structures available in <em>python</em> are lists, <em>tuples</em>
   and dictionaries. Sets are available in the sets library (but are built-in in <em>
  Python</em> 2.5 and <b>...</b><br>
  <cite>
    www.korokithakis.net/tutorials/<b>
    python</b>
     - 
  </cite>
  <span class="gl">
    <a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&sig2=4qxw5AldSTW70S01iulYeA')" 
      href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&cd=2&hl=en&ct=clnk&client=firefox-a">
      Cached
    </a>
     - <button title="Comment" class="wci">
    </button>
    <button class="w4" title="Promote">
    </button>
    <button class="w5" title="Remove">
    </button>
  </span>
  </div>
  <div class="wce">
  </div>
  <!--n-->
  <!--m-->
</li>

FWIW, I guess one of the many reasons is that there is no <Va> in this result at all. I copied the full html source from Firebug and tried to match it with your regex - didn't get any match at all.

Google might change the way they display the results from time to time - at a given time, it can vary depending on factors like your logged in status, web history etc. The particular regex you came up with might be working for you for now, but in the long run it will become difficult to maintain. People suggest using html parser instead of giving a regex because they know that the solution won't be stable.

囚你心 2024-08-25 21:24:54

如果您需要调试任何语言的正则表达式,您需要查看 RegExBuddy,它不是免费的,但它会一天之内就能收回成本。

If you need to debug regular expressions in any language you need to look at RegExBuddy, its not free but it will pay for itself in a day.

×纯※雪 2024-08-25 21:24:54
class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>

目前有效。

class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>

works for now.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文