Groovy: scraping Google Search with HTTPBuilder - result doesn't seem to parse as HTML or XML

Posted on 2024-12-06 02:12:09

I am writing a simple Groovy script to request simple searches from Google Search and then parse the result set. I know that there is the Custom Search API - but that won't work for me, so please don't point me in that direction.

I am using HTTPBuilder to make the request. The other approaches I tried ("string".toURL(), HTMLCleaner, ...) all get an HTTP 403 response when they make the call. I assume this is because the request headers they send are not acceptable to Google.

With HTTPBuilder I can make the request and get a non-403 response. However, when I do a println on "html" (see the code snippet below), the output does not look like HTML or XML. It looks like plain text.

Here is the HTTPBuilder snippet that gets the response:

    // build query
    def query = ""
    queryTerms.eachWithIndex({ term, i -> (i > 0) ? (query += "+" + term) : (query += term) })

    def http = new HTTPBuilder(baseUrl)

    http.request(Method.GET, ContentType.TEXT) { req ->
        headers.'User-Agent' = 'Mozilla/5.0'
    }

    def html = http.get(path : searchPath, contentType : ContentType.HTML, query : [q:query])
    // println html
    assert html instanceof groovy.util.slurpersupport.GPathResult
    assert html.HEAD.size() == 1
    assert html.BODY.size() == 1

I do get a result back, so I try to parse it as shown below. I give the actual structure first and then the parsing code. However, none of the parsed elements contain anything.

Actual Structure:

html->body#gsr->div#main->div->div#cnt->div#rcnt->div#center_col->div#res.med->div#search->div#ires->ol#rso->

Code:

    def mainDiv = html.body.div.findAll { it.@id.text() == 'main' }
    println mainDiv
    def rcntDiv = mainDiv.div.div.div.findAll { it.@id.text() == 'rcnt' }
    println rcntDiv
    def searchDiv = rcntDiv.div.findAll { it.@id.text() == "center_col" }.div.div.findAll { it.@id.text() == "search" }
    println searchDiv
    searchDiv.div.ol.li.each { println it }

So is this just not possible? Is Google spoofing me and sending me garbage data, or do I need to tune my HTTPBuilder setup some more? Any ideas?
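
One thing that may be worth checking in the parsing code above: the asserts address the tree as html.HEAD and html.BODY, which is consistent with HTTPBuilder's HTML parser (NekoHTML) reporting element names in upper case. If that is the case here, lower-case lookups such as html.body.div would match nothing. The following is a minimal offline sketch of that effect, not a test against Google: the `page` string is a hypothetical stand-in that mirrors the structure listed above, and it is parsed with XmlSlurper instead of a live HTTPBuilder response.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Hypothetical stand-in for the parsed response, using the upper-case
// element names that the question's asserts (html.HEAD, html.BODY) rely on.
def page = '''
<HTML>
  <HEAD><TITLE>stand-in page</TITLE></HEAD>
  <BODY id="gsr">
    <DIV id="main"><DIV><DIV id="cnt"><DIV id="rcnt"><DIV id="center_col">
      <DIV id="res"><DIV id="search"><DIV id="ires">
        <OL id="rso">
          <LI>first result</LI>
          <LI>second result</LI>
        </OL>
      </DIV></DIV></DIV>
    </DIV></DIV></DIV></DIV></DIV>
  </BODY>
</HTML>
'''.trim()

def html = new XmlSlurper().parseText(page)

// GPath lookups are case-sensitive: the lower-case path finds nothing.
println "lower-case body matches: ${html.body.size()}"
println "upper-case BODY matches: ${html.BODY.size()}"

// A depth-first search by id avoids spelling out every intermediate DIV.
def resultList = html.depthFirst().find { it.name() == 'OL' && it.@id.text() == 'rso' }
def results = resultList.LI.collect { it.text().trim() }
results.each { println it }
```

Searching by id with depthFirst() is also less brittle than a fixed div.div.div chain, since the chain breaks whenever one wrapper element is added or removed.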

Comments (1)

多彩岁月 2024-12-13 02:12:09

You didn't mention the search URL you were using, so I can't speak to why you were getting 403s. The following code does a search with the standard Google site, and works for me without any Forbidden or other status errors:

    @Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.5.1')

    import static groovyx.net.http.Method.GET
    import static groovyx.net.http.ContentType.*

    def http = new groovyx.net.http.HTTPBuilder('http://www.google.com')

    def queryTerms = ['queen', 'of', 'hearts']

    http.request(GET, HTML) { req ->
        uri.path = '/search'
        uri.query = [q: queryTerms.join('+'), hl: 'en']

        headers.'User-Agent' = 'Mozilla/5.0'

        response.success = { resp, html ->
            println "Site title: ${html.HEAD.TITLE.text()}"
        }
        response.failure = { resp ->
            println resp.statusLine
        }
    }

It outputs the site title, which shows that the HTML is being parsed successfully:

Site title: queen+of+hearts - Google Search
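
A side note on the title printed above: it contains literal '+' characters, which suggests that the '+' from queryTerms.join('+') was percent-encoded on the way out and treated as part of the search text. This is an assumption based on that output, not something verified against Google. The sketch below uses java.net.URLEncoder as a stand-in for the encoding that HTTPBuilder is assumed to apply to uri.query values, to show the difference between joining with '+' and joining with a space.

```groovy
// Query terms taken from the example above.
def queryTerms = ['queen', 'of', 'hearts']

// Joining with '+' before encoding: the '+' itself is escaped as %2B,
// so the server sees one term containing plus signs.
def preJoined = URLEncoder.encode(queryTerms.join('+'), 'UTF-8')

// Joining with a space: the encoder emits '+' as the separator,
// so the server sees three separate words.
def spaceJoined = URLEncoder.encode(queryTerms.join(' '), 'UTF-8')

println preJoined    // queen%2Bof%2Bhearts
println spaceJoined  // queen+of+hearts
```

If the assumption holds, passing q: queryTerms.join(' ') in the query map would send the terms as separate words.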
