HTTPBuilder - How do I get the HTML content of a web page?

Posted on 2024-11-26 08:00:52

I need to extract the HTML of a web page. I'm using HTTPBuilder in Groovy and making the following GET request:

def http = new HTTPBuilder('http://www.google.com/search')
http.request(Method.GET) {
 requestContentType = ContentType.HTML
 response.success = { resp, reader ->
  println "resp: " + resp
  println "READER: " + reader
 }
 response.failure = { resp, reader ->
  println "Failure"
 }
}

The response I get does not contain the same HTML I see when I view the source of www.google.com/search. In fact, it is not HTML at all, and it does not contain the same information I see in the page source.
I've tried setting different headers (for example, headers.Accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', headers.Accept = 'text/html', setting the User-Agent, etc.), but the result is the same.
How can I get the HTML of www.google.com/search (or any web page) using HTTPBuilder?

南汐寒笙箫 2024-12-03 08:00:52

Why use HTTPBuilder? You could instead use

def url = "http://www.google.com/".toURL()

println url.text

to extract the content of the web page.
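
If the site responds differently depending on the request headers, the same plain-JDK approach can still be used, since Groovy's URL.getText also accepts a map of connection parameters. The following is only a sketch: the fetchText name, the timeout values, the User-Agent string and the UTF-8 charset are all assumptions chosen for illustration, not anything from the question.

```groovy
// Hypothetical helper: fetch a page body as text using only the Groovy JDK extensions.
def fetchText = { String address ->
    address.toURL().getText(
        [
            connectTimeout   : 5000,                            // milliseconds, assumed value
            readTimeout      : 10000,                           // milliseconds, assumed value
            requestProperties: ['User-Agent': 'Mozilla/5.0']    // placeholder header value
        ],
        'UTF-8'                                                 // assumed page encoding
    )
}
```

Calling fetchText('http://www.google.com/') should then return the page body as a String, without any parsing applied.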

[旋木] 2024-12-03 08:00:52

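This happens because HTTPBuilder automatically parses the response according to its content type.
To get the raw HTML, try reading the text from the response entity:

import static groovyx.net.http.ContentType.TEXT

// read the raw response body as text
def htmlResult = http.get(uri: url, contentType: TEXT) { resp ->
    return resp.getEntity().getContent().getText()
}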
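
Applied to the request(...) form used in the question, the same idea might look like the sketch below. It passes ContentType.TEXT as the response content type, so the success handler receives a Reader over the unparsed body instead of a parsed document. The fetchHtml name, the 0.7.1 library version and the header values are assumptions made for illustration.

```groovy
@Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.7.1')
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.ContentType.TEXT
import static groovyx.net.http.Method.GET

// Hypothetical helper (not part of HTTPBuilder): returns the raw body of a page as a String.
def fetchHtml = { String pageUrl ->
    def http = new HTTPBuilder(pageUrl)
    http.request(GET, TEXT) { req ->
        // Set Accept explicitly, since the TEXT content type otherwise asks for text/plain.
        headers.Accept = 'text/html'
        headers.'User-Agent' = 'Mozilla/5.0'   // placeholder value
        response.success = { resp, reader ->
            reader.text                        // read the whole unparsed body
        }
        response.failure = { resp ->
            throw new IOException("Request failed: ${resp.statusLine}")
        }
    }
}
```

Calling fetchHtml('http://www.google.com/search') should return the page source as a String. Note that requestContentType in the original snippet only describes the body of the outgoing request, so it has no effect on how the response is parsed.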
