JSoup UserAgent: how do I set it?

Posted on 2024-11-18 15:41:17

I'm trying to parse the front page of Facebook with JSoup, but I always get the HTML for mobile devices and not the version served to normal browsers (in my case, Firefox 5.0).

I'm setting my user agent like this:

doc = Jsoup.connect(url)
      .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0")
      .get();

Am I doing something wrong?

EDIT:

I just parsed http://whatsmyuseragent.com/ and it looks like the user agent is working. Now it's even more confusing to me why http://www.facebook.com/ returns a different version to JSoup than to my browser. Both are using the same user agent.

I have now noticed this behaviour on some other sites too. If you could explain what the issue is, I would be more than happy.

情栀口红 2024-11-25 15:41:17

You might try setting the referrer header as well:

doc = Jsoup.connect("https://www.facebook.com/")
      .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
      .referrer("http://www.google.com")
      .get();

℉服软 2024-11-25 15:41:17

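import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// location is the URL of the page to fetch
Response response = Jsoup.connect(location)
        .ignoreContentType(true)    // do not fail on an unexpected Content-Type
        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
        .referrer("http://www.google.com")
        .timeout(12000)             // milliseconds
        .followRedirects(true)
        .execute();

// Parse the response body into a Document
Document doc = response.parse();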

User Agent

Use a recent user agent string. A complete list is available at http://www.useragentstring.com/pages/useragentstring.php.

Timeout

Also, don't forget to set a timeout, since downloading a page sometimes takes longer than the default timeout allows.

Referer

Set the Referer header to Google.

Follow redirects

Follow redirects so that the request reaches the final page.

execute() instead of get()

Use execute() to get the Response object, which lets you check the content type and status code when something goes wrong.

You can then parse the Response object to obtain the Document.

The full example is hosted on GitHub.

静若繁花 2024-11-25 15:41:17

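Facebook is most likely setting (and then expecting) certain cookies in its requests, and treats a request whose headers lack them as a bot, a mobile user, a limited browser, or something similar.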

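There are several questions about handling cookies with JSoup, but you may find it simpler to use HttpURLConnection or Apache's HttpClient and then pass the result to JSoup. An excellent write-up on everything you need to know: Using java.net.URLConnection to fire and handle HTTP requests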
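
As a rough illustration of the HttpURLConnection route, here is a minimal sketch that uses only the JDK. The class name BrowserLikeFetch, its method names, and the Accept and Accept-Language values are assumptions made for this example; they are not taken from the thread or from JSoup. The user agent is the one from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of JSoup or the original answer.
class BrowserLikeFetch {

    /** Header values a desktop Firefox might send; the exact values are assumptions. */
    static Map<String, String> browserHeaders() {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0");
        h.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        h.put("Accept-Language", "en-US,en;q=0.5");
        return h;
    }

    /** Makes HttpURLConnection store cookies and send them back on later requests. */
    static void enableCookies() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    /** Configures a request with the given headers; nothing is sent yet. */
    static HttpURLConnection prepare(String url, Map<String, String> headers) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(true);
            conn.setConnectTimeout(12_000); // milliseconds
            conn.setReadTimeout(12_000);    // milliseconds
            headers.forEach(conn::setRequestProperty);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and returns the body, assuming it is UTF-8 text. */
    static String fetch(HttpURLConnection conn) {
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling enableCookies() once, then fetch(prepare(url, browserHeaders())), returns the HTML as a string, which can be handed to JSoup with Jsoup.parse(html, url). Whether these headers and cookies are enough for a given site is something only experimentation will show.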

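One useful way to debug the difference between your browser and JSoup is Chrome's network inspector. You can add headers from the browser to JSoup one at a time until you get the behavior you want, then narrow down exactly which headers are required.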
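
The narrowing step can be sketched as a small loop. This is only an illustration: the class HeaderNarrowing, the method narrow, and the predicate givesDesktopPage are hypothetical names. In practice the predicate would send a request with the trial headers and check whether the desktop page came back; here it is left to the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper class, not part of JSoup or the original answer.
class HeaderNarrowing {

    /**
     * Starts from a header set that is known to work, tries dropping each
     * header in turn, and keeps a header only if the check fails without it.
     */
    static Map<String, String> narrow(Map<String, String> working,
                                      Predicate<Map<String, String>> givesDesktopPage) {
        Map<String, String> needed = new LinkedHashMap<>(working);
        for (String name : working.keySet()) {
            Map<String, String> trial = new LinkedHashMap<>(needed);
            trial.remove(name);
            if (givesDesktopPage.test(trial)) {
                needed = trial; // still works without this header, so drop it
            }
        }
        return needed;
    }
}
```

Each header costs one request, so this stays cheap for the dozen or so headers a browser typically sends. It assumes the server responds consistently to the same headers; a site that varies its response for other reasons would need repeated checks.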
