jsoup Document differs from the page source

Posted on 2024-12-03 13:16:58

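I am having trouble parsing a web page with jsoup. If I view the page source in Chrome, all the content you would expect is there: the full page. However, when I connect to the URL with jsoup, the resulting Document contains only a small part of the page's HTML.

Any idea why this is?

Below is an example of the differences:

Page source: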

<title>Bath Rugby Official Website | First team</title> 
<base href="http://www.bathrugby.com/"></base> 
<meta name="description" content="The Official Website of Bath Rugby. The only place for in-depth player, match and club news." />
<meta property="og:title" content="Bath Rugby Official Website | First team" />
<meta property="og:url" content="http://www.bathrugby.com/fixtures-and-results/first-team/" />
<meta property="og:type" content="sports_team" />
<!-- <meta property="og:image" content="http://www.bathrugby.com/_assets/img/common/bath_rugby_logo.jpg" /> -->
<meta property="og:site_name" content="Bath Rugby" /> 
<meta property="fb:admins" content="100001941260923" /> 


<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Latest News - RSS feed" href="/rss/rss-all-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Match News - RSS feed" href="/rss/rss-match-news.rss" />
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Team News - RSS feed" href="/rss/rss-team-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Club News - RSS feed" href="/rss/rss-club-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Academy News - RSS feed" href="/rss/rss-academy-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Ladies News - RSS feed" href="/rss/rss-ladies-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Community News - RSS feed" href="/rss/rss-community-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Programmes and Projects - Inclusion - RSS feed" href="/rss/rss-programmes-and-projects-inclusion.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Programmes and Projects - Education - RSS feed" href="/rss/rss-programmes-and-projects-education.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Programmes and Projects - Sports - RSS feed" href="/rss/rss-programmes-and-projects-sport.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Corporate News - RSS feed" href="/rss/rss-corporate-news.rss" /> 
<link rel="alternate" type="application/rss+xml" title="Bath Rugby - Redevelopment News - RSS feed" href="/rss/rss-redevelopment-news.rss" /> 

jsoup Document:

<head> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
  <title>Bath Rugby - Fixtures and results</title> 
  <meta name="Author" content="Positive -&gt; http://www.positivestudio.co.uk" /> 
  <meta name="HandheldFriendly" content="True" /> 
  <meta name="viewport" content="width=device-width; initial-scale=1.0; minimum-scale=1.0; maximum-scale=1.0; user-scalable=0;" /> 
   <meta name="google-site-verification" content="7dRcSqMI5R3Rag9iNiHeFei6ycyBTd3dmhlZvF8QdhA" /> 
   <link rel="Shortcut Icon" type="image/ico" href="/_assets/favicon.ico" /> 
  <link rel="apple-touch-icon-precomposed" href="/_assets/img/icons/apple-touch-icon.png" /> 
   <link rel="stylesheet" type="text/css" media="all" href="/_assets/css/mobile.css" /> 
   <link rel="stylesheet" type="text/css" href="/_assets/css/retina.css" media="only screen and (-webkit-min-device-pixel-ratio: 2)" /> 
  <script type="text/javascript" src="/_assets/js/jquery-1.4.2.min.js"></script> 
   <script src="/_assets/js/css_browser_selector.js" type="text/javascript"></script> 
  <script type="text/javascript" src="/_assets/js/scripts.js"></script> 
  <script type="text/javascript">

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-24788688-3']);
    _gaq.push(['_trackPageview']);

  (function() {

Comments (2)

生寂 2024-12-10 13:16:58

It looks like the website detects that you are browsing from an Android device and serves a mobile-friendly page with less content. The jsoup output is consistent with this: its head declares HandheldFriendly and loads /_assets/css/mobile.css. Compare what you get at http://www.bathrugby.com in Chrome with what you get in your Android browser.

You may be able to influence what gets served to jsoup by setting a desktop User-Agent on the connection, for example Jsoup.connect(url).userAgent("Mozilla/4.0").
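To make the User-Agent idea concrete, here is a minimal sketch that builds a request identifying itself as a desktop browser. It uses the JDK's own java.net.http client (Java 11+) rather than jsoup so that it runs without extra libraries; that substitution, the class name DesktopFetch, and the exact User-Agent string are all assumptions for illustration. With jsoup itself, the corresponding call is Connection.userAgent(String).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class DesktopFetch {
    // Illustrative desktop User-Agent (an assumption; any desktop browser string should do).
    static final String DESKTOP_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Builds a GET request that presents itself as a desktop browser.
    static HttpRequest desktopRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", DESKTOP_UA)
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a String.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        return client.send(desktopRequest(url), HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // No network call here: only show the header that would be sent.
        HttpRequest request =
                desktopRequest("http://www.bathrugby.com/fixtures-and-results/first-team/");
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

The string returned by fetchHtml could then be handed to Jsoup.parse(html, url). Whether the site actually switches on the User-Agent header is something to verify by comparing the two responses.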

聽兲甴掵 2024-12-10 13:16:58

I solved this by retrieving the page's source HTML as a string and then passing it to jsoup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;

// url (String) and doc (org.jsoup.nodes.Document) are declared elsewhere.
// Fetch the raw HTML with Apache HttpClient instead of Jsoup.connect().
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(url);
HttpResponse response = client.execute(request);

// Read the response body line by line; UTF-8 is assumed here.
StringBuilder str = new StringBuilder();
BufferedReader reader = new BufferedReader(
        new InputStreamReader(response.getEntity().getContent(), "UTF-8"));
try {
    String line;
    while ((line = reader.readLine()) != null) {
        // readLine() strips the line terminator, so add it back.
        str.append(line).append('\n');
    }
} finally {
    reader.close();
}
String html = str.toString();

// Parse the downloaded string with jsoup.
doc = Jsoup.parse(html);
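One detail worth knowing when reading a response line by line: BufferedReader.readLine() strips line terminators, so unless they are appended again, adjacent lines run together, which can matter inside script or pre elements. As a hedged alternative, the sketch below reads the whole stream in one call with InputStream.readAllBytes() (Java 9+). The class name StreamToString and the UTF-8 charset are assumptions; use the charset the server actually declares.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamToString {
    // Reads the entire stream as UTF-8 text, keeping line breaks intact,
    // and closes the stream when done.
    static String readAll(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for response.getEntity().getContent(); no network needed.
        InputStream sample = new ByteArrayInputStream(
                "<pre>line one\nline two</pre>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(sample));
    }
}
```

The resulting string can be passed to Jsoup.parse(html) exactly as in the answer above.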