Extracting content from a URL using Jsoup

Posted on 2024-12-02 10:31:28

I want to extract content from a URL. Below is the content I want to extract using Jsoup.

Specifically, I want to extract the text of <div id="content">:

    <div id="content">
                  <!-- Start: Header -->
<div id="content-group_header-0" class="header content-podgroup-wrapper content-podgroup-0-0 first-pod">

    <h4 class="h4 no-font-replace ">
    Your imagination knows no borders or boundaries.  </h4>


  </div>
<!-- End: Header -->

<!-- Start: Open HTML -->
<div id="content-group_open_html-0" class="open-html wysiwyg content-podgroup-wrapper content-podgroup-0-0">

  <div class="open-html"><p>And we say, let it run wild. Because in that freedom is the shape of what’s to come.</p>
<p>Hello</p>
<p>Whatever it is you do, our pathway to innovation begins and ends with you.</p>
</div>
  </div>

<!-- End: Open HTML -->
<!-- Start: Header -->
<div id="content-group_header-1" class="header content-podgroup-wrapper content-podgroup-1-1">

    <h4 class="h4 no-font-replace ">
    Dreams grow, but not without nurturing.  </h4>

  </div>
<!-- End: Header -->

<!-- Start: Open HTML -->
<div id="content-group_open_html-1" class="open-html wysiwyg content-podgroup-wrapper content-podgroup-1-1">

  <p>In our fiscal year 2010, we invested 23% of our gross revenue or $2.549 billion in R&D—an investment that has increased every year since 2000. It’s how we’ve been able to stay ahead of the curve and deliver on the promise of new wireless technologies:</p>

<ul class="supporting-content">
<li> Hello World</li>

<li>We were the first to produce a single</li>

<li>We were the first to commercialize a chipset</li>

<li>We were the first to deliver GHz processing power integrated with 3G</li>

<li>We are the first to produce a laptop solution</li>

<li>We are the first to commercialize</li>

</ul>
<p>In fact, with a current intellectual property portfolio consisting of more than 77,000 patents granted and pending</p>
  </div>
<!-- End: Open HTML -->
<!-- Start: Header -->
<div id="content-group_header-2" class="header content-podgroup-wrapper content-podgroup-2-2">

    <h4 class="h4 no-font-replace ">
    Our partnerships are our most valuable assets.  </h4>

  </div>
<!-- End: Header -->

<!-- Start: Open HTML -->
<div id="content-group_open_html-2" class="open-html wysiwyg content-podgroup-wrapper content-podgroup-2-2">

<p></p>
<p></p>
</div>

  </div>
<!-- End: Open HTML -->
<!-- Start: Header -->
<div id="content-group_header-3" class="header content-podgroup-wrapper content-podgroup-3-3">

    <h4 class="h4 no-font-replace ">
    So where does the pathway to innovation lead?  </h4>

  </div>
<!-- End: Header -->

<!-- Start: Open HTML -->
<div id="content-group_open_html-3" class="open-html wysiwyg content-podgroup-wrapper content-podgroup-3-3 last-pod">

  <div class="open-html"><p>We’re eager to find out ourselves, because if the pathway begins with you, then it’s up to you to decide where it’s headed. So let’s explore the possibilities. Let’s never stop discovering.</p>
<p>Working together, the pathway to innovation can go as far as our dreams will take us.</p>
</div>
  </div>
<!-- End: Open HTML -->
                </div><!-- END: content -->

Can anybody explain how this can be done using Jsoup?

Comments (1)

遗心遗梦遗幸福 2024-12-09 10:31:28

Why would you use regular expressions for this? Jsoup is an HTML parser that works with CSS selectors. Just use a proper CSS selector. I understand that you want to select <div id="content">? Use the CSS ID selector, which has the form #id, for this.

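import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
// Fetch and parse the page, then select the element with id="content".
Document document = Jsoup.connect("http://www.host.com/domain").get();
Element content = document.select("#content").first(); // null if nothing matches
System.out.println(content.html()); // inner HTML of the selected element
// ...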

Or if you want to get the text only, use Element#text() instead of Element#html():

System.out.println(content.text()); 
// ...
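
To try the two calls above without a network connection, here is a minimal sketch that parses an inline string instead of fetching a URL. It assumes the jsoup library is on the classpath. The class name `ContentTextExample`, the helper `contentText`, and the shortened HTML string are made up for illustration and only stand in for the markup in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContentTextExample {

    // Hypothetical helper: returns the text of the element with id="content",
    // or an empty string when the document has no such element.
    static String contentText(String html) {
        Document document = Jsoup.parse(html);
        Element content = document.select("#content").first(); // null if nothing matches
        return content == null ? "" : content.text();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup in the question.
        String html = "<div id=\"content\">"
                + "<h4 class=\"h4\"> Dreams grow, but not without nurturing. </h4>"
                + "<ul><li> Hello World</li><li>We were the first to produce a single</li></ul>"
                + "</div>";
        // Prints the text of all nested elements with the tags removed.
        System.out.println(contentText(html));
    }
}
```

Checking for null before calling text() avoids a NullPointerException on pages where the selector matches nothing.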

Read the Jsoup documentation on selectors thoroughly to learn how to use CSS selectors in Jsoup.

You shouldn't think about using regexps to parse HTML.
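
One way to see why, as a rough sketch using only the standard library: a naive pattern for the content div stops at the first nested closing tag. The class name `RegexPitfall`, the method `naiveExtract`, and the sample markup are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfall {

    // Naive attempt: capture everything between <div id="content"> and the next </div>.
    static String naiveExtract(String html) {
        Pattern pattern = Pattern.compile("<div id=\"content\">(.*?)</div>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\">"
                + "<div class=\"header\"><h4>First</h4></div>"
                + "<div class=\"open-html\"><p>Second</p></div>"
                + "</div>";
        // The lazy match ends at the first nested </div>, so "Second" is lost.
        System.out.println(naiveExtract(html));
        // prints: <div class="header"><h4>First</h4>
    }
}
```

Making the quantifier greedy does not help in general either, because it then runs to the last closing div on the page. A parser tracks the nesting for you.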
