How do I create a Jsoup selector for the following content?

Posted 2024-11-29 10:26:19

For example, I want to extract the text from this article's HTML:

    <div class="description">
            <div style="clear: none;" class="post-fb-like">
              <fb:like class=" fb_edge_widget_with_comment fb_iframe_widget" href="http://mashable.com/2011/08/07/3-handy-mobile-apps/" send="true" width="625" height="61"><span><iframe src="http://www.facebook.com/plugins/like.php?api_key=116628718381794&channel_url=http%3A%2F%2Fstatic.ak.fbcdn.net%2Fconnect%2Fxd_proxy.php%3Fversion%3D3%23cb%3Df138585052991e8%26origin%3Dhttp%253A%252F%252Fmashable.com%252Ff15a8eb75cc2b58%26relation%3Dparent.parent%26transport%3Dpostmessage&href=http%3A%2F%2Fmashable.com%2F2011%2F08%2F07%2F3-handy-mobile-apps%2F&layout=standard&locale=en_US&node_type=link&sdk=joey&send=true&show_faces=true&width=625" class="fb_ltr" title="Like this content on Facebook." style="border: medium none; overflow: hidden; height: 29px; width: 625px;" name="f2d40595a65cf36" id="f24fece5e565ec4" scrolling="no"></iframe></span></fb:like>
            </div>
                        <p><img src="http://ec.mashable.com/wp-content/uploads/2009/01/bizspark2.gif" alt="" align="left"><em>The <a href="http://mashable.com/tag/bizspark">Spark of Genius Series</a> highlights a unique feature of startups and is made possible by <a rel="nofollow" href="http://www.microsoftstartupzone.com/BizSpark/Pages/At_a_Glance.aspx?WT.mc_id=MSZ_Mashable_posts" target="_blank">Microsoft BizSpark</a>. If you would like to have your startup considered for inclusion, please see the details <a href="http://mashable.com/bizspark/">here</a>.</em></p>

<p><img src="http://5.mshcdn.com/wp-content/uploads/2011/08/mobile-devices.jpg" alt="" title="mobile devices" class="alignright" height="141" width="225">Each <a href="http://mashable.com/follow/topics/startup-weekend-roundup">weekend</a>, <em>Mashable</em> hand-picks startups we think are building interesting, unique or niche products. </p>
<p>This week, we’ve rounded up startups making mobile applications that bridge the physical and digital worlds for improved communication and enhanced experiences. </p>
<p>TransFire breaks down global communication barriers with its instant and automatic translation capabilities, while Babbleville facilitates neighbor-to-neighbor communication around events or topics. And, Picdish uses time and place to bring friends together over shared mobile food experiences.</p>
<hr>

I also have another HTML page I want to extract text from, but it's in a different format. I want to extract the article text from http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2

How would I go about creating a selector to extract the text no matter which article url is given?


盛夏已如深秋| 2024-12-06 10:26:19

How would I go about creating a selector to extract the text no matter which article url is given?

You can't. Every website has its own HTML structure. Open the page in a web browser, right-click, choose View Source, and look at the markup. You need a separate selector for each individual website.
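One way to organize "a separate selector per website" is a small lookup keyed by host name. The following is only a sketch: the class name `SiteSelectors` and the method `forUrl` are hypothetical, the CNN selector is the one used later in this answer, and `.description p` for mashable.com is an assumption based on the snippet in the question. It uses only the JDK, so it can be tried without Jsoup on the classpath.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry: host name -> CSS selector for that site's article text. */
final class SiteSelectors {
    // Entries are assumptions drawn from the two pages discussed in this thread;
    // every additional site needs its own entry after inspecting its markup.
    private static final Map<String, String> BY_HOST = Map.of(
            "mashable.com", ".description p",
            "cnn.com", ".cnn_strycntntlft p");

    private SiteSelectors() {}

    /** Returns the selector registered for the URL's host, or empty for unknown sites. */
    static Optional<String> forUrl(String url) {
        // URI.create throws IllegalArgumentException for syntactically invalid input.
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4); // treat www.cnn.com and cnn.com alike
        }
        return Optional.ofNullable(BY_HOST.get(host));
    }
}
```

The returned string could then be passed to Jsoup's `document.select(...)`; an empty result signals that no selector has been written for that site yet, which is a more honest failure than guessing.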

For your first example, assuming that the snippet is the whole HTML, the text sits inside the <p> tags. You can then use:

Document html = Jsoup.parse(yourHtmlString);
Elements paragraphs = html.select("p"); // all <p> elements in the document
String text = paragraphs.text(); // combined text of the matched elements
// ...

For your CNN site, according to the HTML source, you'd want all <p> elements inside the <div class="cnn_strycntntlft">, so this selector should do:

Document document = Jsoup.connect("http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2").get();
Elements paragraphs = document.select(".cnn_strycntntlft p"); // <p> elements inside the article container
String text = paragraphs.text(); // combined text of the matched elements
// ...

By the way, it would be easier to use their RSS feeds instead of parsing the whole HTML. Many news sites provide RSS feeds for exactly this purpose.
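As a rough illustration of the RSS route, the sketch below reads the title and description of each <item> in an RSS 2.0 document. It is not tied to any particular site: the class name `RssSummaries` is hypothetical, the feed is passed in as a string because no feed URL is given in this thread, and it uses the JDK's built-in DOM parser rather than Jsoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper that summarizes the items of an RSS 2.0 feed. */
final class RssSummaries {
    private RssSummaries() {}

    /** Returns "title: description" for every <item> element in the feed. */
    static List<String> summaries(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Precaution for untrusted feeds: reject documents that declare a DOCTYPE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document feed = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> result = new ArrayList<>();
            NodeList items = feed.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(childText(item, "title") + ": " + childText(item, "description"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS document", e);
        }
    }

    /** Trimmed text of the first descendant with the given tag name, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

Feed descriptions are often only a summary and may themselves contain HTML, so whether this replaces page scraping depends on how much of the article text the site's feed actually carries.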
