First paragraph of a Wikipedia article

Posted on 2024-12-18 09:25:03 | Words: 80 | Views: 3 | Comments: 0

I'm writing some Java code to run NLP tasks on text from Wikipedia. How can I use JSoup to extract the first paragraph of a Wikipedia article?

Thanks a lot.

ゃ懵逼小萝莉 2024-12-25 09:25:03

It is very simple, and the process is quite similar for every semi-structured page from which you are extracting information.

First, you have to uniquely identify the DOM element that contains the information you need. The easiest way to do this is to use a web development tool, such as Firebug in Firefox, or the developer tools that come bundled with IE (> 6, I think) and Chrome.

Using the article Potato as an example, you will find that the paragraph (<p>) you are interested in is in the following block:

<div class="mw-content-ltr" lang="en" dir="ltr">
  <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
  <div class="dablink">[...]</div>
  <div class="dablink">[...]</div>
  <div>[...]</div>
  <p>The potato [...]</p>
  <p>[...]</p>
  <p>[...]</p>
  [...]
</div>

In other words, you want to find the first <p> element that is inside the div with a class called mw-content-ltr.

Then you just need to select that element with jsoup, for example using its selector syntax (which is very similar to jQuery's):

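import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
  private final String baseUrl;

  public WikipediaParser(String lang) {
    // The language code selects the Wikipedia edition, e.g. "en" -> en.wikipedia.org
    this.baseUrl = String.format("https://%s.wikipedia.org/wiki/", lang);
  }

  public String fetchFirstParagraph(String article) throws IOException {
    String url = baseUrl + article;
    Document doc = Jsoup.connect(url).get();
    // All <p> elements nested inside the element with class mw-content-ltr
    Elements paragraphs = doc.select(".mw-content-ltr p");

    // first() returns null when nothing matched, so guard before calling text()
    Element firstParagraph = paragraphs.first();
    return firstParagraph == null ? "" : firstParagraph.text();
  }

  public static void main(String[] args) throws IOException {
    WikipediaParser parser = new WikipediaParser("en");
    String firstParagraph = parser.fetchFirstParagraph("Potato");
    System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
  }
}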
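
One detail worth noting: the parser builds the address as baseUrl + article, which works for a one-word title like Potato but assumes the title is already URL-safe. Below is a minimal sketch of how a title with spaces or non-ASCII characters might be prepared first. The class name WikiTitles and the method toPathSegment are hypothetical and not part of the answer above; the sketch relies on Wikipedia article URLs using underscores in place of spaces.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikiTitles {
  // Hypothetical helper: turns a human-readable title into a URL path segment.
  static String toPathSegment(String title) {
    // Wikipedia article URLs use underscores where the title has spaces.
    String underscored = title.trim().replace(' ', '_');
    // Percent-encode everything else as UTF-8. URLEncoder targets form
    // encoding, but with spaces already replaced it never emits '+' for one.
    return URLEncoder.encode(underscored, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(toPathSegment("Potato"));
    System.out.println(toPathSegment("United States"));
  }
}
```

The result could then be passed to fetchFirstParagraph in place of the raw title.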

甜是你 2024-12-25 09:25:03

It seems like the first paragraph is also the first <p> block in the document. So this might work:

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/B-tree").get();
Elements paragraphs = doc.select("p");
Element firstParagraph = paragraphs.first();

Now you can get the content of this element, for example with firstParagraph.text().

谜兔 2024-12-25 09:25:03

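The solution proposed by Silva works for most cases, except for articles such as "JavaScript" and "United States". The paragraphs should be selected as doc.select(".mw-body-content p");

Check this GitHub code for more details. You could also remove some metadata from the HTML to improve accuracy.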
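
The .mw-body-content p selector and the metadata cleanup could be combined roughly as follows. This is only a sketch and is not taken from the linked GitHub code: the class name FirstParagraph, the method firstParagraphText, and the choice of what counts as "metadata" (here, citation markers assumed to be rendered as sup elements with class reference) are all assumptions. It parses an inline HTML string so it can be tried without a network call.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraph {
  // Returns the text of the first non-empty paragraph, or "" if there is none.
  static String firstParagraphText(Document doc) {
    // Assumption: citation markers such as [1] are <sup class="reference">.
    doc.select("sup.reference").remove();
    for (Element p : doc.select(".mw-body-content p")) {
      // Skip placeholder paragraphs that carry no visible text.
      if (p.hasText()) {
        return p.text();
      }
    }
    return "";
  }

  public static void main(String[] args) {
    // Hand-written sample markup, not copied from a real article.
    String html =
        "<div class=\"mw-body-content\">"
            + "<p class=\"mw-empty-elt\"></p>"
            + "<p>The potato is a starchy tuber.<sup class=\"reference\">[1]</sup></p>"
            + "<p>Second paragraph.</p>"
            + "</div>";
    System.out.println(firstParagraphText(Jsoup.parse(html)));
  }
}
```

For a live page, the same method could be given the Document returned by Jsoup.connect(url).get().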
