First paragraph of a Wikipedia article

Posted on 2024-12-18 09:25:03

I'm writing some Java code to perform NLP tasks on text from Wikipedia. How can I use JSoup to extract the first paragraph of a Wikipedia article?

Thanks a lot.

Comments (3)

ゃ懵逼小萝莉 2024-12-25 09:25:03

It is very simple, and the process is quite similar for every semi-structured page from which you are extracting information.

First, you have to uniquely identify the DOM element that contains the required information. The easiest way to do this is to use a web development tool, such as Firebug in Firefox, or the developer tools bundled with IE (> 6, I think) and Chrome.

Using the article Potato as an example, you will find that the paragraph (<p>) you are interested in is in the following block:

<div class="mw-content-ltr" lang="en" dir="ltr">
  <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div>
  <div class="dablink">[...]</div>
  <div class="dablink">[...]</div>
  <div>[...]</div>
  <p>The potato [...]</p>
  <p>[...]</p>
  <p>[...]</p>

In other words, you want to find the first <p> element that is inside the div with a class called mw-content-ltr.

Then you just need to select that element with jsoup, for example using its selector syntax (which is very similar to jQuery's):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
  private final String baseUrl;

  public WikipediaParser(String lang) {
    // lang is the Wikipedia language code, e.g. "en"
    this.baseUrl = String.format("https://%s.wikipedia.org/wiki/", lang);
  }

  public String fetchFirstParagraph(String article) throws IOException {
    String url = baseUrl + article;
    Document doc = Jsoup.connect(url).get();
    // All <p> elements inside the element with class mw-content-ltr
    Elements paragraphs = doc.select(".mw-content-ltr p");

    // first() returns null when nothing matches
    Element firstParagraph = paragraphs.first();
    return firstParagraph == null ? null : firstParagraph.text();
  }

  public static void main(String[] args) throws IOException {
    WikipediaParser parser = new WikipediaParser("en");
    String firstParagraph = parser.fetchFirstParagraph("Potato");
    System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
  }
}

甜是你 2024-12-25 09:25:03

It seems like the first paragraph is also the first <p> block in the document. So this might work:

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/B-tree").get();
Elements paragraphs = doc.select("p");
Element firstParagraph = paragraphs.first();

Now you can get the content of this element, for example with firstParagraph.text().

谜兔 2024-12-25 09:25:03

The solution proposed by Silva works in most cases, but not for articles such as "JavaScript" and "United States". Paragraphs should instead be selected with doc.select(".mw-body-content p");

Check this GitHub code for more details. You can also remove some metadata information from HTML to improve accuracy.
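
To make that concrete, here is a minimal sketch of the idea; it is not the GitHub code mentioned above. It selects with .mw-body-content p, skips paragraphs that contain no text, and drops citation markers before returning the text. The class name FirstParagraphExtractor, the method firstParagraphFromHtml, and the sup.reference selector for citation markers are my own assumptions for illustration, so check them against the HTML of the pages you actually process. It parses an inline HTML string so it can be tried without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup or of the answers above.
final class FirstParagraphExtractor {

    // Returns the text of the first non-empty paragraph inside
    // .mw-body-content, or null if no such paragraph exists.
    static String firstParagraphFromHtml(String html) {
        Document doc = Jsoup.parse(html);
        for (Element p : doc.select(".mw-body-content p")) {
            // Work on a copy so the parsed document is left untouched.
            Element copy = p.clone();
            // Assumption: citation markers such as [1] are <sup class="reference">.
            copy.select("sup.reference").remove();
            String text = copy.text().trim();
            if (!text.isEmpty()) {
                return text;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample markup that mimics an empty leading paragraph.
        String html = "<div class=\"mw-body-content\">"
                + "<p class=\"mw-empty-elt\"></p>"
                + "<p>The potato is a starchy tuber.<sup class=\"reference\">[1]</sup></p>"
                + "<p>Second paragraph.</p>"
                + "</div>";
        System.out.println(firstParagraphFromHtml(html));
    }
}
```

For a live page you would pass Jsoup.connect(url).get() results through the same loop instead of calling Jsoup.parse on a string. Skipping empty paragraphs matters because paragraphs.first() returns whatever <p> comes first, even if it holds no text.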
