Wikipedia: Java library to remove Wikipedia text markup

Posted 2024-09-01 17:24:03 · 117 words · 9 views · 0 comments

I downloaded a Wikipedia dump and now want to remove the Wikipedia markup from the contents of each page. I tried writing regular expressions, but there are too many cases to handle. I found a Python library, but I need a Java library because I want to integrate it into my code.

Thank you.

Comments (5)

枫以 2024-09-08 17:24:03

Do it in two steps:

  1. let some existing tool convert the MediaWiki mark-up into plain HTML;
  2. convert the plain HTML into text.

The following demo:

import net.java.textilej.parser.MarkupParser;
import net.java.textilej.parser.builder.HtmlDocumentBuilder;
import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.io.StringWriter;

public class Test {

    public static void main(String[] args) throws Exception {

        // Sample MediaWiki markup: italic, bold, a level-1 header and a bullet list.
        String markup = "This is ''italic'' and '''that''' is bold. \n"+
                "=Header 1=\n"+
                "a list: \n* item A \n* item B \n* item C";

        StringWriter writer = new StringWriter();

        // Step 1: let Textile-J render the markup as an HTML fragment (no <html>/<body> wrapper).
        HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
        builder.setEmitAsDocument(false);

        MarkupParser parser = new MarkupParser(new MediaWikiDialect());
        parser.setBuilder(builder);
        parser.parse(markup);

        final String html = writer.toString();
        final StringBuilder cleaned = new StringBuilder();

        // Step 2: walk the HTML with the JDK's Swing parser and keep only the text nodes.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    // A space is appended after every text chunk, hence the doubled spaces in the output.
                    cleaned.append(new String(data)).append(' ');
                }
        };
        new ParserDelegator().parse(new StringReader(html), callback, false);

        System.out.println(markup);
        System.out.println("---------------------------");
        System.out.println(html);
        System.out.println("---------------------------");
        System.out.println(cleaned);
    }
}

produces:

This is ''italic'' and '''that''' is bold. 
=Header 1=
a list: 
* item A 
* item B 
* item C
---------------------------
<p>This is <i>italic</i> and <b>that</b> is bold. </p><h1 id="Header1">Header 1</h1><p>a list: </p><ul><li>item A </li><li>item B </li><li>item C</li></ul>
---------------------------
This is  italic  and  that  is bold. Header 1 a list: item A item B item C 
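
The doubled spaces in the last line of output come from appending a space after every text chunk. As a rough sketch of the second step on its own (HTML to text), assuming only the JDK's Swing HTML parser is available, the text can be collected the same way and the whitespace collapsed afterwards. The class name HtmlToText and the method toPlainText are made up for this illustration and are not part of any library:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Textile-J or the JDK.
class HtmlToText {

    /** Collects the text nodes of an HTML fragment and collapses runs of whitespace. */
    static String toPlainText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so adjacent elements do not run together.
                text.append(data).append(' ');
            }
        };
        // ignoreCharSet = true: the input is already a Reader, so charset hints are irrelevant.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        // Collapse the extra spaces introduced above and trim both ends.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>This is <i>italic</i> and <b>that</b> is bold. </p>"
                + "<h1 id=\"Header1\">Header 1</h1>";
        System.out.println(toPlainText(html));
    }
}
```

Collapsing whitespace after the fact keeps the callback trivial; whether a space belongs between two particular chunks (for example inside a word that is partly bold) is something this sketch does not try to decide.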

Where can I download the Java packages you are importing?

Here: a Web Archive copy of download.java.net/maven/2/net/java/textile-j/2.2

单挑你×的.吻 2024-09-08 17:24:03

If you need plain text, use the WikiClean library: https://github.com/lintool/wikiclean

I had the same problem, and this was the only efficient solution that worked for me in Java.

There are two use cases:

1) When your text is not in XML format, you need to add the XML tags that the cleaner expects. For example, if an earlier step already parsed the XML file and you now have only the page content without its XML structure, wrap it in xmlStartTag and xmlEndTag as in the code below and it will be processed.

String xmlStartTag = "<text xml:space=\"preserve\">";
String xmlEndTag = "</text>";
String articleWithXml = xmlStartTag + article.getText() + xmlEndTag;
WikiClean cleaner = new WikiClean.Builder().build();
String plainWikiText = cleaner.clean(articleWithXml);

2) When you are reading the Wikipedia dump file directly (an XML file), you can pass its contents to the cleaner as they are.

WikiClean cleaner = new WikiClean.Builder().build();
String plainWikiText = cleaner.clean(XMLFileContents);
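
A full dump holds many pages, so in practice the pages usually have to be pulled out one at a time. The following is a minimal sketch of one way to do that with the JDK's StAX parser, assuming the usual dump layout of page elements containing a title and a text element. The class DumpPages and the method forEachPage are invented for this illustration and are not part of WikiClean. Text read this way comes back unescaped and without its enclosing text tags, which is the situation described in use case 1.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of WikiClean.
class DumpPages {

    /** Streams a dump and hands each page's title and raw wiki markup to the handler. */
    static void forEachPage(Reader in, BiConsumer<String, String> handler)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver each element's text as a single chunk and ignore DTDs.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader xml = factory.createXMLStreamReader(in);
        String title = null;
        while (xml.hasNext()) {
            if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String name = xml.getLocalName();
            if (name.equals("title")) {
                title = xml.getElementText();
            } else if (name.equals("text") && title != null) {
                // getElementText() returns the content with XML entities already decoded.
                handler.accept(title, xml.getElementText());
                title = null;
            }
        }
        xml.close();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny stand-in for a real dump file.
        String dump = "<mediawiki><page><title>Sample</title><revision>"
                + "<text xml:space=\"preserve\">''hello'' &amp; '''world'''</text>"
                + "</revision></page></mediawiki>";
        forEachPage(new StringReader(dump),
                (title, text) -> System.out.println(title + ": " + text));
    }
}
```

Streaming with a callback avoids loading the whole dump into memory; for a real file, the StringReader would be replaced by a reader over the (possibly decompressed) dump.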
隔纱相望 2024-09-08 17:24:03

Mylyn WikiText can convert various wiki syntaxes into HTML and other formats. It also supports MediaWiki syntax, which is what Wikipedia uses. Although Mylyn WikiText is primarily an Eclipse plugin, it is also available as a standalone library.

喜你已久 2024-09-08 17:24:03

Try the MediaWiki text to plain text approach. You will probably have to adapt the PlainTextConverter class to your needs.
Combined with the example for converting Wikipedia texts to HTML, you can also transclude template contents.

梦回梦里 2024-09-08 17:24:03

You can use Jwiki to get Wikipedia data. Example:

Jwiki jwiki = new Jwiki("elon musk");
System.out.println("Title :"+jwiki.getDisplayTitle()); //get title
System.out.println("Text : "+jwiki.getExtractText());  //get summary text
System.out.println("Image : "+jwiki.getImageURL());    //get image URL