
How can I use the HTML parser with Apache Tika in Java to strip all HTML tags?

Posted 2024-10-26 16:24:06

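I downloaded the tika-core and tika-parser libraries, but I can't find example code for parsing an HTML document into a string. I need to remove all the HTML tags from a web page's source. What can I do? How do I code this with Apache Tika?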


柳若烟 2024-11-02 16:24:06

Do you want a plain-text version of an HTML file? If so, all you need is something like:

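        // Parse the file with Tika's HtmlParser and collect only the body text.
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = new FileInputStream("myfile.html")) {
            new HtmlParser().parse(input, handler, metadata, new ParseContext());
        }
        String plainText = handler.toString();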

When created with no constructor arguments or with a character limit, BodyContentHandler captures only the text of the HTML body and returns it to you.
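For reference, here is a minimal self-contained sketch of the same approach with the imports spelled out. It assumes tika-core and the Tika parsers jar are on the classpath; the class name `HtmlToText`, the method name `extractText`, and the sample markup are made up for illustration. The no-argument BodyContentHandler constructor caps the buffer at 100,000 characters and the parse fails with a SAXException once that is exceeded, so this sketch passes -1 to remove the cap.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika.
public class HtmlToText {

    // Returns the text of the HTML body with all tags removed.
    public static String extractText(String html)
            throws IOException, SAXException, TikaException {
        // -1 disables the write limit (the default is 100,000 characters).
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream input =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(input, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1><p>Plain <b>text</b> only.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The returned string keeps the body text but may contain extra whitespace and line breaks where block elements were, so trim or normalize it if you need a compact result.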


ˉ厌 2024-11-02 16:24:06

You can also use Tika's AutoDetectParser to parse any type of file, including HTML.
Here is a simple example:

    // AutoDetectParser detects the document type and delegates to the matching parser.
    try (InputStream input = new FileInputStream(new File(path))) {
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        // TikaCoreProperties.TITLE is used here because the older Metadata.TITLE constant was deprecated.
        System.out.println("Title: " + metadata.get(TikaCoreProperties.TITLE));
        System.out.println("Body: " + textHandler.toString());
    } catch (IOException | SAXException | TikaException e) {
        // FileNotFoundException is a subclass of IOException, so it is handled here too.
        e.printStackTrace();
    }
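If you only need the text and not the metadata, the `Tika` facade class in tika-core wraps the same detect-and-parse steps in a single call. The sketch below is one way to use it, assuming the Tika parsers jar is also on the classpath; the class name `FacadeExample` and the method name `toPlainText` are invented for this example. `parseToString` truncates its result at 100,000 characters by default, so the sketch lifts that limit with `setMaxStringLength(-1)`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class, not part of Tika.
public class FacadeExample {

    // Detects the document type and returns its text content without markup.
    public static String toPlainText(byte[] document) throws IOException, TikaException {
        Tika tika = new Tika();
        // -1 removes the default 100,000-character cap on the returned string.
        tika.setMaxStringLength(-1);
        try (InputStream input = new ByteArrayInputStream(document)) {
            return tika.parseToString(input);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello from Tika</p></body></html>";
        System.out.println(toPlainText(html.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The facade trades flexibility for brevity: it gives you the body text only, so stay with the explicit parser, handler, and Metadata objects when you also need the title or other properties.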
