How do I clean the HTML content of a JTextPane/JEditorPane into a string in Java?

Posted 2024-10-28 02:45:29 · 1258 characters · 4 views · 0 comments

I am trying to get clean text content from a JTextPane. Here is some example code:

JTextPane textPane = new JTextPane ();
textPane.setContentType ("text/html");
textPane.setText ("This <b>is</b> a <b>test</b>.");
String text = textPane.getText ();
System.out.println (text);

The text looks like this in the JTextPane:

This is a test.

This is what gets printed to the console:

<html>
  <head>

  </head>
  <body>
    This <b>is</b> a <b>test</b>.
  </body>
</html>

I've used substring() and/or replace(), but that is awkward to use:

String text = textPane.getText ().replace ("<html> ... <body>\n    ", "");

Is there a simple function that removes every tag except the <b> tags (and their content) from a string?

Sometimes JTextPane adds <p> tags around the content, so I want to get rid of those as well.

Like this:

<html>
  <head>

  </head>
  <body>
    <p style="margin-top: 0">
      hdfhdfgh
    </p>
  </body>
</html>

I want to get only the text content together with its <b> tags:

This <b>is</b> a <b>test</b>.


Comments (4)

习惯那些不曾习惯的习惯 2024-11-04 02:45:29

I subclassed HTMLWriter and overrode startTag and endTag to skip all tags outside of <body>.

I have not tested it much, but it seems to work. One drawback is that the output string contains quite a lot of whitespace; getting rid of that shouldn't be too hard.

import java.io.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class Foo {

    public static void main(String[] args) throws Exception {
        JTextPane textPane = new JTextPane();
        textPane.setContentType("text/html");
        textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");

        StringWriter writer = new StringWriter();
        HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();

        HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
        htmlWriter.write();

        System.out.println(writer.toString());
    }

    private static class OnlyBodyHTMLWriter extends HTMLWriter {

        public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
            super(w, doc);
        }

        private boolean inBody = false;

        private boolean isBody(Element elem) {
            // copied from HTMLWriter.startTag()
            AttributeSet attr = elem.getAttributes();
            Object nameAttribute = attr
                    .getAttribute(StyleConstants.NameAttribute);
            HTML.Tag name = null;
            if (nameAttribute instanceof HTML.Tag) {
                name = (HTML.Tag) nameAttribute;
            }
            return name == HTML.Tag.BODY;
        }

        @Override
        protected void startTag(Element elem) throws IOException,
                BadLocationException {
            if (inBody) {
                super.startTag(elem);
            }
            if (isBody(elem)) {
                inBody = true;
            }
        }

        @Override
        protected void endTag(Element elem) throws IOException {
            if (isBody(elem)) {
                inBody = false;
            }
            if (inBody) {
                super.endTag(elem);
            }
        }
    }
}
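
The whitespace mentioned above could be removed with a small post-processing step. The following is only a sketch: the class and method names (WhitespaceCleanup, collapseWhitespace) are made up for illustration, and it assumes that collapsing every run of whitespace to a single space is acceptable, which would not hold for content inside <pre> tags.

```java
final class WhitespaceCleanup {

    // Hypothetical helper: collapses each run of spaces, tabs and newlines
    // into one space and trims both ends of the string.
    // Assumption: the markup has no <pre> sections whose layout must be kept.
    static String collapseWhitespace(String html) {
        return html.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "\n    <p>\n      This\n    </p>\n    <b>is</b> a <b>test</b>.\n";
        System.out.println(collapseWhitespace(raw));
    }
}
```

In the answer's code this would be applied to writer.toString() before printing.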
静赏你的温柔 2024-11-04 02:45:29

You could use the HTML parser that JEditorPane itself uses: ParserDelegator (in javax.swing.text.html.parser), driven through an HTMLEditorKit.ParserCallback.

See this example, and the API docs.
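
A minimal sketch of the parser-based approach, under some assumptions: the class and method names (KeepBoldOnly, keepBoldOnly) are invented for this example, only <b> tags are kept, and the text is emitted exactly as the parser reports it, so character entities arrive already decoded and whitespace may differ slightly from the input.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class KeepBoldOnly {

    // Hypothetical helper: parses the HTML and rebuilds a string that
    // contains only the text and the <b>...</b> tags.
    static String keepBoldOnly(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B) {
                    out.append("<b>");
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B) {
                    out.append("</b>");
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text is delivered with entities already decoded.
                out.append(data);
            }
        };
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head></head><body><p style=\"margin-top: 0\">"
                + "This <b>is</b> a <b>test</b>.</p></body></html>";
        System.out.println(keepBoldOnly(html));
    }
}
```

With a JTextPane, the input would be textPane.getText(). To keep further tags, the two tag checks could be extended to a set of allowed HTML.Tag values.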

神经暖 2024-11-04 02:45:29

I found a solution to this problem using the substring and replace methods:

// Get textPane content to string
String text = textPane.getText();

// Then take a substring to remove the wrapping tags (html, head, body)
text = text.substring(44, text.length() - 19);

// Sometimes the program adds <p style="margin-top: 0"> and </p> tags, so remove them.
// This step is optional.
text = text.replace("<p style=\"margin-top: 0\">\n      ", "").replace("\n    </p>", "");

// Convert any HTML escape sequences back to characters, for example &amp; -> &
text = StringEscapeUtils.unescapeHtml(text);

StringEscapeUtils is part of the Apache commons-lang library; it converts escaped characters back to their normal form. Thanks to Ozhan Duz for the suggestion.

(commons-lang - download)
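
The fixed offsets 44 and 19 depend on the exact wrapper that getText() produces. A variant that locates the <body> tags instead might look like the sketch below. The names (BodyExtractor, bodyContent) are hypothetical, and it assumes the string contains at most one lower-case <body>...</body> pair, as in the output shown in the question.

```java
final class BodyExtractor {

    // Hypothetical helper: returns the trimmed text between <body ...> and </body>.
    // If no such pair is found, the trimmed input is returned unchanged.
    static String bodyContent(String html) {
        int open = html.indexOf("<body");
        int start = open < 0 ? -1 : html.indexOf('>', open);
        int end = html.lastIndexOf("</body>");
        if (start < 0 || end < 0 || end <= start) {
            return html.trim();
        }
        return html.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n  <head>\n\n  </head>\n  <body>\n"
                + "    This <b>is</b> a <b>test</b>.\n  </body>\n</html>\n";
        System.out.println(bodyContent(html));
    }
}
```

The replace call for the <p> tags and the unescape step from the answer could then be applied to the returned string as before.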

メ斷腸人バ 2024-11-04 02:45:29
String text = textPane.getDocument().getText(0, textPane.getDocument().getLength());
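
Reading the text from the Document returns the plain characters only, so the <b> tags that the question wants to keep are dropped as well. The sketch below illustrates this without a visible component by loading the markup into an HTMLDocument through HTMLEditorKit; the class and method names (PlainTextDemo, plainText) are made up for the example, and the exact leading or trailing newlines in the result are not guaranteed.

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

final class PlainTextDemo {

    // Hypothetical helper: loads the HTML into a document model and
    // returns the model's text, which contains no markup at all.
    static String plainText(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(html), doc, 0);
        // Document.getText takes an offset and a length in document
        // characters, not the length of the HTML source string.
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(plainText("This <b>is</b> a <b>test</b>."));
    }
}
```

This is suitable when only the visible text is needed; to keep selected tags, one of the other approaches in this thread is required.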