Using CDATA in an XML file to parse HTML data

Posted on 2024-11-14 09:45:27

I have an XML file whose content includes malformed HTML. Since an XML parser cannot handle unclosed HTML tags such as <br>, I wrapped the HTML in a CDATA section for saving and parsing.

While parsing, I called documentBuilder.setCoalescing(true) to recover the data in <![CDATA[<br>test<br>data<br>]]> without the CDATA markers.

But in the output, < and > are replaced by &lt; and &gt; respectively.

I am expecting this string in the parsed result:

<br>test<br>data<br>

How can I do this? Any ideas?
Thanks in advance!

UPDATE: I have two follow-up questions.

1. Is there any way to turn malformed HTML (e.g. <br>) into parsable XML (e.g. <br/>) via code? If so, will it also handle &nbsp;?

2. Is there any solution to convert HTML text to plain text in Java (e.g. <div>test text</div> to test text)?

追星践月 2024-11-21 09:45:27

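Coalescing is the operation of converting the contents of CDATA sections (nodes) into text nodes and merging them with adjacent text nodes. Converting a CDATA section into a text node carries a restriction of its own: the resulting text node must consist of valid XML characters. This preserves the original document format; in other words, the node structure of the original document does not change.

The resulting behaviour concerns the five predefined entities, &lt;, &gt;, &amp;, &quot; and &apos;: the first three will be expanded, because leaving them unchanged would alter the document structure.

In short, you cannot do what you want by extracting the values from the DOM; after parsing the document, you need to decode the values into what you want. StringEscapeUtils (http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html#unescapeXml%28java.lang.String%29) has the required method.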

虚拟世界 2024-11-21 09:45:27

Coalescing means that the parser will convert CDATA nodes to Text nodes. When the document is serialized to XML, of course the text content (HTML) must be escaped. If you want to do something with the HTML you must first extract it as text--then you can render it in a browser, or whatever.
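
To make the difference between the in-memory text and the serialized XML concrete, here is a minimal sketch using only the JDK's built-in parser. The class name CoalescingDemo, the sample document, and the helper names are made up for illustration. It assumes setCoalescing is called on the DocumentBuilderFactory, which is where that method is declared.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CoalescingDemo {
    // Hypothetical sample input, modelled on the question.
    static final String XML = "<root><![CDATA[<br>test<br>data<br>]]></root>";

    // Parses the XML with coalescing on, so CDATA sections become Text nodes.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Reads the text straight from the DOM; no escaping is applied here.
    static String textOf(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    // Serializes the DOM back to XML; text content is escaped at this point.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(textOf(doc));    // the raw HTML string
        System.out.println(serialize(doc)); // the same text with < escaped as &lt;
    }
}
```

If the escaped form shows up in your result, it is worth checking whether the value was read from the DOM (textOf above) or taken from serialized output.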

UPDATE:

1) You can use JTidy, http://jtidy.sourceforge.net/index.html, to parse the HTML content and produce XML or XHTML. Something like this:

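import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder db = factory.newDocumentBuilder();
Document doc = db.parse(..); // parse your input document

// Obtain the HTML content; it may be buried deeper down
// or scattered around in different places
String text = doc.getDocumentElement().getTextContent();

// Parse with JTidy to convert from HTML to XHTML
Tidy tidy = new Tidy();
tidy.setXHTML(true);

// Build a DOM from the HTML and print it as indented XML
Document htmlDoc = tidy.parseDOM(new StringReader(text), null);
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.transform(new DOMSource(htmlDoc), new StreamResult(System.out));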

2) Yes. Once you have the parsed htmlDoc (above), you can traverse it, apply XPath, or use whatever else suits you to extract the text you want. Just remember that &nbsp; will be unescaped to '\u00A0'. So if you want really plain text, you should perhaps do

String s = text.replace('\u00A0', ' ');
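
As a rough illustration of the text-extraction step, the sketch below uses only the JDK and assumes the markup has already been tidied into well-formed XHTML, for example by JTidy as shown above. The class name PlainTextDemo and the method toPlainText are hypothetical. It also assumes the non-breaking space arrives as the character reference &#160;, because a plain XML parser does not know the HTML entity &nbsp; without a DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PlainTextDemo {
    // Returns the concatenated text of the document, with non-breaking
    // spaces (U+00A0) replaced by ordinary spaces.
    static String toPlainText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        String text = doc.getDocumentElement().getTextContent();
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toPlainText("<div>test text</div>"));      // test text
        System.out.println(toPlainText("<div>test&#160;text</div>")); // test text
    }
}
```

getTextContent() simply concatenates all descendant text, so block boundaries such as <br/> or <p> leave no whitespace behind; a real converter would probably want to walk the tree and insert line breaks itself.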

简单气质女生网名 2024-11-21 09:45:27

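If your only concern is that the markup is malformed, consider the tidy tool, which can help convert the HTML into well-formed XML.

In general, you need an XML parser that gives you access to the raw content of sections marked as CDATA, and then you can use that raw data for whatever you want.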
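
One way to get at the raw CDATA content with the JDK's own DOM parser is to leave coalescing off, so that CDATA sections stay as separate CDATASection nodes. This is only a sketch: the class name CdataDemo and the method firstCdata are hypothetical, and it looks only at direct children of the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CdataDemo {
    // Returns the raw content of the first CDATA section directly under
    // the root element, or null if there is none.
    static String firstCdata(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(false); // the default: keep CDATA nodes as they are
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.CDATA_SECTION_NODE) {
                return ((CDATASection) node).getData();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><![CDATA[<br>test<br>data<br>]]></root>";
        System.out.println(firstCdata(xml)); // <br>test<br>data<br>
    }
}
```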

停顿的约定 2024-11-21 09:45:27

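@Billu: You can take a look at the Apache Commons library class org.apache.commons.lang.StringEscapeUtils. It has escapeXml()/escapeHtml() and unescapeXml()/unescapeHtml() methods.
For example, for your first question about converting &lt; and &gt;, you can use unescapeHtml(yourData).

You may not even need to store or pass the data in a CDATA section: you can call escapeXml(data) on the sending/storing side and unescapeXml(data) on the receiving/retrieving side.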
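
The round trip described above can be sketched without any external dependency. The two helpers below are hand-written stand-ins, not the Commons Lang implementation: they cover only the five predefined XML entities and ignore numeric character references. The class name XmlEscapeDemo is made up for this example.

```java
class XmlEscapeDemo {
    // Escapes the five characters that have predefined XML entities.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverses escapeXml. "&amp;" is decoded last so that "&amp;lt;"
    // turns into "&lt;" rather than "<".
    static String unescapeXml(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<br>test<br>data<br>";
        String escaped = escapeXml(html);          // safe to store as element text
        System.out.println(escaped);               // &lt;br&gt;test&lt;br&gt;data&lt;br&gt;
        System.out.println(unescapeXml(escaped));  // <br>test<br>data<br>
    }
}
```

In a real project the library methods named in this answer would be the more robust choice, since they also deal with cases this sketch leaves out.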

For more information, see the following link:
StringEscapeUtils

Let me know if this information helps.
