SAXReader does not re-escape characters

Posted on 2024-08-21 16:41:58

I'm reading an XML file with dom4j. The file looks like this:

...
<Field>&#13;&#10; hello, world...</Field>
...

I read the file into a Document with SAXReader. When I call getText() on the node, I get the following String:

\r\n hello, world...

I do some processing and then write another file using asXML(). The characters are not escaped as they were in the original file, which causes errors in the external system that consumes the file.

How can I escape these special characters so that &#13;&#10; appears in the written file?

Comments (4)

沐歌 2024-08-28 16:41:58

You cannot do this easily. Those aren't 'escapes'; they are numeric character references (often loosely called 'character entities'), and they are a fundamental part of XML: the parser resolves them before your code ever sees the text. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the kind that are defined in a DTD.
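
To illustrate the point, here is a minimal sketch that uses the JDK's built-in DOM parser rather than dom4j (an assumption made only to keep the example dependency-free; CharRefDemo and rootText are hypothetical names). It shows that the references in the question's sample arrive as real carriage-return and line-feed characters once parsing is done:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {
    // Parses the given XML and returns the text content of its root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = rootText("<Field>&#13;&#10; hello, world...</Field>");
        // Show the numeric codes of the first two characters of the parsed text.
        System.out.println((int) text.charAt(0) + " " + (int) text.charAt(1)); // prints "13 10"
    }
}
```

The parsed string carries no trace of how the characters were originally written, which is why a writer cannot simply reproduce the original spelling.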

吻风 2024-08-28 16:41:58

It depends on what you're getting and what you want (see my previous comment).

The SAX reader is doing nothing wrong: your XML is giving you literal carriage-return and newline characters. If you control this XML, then instead of the newline characters you will need to insert a \ (backslash) character followed by the "r" or "n" character (or both).

If you do not control this XML, then you will need to convert the newline characters to the literal text "\r\n" after you've gotten your string back. In C# it would be something like:

myString = myString.Replace("\r\n", "\\r\\n");

葬心 2024-08-28 16:41:58

XML entities are abstracted away in the DOM. Content is exposed as a String without the need to bother about the encoding, which in most cases is what you want.

But SAX has some support for controlling how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity and pass it as a parameter to the SAXReader. But I fear it may not work, because the Javadoc for that method says:

"The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element)"

Otherwise, you could try to configure a LexicalHandler for SAX so that you are notified when an entity is encountered. The Javadoc for LexicalHandler#startEntity says:

"Report the beginning of some internal and external XML entities."

You will not be able to change how the entities are resolved, but that may still help.
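
Whether startEntity fires for numeric character references such as &#13; is worth checking before building on this idea. The following is a minimal sketch for doing so, assuming the JDK's built-in SAX parser with its default settings instead of dom4j's SAXReader (EntityProbe and countEntityEvents are hypothetical names; other parsers or feature settings may behave differently):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class EntityProbe {
    // Counts how many times the parser reports the start of an entity.
    static final class Counter extends DefaultHandler2 {
        int entityEvents = 0;

        @Override
        public void startEntity(String name) {
            entityEvents++;
        }
    }

    // Parses the XML with a LexicalHandler registered and returns the event count.
    static int countEntityEvents(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Counter counter = new Counter();
        reader.setContentHandler(counter);
        // Standard SAX2 property id used to register a LexicalHandler.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", counter);
        reader.parse(new InputSource(new StringReader(xml)));
        return counter.entityEvents;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countEntityEvents("<Field>&#13;&#10; hello, world...</Field>"));
    }
}
```

If the count comes back as zero for the question's sample, the handler never sees the character references, and this route cannot be used to preserve them.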

EDIT

You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See "reading an XML file" and "writing an XML file". Don't use asXML() and dump the file yourself.

import java.io.FileOutputStream;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

// doc is the org.dom4j.Document that was read with SAXReader
FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.flush();
writer.close();
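
The answer does not say how XMLWriter serializes carriage returns and line feeds, and that may depend on the dom4j version and the chosen OutputFormat. If the external system insists on numeric references, one option is to escape the text yourself at write time. Below is a minimal sketch of such an escaping step in plain Java; CharRefEscaper and escapeText are hypothetical names, not part of dom4j, and how to hook the helper into a particular writer is left open:

```java
final class CharRefEscaper {
    private CharRefEscaper() {}

    // Escapes XML markup characters and writes CR/LF as numeric character references.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;"); break;
                case '<':  out.append("&lt;");  break;
                case '>':  out.append("&gt;");  break;
                case '\r': out.append("&#13;"); break;
                case '\n': out.append("&#10;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Element text as returned by getText() in the question.
        System.out.println(escapeText("\r\n hello, world..."));
    }
}
```

The result must be written out verbatim; passing it through a writer that escapes & again would turn the references into &amp;#13; and &amp;#10;.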
糖粟与秋泊 2024-08-28 16:41:58

You can pre-process the input stream to replace & with a placeholder such as [$AMPERSAND_CHARACTER$], do the work with dom4j, and then post-process the output stream to substitute the & back.

Example (using streamflyer):

import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;

// Pre-process: hide every & from the parser behind a placeholder
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");

// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...

// Post-process: turn the placeholder back into & on the way out
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");

// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();

You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.
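
The effect of the placeholder can be seen without streamflyer or dom4j. The sketch below is a simplified stand-in, assuming the whole document fits in a String and using the JDK's DOM parser; AmpersandRoundTrip, protect, restore and protectedRootText are hypothetical names:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AmpersandRoundTrip {
    static final String PLACEHOLDER = "[$AMPERSAND_CHARACTER$]";

    // Pre-process: hide every & so the parser sees no references at all.
    static String protect(String xml) {
        return xml.replace("&", PLACEHOLDER);
    }

    // Post-process: put the ampersands back.
    static String restore(String text) {
        return text.replace(PLACEHOLDER, "&");
    }

    // Parses the protected XML and returns the root element's text content.
    static String protectedRootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(protect(xml))));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String inMemory = protectedRootText("<Field>&#13;&#10; hello, world...</Field>");
        System.out.println(inMemory);          // text as the processing code would see it
        System.out.println(restore(inMemory)); // text after the back substitution
    }
}
```

One consequence of this design is that, while the document is in memory, the text contains the placeholder rather than real CR/LF characters, so any processing that inspects or edits that text has to account for it.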
