XSLT transformation with character 8221
I'm transforming an XML document using javax.xml.transform.Transformer and XSLT. The document contains the characters “ and ” (Java integer codes 8220 and 8221, i.e. U+201C and U+201D). These are not the normal quotation marks.
When I transform the document, these characters are transformed into [...] and [...].
How do I convert these back into something that people can read? I tried reading the document with DOMReader and SAXReader using the encodings UTF-8, UTF-16, ASCII, etc. No luck.
Your help is very much appreciated.
Max.
Comments (3)
These are the Unicode characters U+201C and U+201D. Are you transforming to HTML? If so, and your XSLT specifies HTML output, I'd expect it to output &ldquo; and &rdquo;, as they're character entity references: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Quote from the XSLT spec:
http://www.w3.org/TR/xslt#section-HTML-Output-Method
This input:
With this stylesheet (just the identity rule):
Output:
Only Xalan, with the html serialization method, outputs:
So, if you want proper rendering, you need to output a proper HTML document...
This stylesheet:
Output:
Note the proper charset encoding declaration.
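As a rough illustration of the point above, here is a minimal Java sketch that produces a complete HTML document with the html output method and an explicit UTF-8 encoding. The stylesheet, the class name HtmlQuotesDemo, and the sample input are all hypothetical, not the ones from this answer. The XSLT 1.0 spec says the html output method should add a META element with the charset right after the HEAD start tag; whether the quotes themselves come out as raw characters or as entity references depends on the processor, so the sketch does not assume either.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HtmlQuotesDemo {
    // Hypothetical stylesheet: wraps the text of <root> in a minimal HTML page
    // and asks for HTML serialization in UTF-8.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' encoding='UTF-8'/>"
      + "<xsl:template match='/'>"
      + "<html><head><title>Demo</title></head>"
      + "<body><p><xsl:value-of select='root'/></p></body></html>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the transformation and returns the serialized bytes decoded as UTF-8.
    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // \u201C and \u201D are the left and right double quotation marks.
        System.out.println(transform("<root>\u201Chello\u201D</root>"));
    }
}
```

Writing to a byte stream rather than a Writer keeps the declared encoding and the actual bytes in agreement, which is what a browser relies on when it reads the charset declaration.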
You need to understand that an XSL transformation is applied not to the XML document itself but to a tree representation of that document. Text nodes contain values in one particular encoding regardless of how they were represented in the input document; once the tree is built, they are the same. During the transformation you just create another tree, which is then serialized.
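The tree view can be checked with a small Java sketch. It parses two documents, one with the quotation marks written literally and one with numeric character references, and compares the resulting text. The class name TextNodeDemo and the sample documents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TextNodeDemo {
    // Parses the XML string and returns the text content of the root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Same content, two surface forms: literal characters vs. character references.
        String literal = rootText("<root>\u201Chi\u201D</root>");
        String escaped = rootText("<root>&#8220;hi&#8221;</root>");

        System.out.println(literal.equals(escaped));  // prints "true"
        System.out.println((int) escaped.charAt(0));  // prints "8220"
    }
}
```

This also suggests one way to get readable characters back from escaped output: reading it with an XML parser resolves the character references, so the text node again holds “ and ”.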
Some characters, like the ones you mentioned, require special treatment depending on which destination format you choose. When serializing to an XML document they are "escaped", and when serializing to HTML they are not. This is why the first answer gives you a workaround.
However, the difference between these two methods with regard to escaping is just the default value of the "disable-output-escaping" attribute (XSLT 1.0). For XML output it's set to "no", and for HTML it's set to "yes".
So, to fix your issue without changing the whole serialization method, you could write something like this when you're copying a value that might contain "special" characters:
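The snippet from this answer is not available here, so the following is only a guess at the kind of instruction it refers to: an xsl:value-of with disable-output-escaping="yes". The class name DoeDemo, the stylesheet, and the input are hypothetical. The sketch demonstrates the attribute on an ampersand, where its effect is well defined; how curly quotes are written also depends on the output encoding and the serializer, so treat this as an experiment rather than a guaranteed fix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DoeDemo {
    // Hypothetical stylesheet that copies the text of <root> into <out>,
    // with output escaping switched on or off for that one value.
    static String stylesheet(boolean disableEscaping) {
        String doe = disableEscaping ? "yes" : "no";
        return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
             + "<xsl:template match='/'>"
             + "<out><xsl:value-of select='root' disable-output-escaping='" + doe + "'/></out>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Applies the stylesheet to the given XML and returns the serialized result.
    static String transform(String xml, boolean disableEscaping) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet(disableEscaping))));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>Fish &amp; chips</root>";
        System.out.println(transform(xml, false)); // ampersand is escaped
        System.out.println(transform(xml, true));  // ampersand is written as-is
    }
}
```

Note that output produced with escaping disabled may no longer be well-formed XML, which is a reason to apply it only to the specific values that need it.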
P.S. In XSLT 2.0, the preferred way to do this kind of thing is the xsl:character-map declaration.