Encoding special characters in XSLT output
I have built a set of scripts, part of which transforms XML documents from one vocabulary into a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value, e.g. '”' must be replaced with '&#x201D;' and so forth. I have not been able to acquire a definitive list of which characters must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
Comments (2)
Setting the output encoding to ASCII (for example, 'encoding="US-ASCII"' on xsl:output) tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII characters in other places, such as element or attribute names, serialisation will fail.)
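
A minimal sketch of what this could look like, assuming an otherwise plain identity transform. The stylesheet below is illustrative and not taken from the question; only the xsl:output declaration matters here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Ask the serialiser for ASCII output: characters outside ASCII in
       text nodes and attribute values are then written as numeric
       character references instead of raw characters. -->
  <xsl:output method="xml" encoding="US-ASCII" indent="no"/>

  <!-- Identity template (placeholder for the real transformation):
       copies every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Two things to check before relying on this: the XML declaration will then name US-ASCII rather than UTF-8, and whether the references come out in decimal or hex form is up to the processor, so it may not meet the target platform's requirements as stated.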
Well, with XSLT 2.0, which you have tagged your post with, you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.
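
As a rough sketch of the character-map approach, assuming the set of characters to encode is known and small. The map name hex-refs and the three characters listed are hypothetical examples, not a list confirmed by the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Keep UTF-8 in the XML declaration and apply the map at serialisation. -->
  <xsl:output method="xml" encoding="UTF-8" use-character-maps="hex-refs"/>

  <!-- Each entry replaces one character with a literal string in the output.
       "&amp;#x201D;" is the seven-character string &#x201D; and is written
       to the output as-is, without further escaping. -->
  <xsl:character-map name="hex-refs">
    <xsl:output-character character="&#x201C;" string="&amp;#x201C;"/>
    <xsl:output-character character="&#x201D;" string="&amp;#x201D;"/>
    <xsl:output-character character="&#x2013;" string="&amp;#x2013;"/>
  </xsl:character-map>

  <!-- Identity template (placeholder for the real transformation). -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Characters without an entry in the map are serialised normally, so only the listed ones are turned into hex references. Every character that needs encoding must be listed explicitly, which is the limitation raised in the question's edit.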