XSLT treats ISO-8859-1 characters as UTF-8 in attributes

Posted on 2024-12-28 06:28:12

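If I make sure ISO-8859-1 is used as the encoding throughout, the ¬ character (0xAC in ISO-8859-1) works in normal text. When it is used in an attribute, however, it is escaped as %C2%AC. I know it needs to be escaped for URLs, but I don't understand why it is escaped the same way as it would be for UTF-8, rather than as just %AC, which is what I would expect for ISO-8859-1.

Since the escaping is present in the output HTML file, the only conclusion is that the XSLT processor is the cause.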

Example:

which for me generates:

output.html

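The output was generated with xsltproc, compiled against libxml 20707, libxslt 10126 and libexslt 815. This is on #! Linux (amd64). I have also tried xmlstarlet tr (which also uses libxml), xalan and Google Chrome (by adding an extra tag to the input, see input_ss.xml), all with the same result.

Opera does not escape it at all, and it allows ¬ to be used literally in URLs and attributes.

Is this standard behaviour for XSLT, or a bug in how attributes are escaped? Either way, is there a solution other than replacing %C2%AC with %AC afterwards? Bear in mind that the same is almost certainly true of other characters that are valid in ISO-8859-1 but not valid in UTF-8.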

简单气质女生网名 2025-01-04 06:28:12

There are 3 different text-based technologies in use here: XML, HTML and URIs.

All of these have escape mechanisms - that is to say, ways to use text to indicate other text that it is impossible or difficult to indicate in a given context.

The not-sign character ¬ (U+00AC) could be escaped in the first two as &#xAC; or &#172;, perhaps with some leading zeros, in both XML and HTML (&not; would also work in HTML). This escape would be used no matter what encoding the XML or HTML was in, because it relates to the character ¬, not to its set of octets in a given character encoding - indeed, we would generally only use it in the case where there was no such set of octets in the encoding being used.
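As a rough illustration of that difference, here is a minimal Python sketch (Python is not part of the original question, and the helper name `char_ref` is made up for this example). The numeric character reference is derived from the code point alone, whereas the octets change with the encoding:

```python
def char_ref(ch: str) -> str:
    """Build a hexadecimal numeric character reference from a code point.

    Hypothetical helper, for illustration only.
    """
    return f"&#x{ord(ch):X};"


not_sign = "\u00ac"  # the not-sign character

# The reference depends only on the code point; no encoding is involved.
reference = char_ref(not_sign)

# The octets, by contrast, differ between encodings.
latin1_octets = not_sign.encode("iso-8859-1")
utf8_octets = not_sign.encode("utf-8")

print(reference)            # &#xAC;
print(latin1_octets.hex())  # ac
print(utf8_octets.hex())    # c2ac
```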

In this case, this is unnecessary, since the output is in a character encoding in which there is no need to escape it, and so in the source you can see the ¬ character unescaped.

This HTML includes the text of a URI. The encoding of the HTML has nothing to do with this, because the encoding is how we get the text of the HTML from one machine to another, but when the HTML is being parsed to read this URI we're past that point and are dealing with some text at the level of text - that is to say, it doesn't have an encoding any more.

Now, URIs have their own escape mechanisms. This must be used in the case of ¬, as it is not a character allowed in URIs (as opposed to IRIs). Sadly, unlike the escapes in XML and HTML, these escapes are based on octets in a given encoding rather than the code-point of the character itself.
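To make the octet dependence concrete, the following sketch uses Python's standard `urllib.parse.quote` (again, Python is only used for illustration here; the question itself concerns XSLT processors). The same character yields different percent-escapes depending on which encoding supplies the octets:

```python
from urllib.parse import quote

not_sign = "\u00ac"  # the not-sign character

# quote() percent-encodes the octets of the string in the chosen encoding.
# When no encoding is given it uses UTF-8.
utf8_escape = quote(not_sign)
latin1_escape = quote(not_sign, encoding="iso-8859-1")

print(utf8_escape)    # %C2%AC
print(latin1_escape)  # %AC
```

The first result matches what the XSLT processors in the question produced; the second is what the asker expected.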

It's easy to see this as a mistake now, but URIs were specified in 1994, formalising work going back to 1989/1990, while Unicode 1.0 was released in 1991 and didn't get the ground-breaking 2.0 until 1996, so we have considerably more benefit of hindsight than the inventors of URIs had. (HTML had the same problem many years ago, but the format of its encodings made it much easier to fix this without as many backwards-compatibility issues.)

So, what encoding should we use for those octets? The original specs left this undefined, but really the only possible choice is UTF-8. It's the only encoding that keeps the commonly used escapes for characters special to URIs in the range 0x20 - 0x7F while also covering all of the UCS.

There's also no way to indicate that another choice could be more appropriate. Remember, we're working at the level of text, so your use of ISO-8859-1 is completely irrelevant. Even if we kept track of the encoding while parsing the HTML, the URI is going to be used in a way that has nothing to do with the document, so we still couldn't use it. In all, if we have to make use of an octet-based encoding, and we have to keep characters in the ASCII range matching the octets they'd have in ASCII, the only possible basis for the encoding is UTF-8.
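The two properties this argument relies on can be checked with a short Python sketch. It only compares UTF-8 with ISO-8859-1, so it demonstrates the properties rather than proving that no other encoding has them; the euro sign is an arbitrary example of a character outside ISO-8859-1:

```python
from urllib.parse import quote

# Property 1: characters in the ASCII range keep their ASCII octets in UTF-8,
# so escapes for URI-special characters are the same as they always were.
ascii_text = "".join(chr(cp) for cp in range(0x20, 0x7F))
ascii_preserved = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
special_escapes = quote(" /?", safe="")  # %20%2F%3F

# Property 2: UTF-8 covers the whole UCS, while ISO-8859-1 does not.
euro = "\u20ac"  # arbitrary example of a character not in ISO-8859-1
euro_utf8_escape = quote(euro)  # %E2%82%AC

try:
    quote(euro, encoding="iso-8859-1", errors="strict")
    latin1_can_escape_euro = True
except UnicodeEncodeError:
    latin1_can_escape_euro = False

print(ascii_preserved, special_escapes, euro_utf8_escape, latin1_can_escape_euro)
```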

For that reason, the escape in any URI for ¬ must always be %C2%AC.

There can be some legacy systems that expect URIs to use other encodings, but the solution is to fix the bit that's broken, not the bit that works. So if something expects ¬ to be %AC, convert %C2%AC to %AC as close as possible to the point where that system consumes it (and if it outputs %AC itself then of course you'll need to fix that to %C2%AC before it hits the outside world).
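One way such a conversion at the boundary might look is sketched below in Python. Both function names are hypothetical, and the sketch assumes it is given a single URI component (one path segment or one query value) rather than a whole URI, and that every character in it exists in ISO-8859-1:

```python
from urllib.parse import quote, unquote


def utf8_escapes_to_latin1(component: str) -> str:
    """Re-escape one URI component for a legacy consumer that expects
    ISO-8859-1 octets (hypothetical helper).

    Raises UnicodeDecodeError if the escapes are not valid UTF-8, and
    UnicodeEncodeError if a character is not representable in ISO-8859-1.
    """
    text = unquote(component, encoding="utf-8", errors="strict")
    return quote(text, safe="", encoding="iso-8859-1", errors="strict")


def latin1_escapes_to_utf8(component: str) -> str:
    """Re-escape one URI component produced by such a legacy system so that
    it uses UTF-8 octets before it reaches the outside world
    (hypothetical helper)."""
    text = unquote(component, encoding="iso-8859-1", errors="strict")
    return quote(text, safe="", encoding="utf-8", errors="strict")


print(utf8_escapes_to_latin1("a%C2%ACb"))  # a%ACb
print(latin1_escapes_to_utf8("a%ACb"))     # a%C2%ACb
```

Passing `safe=""` means reserved characters such as `/` are escaped again after decoding, which is why the helpers are meant for one component at a time.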
