Why can't I use Ñ in my XML output when it is declared as UTF-8?

Posted on 2024-10-19 09:00:39

I have the "N tilde" character Ñ in my z/OS DB2 database. I am generating an XML file from the data. In the XML I have encoding="UTF-8", but Internet Explorer gives me the error "Illegal character in text field". If I change the encoding to ISO-8859-1, it works fine.

I thought ISO-8859-1 was a subset of UTF-8, so why is it not working with UTF-8?

Is UTF-8 the best for an XML document?

Comments (4)

凉城已无爱 2024-10-26 09:00:39

ISO-8859-1 is not a subset of UTF-8. It can represent a subset of the characters representable in UTF-8, but it doesn't do so in the same way.

Both ISO-8859-1 and UTF-8 are supersets of ASCII (i.e. they can represent all characters that ASCII can represent and they represent them in the same way).

So you can't just label ISO-8859-1 data as UTF-8 and hope that it works; you need to actually store (or convert) your data as UTF-8.
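A minimal sketch of such a conversion in Java, assuming the data really arrives as ISO-8859-1 bytes (the class and method names are made up for illustration): decode the bytes with the charset they were written in, then encode the resulting text as UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Latin1ToUtf8 {
    // Decode the input as ISO-8859-1, then re-encode the text as UTF-8.
    // Assumption: the caller knows the input really is ISO-8859-1.
    static byte[] transcode(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xD1};            // U+00D1 encoded as ISO-8859-1
        for (byte b : transcode(latin1)) {
            System.out.printf("%02X ", b & 0xFF); // hex dump of the UTF-8 bytes
        }
        System.out.println();
    }
}
```

Changing only the label in the XML declaration leaves the bytes untouched, which is why relabeling alone does not help.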

恍梦境° 2024-10-26 09:00:39

UTF-8 ≠ Unicode

Note carefully:

  • ASCII is a subset of ISO 8859-1.
  • ASCII is a subset of Unicode.
  • ASCII is a subset of UTF-8.
  • ISO 8859-1 is a subset of Unicode.
  • ISO 8859-1 is not a subset of UTF-8.
  • Unicode is not the same thing as UTF-8.
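A rough way to check these relationships from Java, offered as a sketch rather than a proof (the class and method names are invented for this example): every ISO 8859-1 byte value maps to the Unicode code point with the same number, but only the ASCII range keeps its single-byte form in UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class SubsetCheck {
    // True if each ISO 8859-1 byte value b decodes to the code point with value b.
    static boolean latin1MatchesUnicodeCodePoints() {
        for (int b = 0; b <= 0xFF; b++) {
            String s = new String(new byte[]{(byte) b}, StandardCharsets.ISO_8859_1);
            if (s.length() != 1 || s.codePointAt(0) != b) {
                return false;
            }
        }
        return true;
    }

    // Number of bytes UTF-8 needs for the given (non-surrogate) code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(latin1MatchesUnicodeCodePoints());
        System.out.println(utf8Length(0x41)); // an ASCII letter
        System.out.println(utf8Length(0xD1)); // N with tilde
    }
}
```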

I strongly advise familiarizing oneself with the subtleties in modern terminology.

If that’s too confusing, you might look at Radix-50, which has a repertoire many orders of magnitude smaller than Unicode’s, but which nevertheless manifests several of the same subtleties that now escape people with respect to Unicode, character repertoires, coded character sets, character encoding forms, and character encoding schemes.

Java chars Incapable of Holding Characters

Since you’re coming at this from Java, it really isn’t your fault that these aren’t clearly separate concepts in your mind. That’s because Java gravely confuses these issues by not separating out the abstract code points (the logical characters) of a coded character set from the down-and-dirty mechanics of one particular character encoding form.

Java’s miserable conflation of chars with logical characters is error-prone in the extreme; perhaps it would be more accurate to say that Java programmers’ conflation of the same is miserable. In any event, there now seems to be no hope of remedy, ever.
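To make the distinction concrete, here is a small sketch (the class name is invented for illustration) showing that a Java char is a UTF-16 code unit rather than a code point: Ñ fits in one char, but a character outside the Basic Multilingual Plane, such as U+1D11E, needs two.

```java
final class CharVsCodePoint {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (logical characters) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String nTilde = "\u00D1";                             // U+00D1, inside the BMP
        String clef = new String(Character.toChars(0x1D11E)); // U+1D11E, outside the BMP
        System.out.println(charCount(nTilde) + " char(s), " + codePointCount(nTilde) + " code point(s)");
        System.out.println(charCount(clef) + " char(s), " + codePointCount(clef) + " code point(s)");
    }
}
```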

Blame it all on the hysterical porpoises if you must, but the most charitable thing you can say about it is that it is highly unfortunate. Because of all this, well-meaning and perfectly competent programmers like yourself will forever be easily confused, and so will continually write Java code that is simple, clear, and wrong.

Education about all this is the only possible palliative, but it is no true cure.

眼睛会笑 2024-10-26 09:00:39

ISO-8859-1 is not at all a subset of UTF-8. ASCII is a subset of both ISO-8859-1 and UTF-8. They specifically differ for characters in the Unicode code point range of U+0080 - U+00FF.

In ISO-8859-1, the character 'Ñ' (U+00D1 LATIN CAPITAL LETTER N WITH TILDE) is represented as the single byte D1. In UTF-8, the same character is represented by the two-byte sequence C3 91.
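This is most likely why a parser that trusts the UTF-8 label rejects the file: D1 is the lead byte of a two-byte UTF-8 sequence, so a lone D1 is malformed. A sketch of a strict check using the JDK's CharsetDecoder (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    // A fresh CharsetDecoder reports malformed input by default, so decode()
    // throws instead of silently substituting a replacement character.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "\u00D1".getBytes(StandardCharsets.ISO_8859_1); // D1
        byte[] utf8 = "\u00D1".getBytes(StandardCharsets.UTF_8);        // C3 91
        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```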

云淡月浅 2024-10-26 09:00:39

For generating XML in Java, the best thing to do would be to use an XML library; this also ensures that everything is well-formed.
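As one possible illustration, the StAX API bundled with the JDK (javax.xml.stream) can write the document. This is only a sketch: the element name "name" and the class name are invented for the example, and other XML libraries would work as well.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StaxUtf8Example {
    // Writes <name>...</name> as a UTF-8 encoded XML document.
    static byte[] write(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            xml.writeStartDocument("UTF-8", "1.0"); // arguments: encoding, then version
            xml.writeStartElement("name");
            xml.writeCharacters(name);              // the writer escapes & and < itself
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(write("\u00D1").length + " bytes written");
    }
}
```

The same encoding name is passed both to the factory (which controls the bytes) and to writeStartDocument (which controls the declaration), so the two cannot drift apart.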

If you must create it by hand, it is best to use new OutputStreamWriter(stream, encoding), where encoding is the same encoding that you declare in your XML preamble.
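A hand-written version might look like the sketch below (class, method, and element names are hypothetical). The point is that a single Charset value drives both the declaration and the actual byte encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HandWrittenXml {
    // Caution: this sketch does not escape &, < or > in the text; real code must.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            // The declared encoding is taken from the same Charset the writer uses.
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write("<name>" + text + "</name>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(write("\u00D1", StandardCharsets.UTF_8).length);
        System.out.println(write("\u00D1", StandardCharsets.ISO_8859_1).length);
    }
}
```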

Also make sure that the Strings you get from your database are encoded the right way.
