XML 中的编码是什么?
XML 中的编码是什么?正常使用的编码是utf-8。它与其他编码有何不同?使用它的目的是什么?
What is encoding in XML? The normal encoding used is utf-8. How is it different from other encoding? What is the purpose of using it?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
字符编码指定字符如何映射到字节。由于 XML 文档作为字节流存储和传输,因此有必要表示组成 XML 文档的 unicode 字符。
UTF-8 被选为默认值,因为它有几个优点
字符编码是一种更通用的主题不仅仅是 XML。 UTF-8不仅限于在 XML 中使用。
每个程序员绝对需要了解处理文本的编码和字符集 是一篇很好的文章,给出了对这个主题的一个很好的概述。
A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, this is necessary to represent the unicode characters that make up an XML document.
UTF-8 is chosen as the default, because it has several advantages:
Character encodings are a more general topic than just XML. UTF-8 is not restricted to being used in XML only.
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is a good article that gives a good overview over the topic.
当计算机最初被创建时,它们大多只使用英语中的字符,从而产生了 7 位 US-ASCII 标准。
然而,世界上有许多不同的书面语言,必须找到在计算机中使用它们的方法。
如果您限制自己使用某种语言,第一种方法可以很好地工作,它是使用特定于文化的编码,例如 ISO-8859-1,它能够以 8 位表示拉丁欧洲语言字符,或者使用 GB2312 表示中文字符。
第二种方式有点复杂,但理论上可以表示世界上的每个字符,它是 Unicode 标准,其中每种语言的每个字符都有一个特定的代码。
然而,考虑到现有字符数量较多(Unicode 5 中为 109,000 个),unicode 字符通常使用三字节表示(一个字节用于 Unicode 平面,两个字节用于字符代码)。
为了最大限度地与现有代码兼容(有些仍在使用 ASCII 中的文本),UTF-8 标准编码被设计为存储 Unicode 字符的一种方式,仅使用最少的空间,如 Joachim Sauer 的答案中所述
。如果文件仅由仅理解这些语言的软件(和人员)编辑或读取,则通常会看到使用特定字符集(例如 ISO-8859-1)编码的文件;如果需要高度互操作性,则通常会看到使用 UTF-8 编码的文件独立于文化。
目前的趋势是 UTF-8 取代其他字符集,尽管这需要软件开发人员的工作,因为 UTF-8 字符串比固定宽度字符集字符串处理起来更复杂。
When computers were first created, they mostly only worked with characters found in the english language, leading to the 7-bit US-ASCII standard.
However, there are a lot of different written languages in the world, and ways had to be found to be able to use them in computers.
The first way works fine if you restrict yourself to a certain language, it's to use a culture specific encoding, such as ISO-8859-1, which is able to represent latin-european language characters on 8-bits, or GB2312 for chinese characters.
The second way is a bit more complicated, but allows theoretically to represent every character in the world, it's the Unicode standard, in which every character from every language has a specific code.
However, given the high number of existing characters (109,000 in Unicode 5), unicode characters are normally represented using a three byte representation (one byte for the Unicode plane, and two bytes for the character code.
In order to maximize compatibility with existing code (some is still using text in ASCII), the UTF-8 standard encoding was devised as a way to store Unicode characters, only using the minimal amount of space, as described in Joachim Sauer's answer.
So, it's common to see files encoded with specific charsets such as ISO-8859-1 if the file is meant to be edited or read only by software (and people) understanding only these languages, and UTF-8 when there's the need to be highly interoperable and culture-independant.
The current tendancy is for UTF-8 to replace other charsets, even though it needs work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.
XML 文档可以包含非 ASCII 字符,例如挪威语 æ ø å 或法语 ê è é。因此,为避免错误,请设置编码或将 XML 文件保存为 Unicode。
XML 编码规则
XML documents can contain non ASCII characters, like Norwegian æ ø å , or French ê è é. So, to avoid errors you set the encoding or save the XML file as Unicode.
XML Encoding Rules
当数据被存储或传输时,它只是字节。这些字节需要一些解释。使用非英语语言环境的用户曾经在使用仅出现在其语言环境中的字符时遇到一些问题。这些字符经常以错误的方式显示。
XML 具有如何解释其字节的信息,可以以正确的方式显示字符。
When data is stored or transfered it is only bytes. Those bytes need some interpretation. Users with non English locales used to have some problems with characters that only appeared in their locale. Those characters were displayed in a wrong way frequently.
With XML having an information how to interpret its bytes character can be displayed in a correct way.