Code pages and encodings
Before anyone recommends that I do a Google search on this: I already have. I just need a bit more clarity about what code pages and encodings are.
If I use UTF-8 encoding with an Italian code page and then with a French code page, does that mean I'll get different characters even though the bytes haven't changed?
Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html
And no, if I understand your question correctly, it doesn't mean that.
When you're converting UTF-8 to a specific code page, it is possible that only some of the characters are going to be converted. What happens to the ones that don't get converted depends on how you call the conversion. A possible result is that the characters which could not be mapped to the code page would be converted to question mark characters.
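To make the question-mark behaviour concrete, here is a minimal Python sketch. It assumes Windows code page 1252 as the target (a common Western European code page; the thread does not name a specific one), and the sample text is made up for illustration. What happens to unmappable characters depends on the error handler you pass, which matches the point that the result depends on how you call the conversion.

```python
# Sample text (illustrative only): some characters exist in cp1252, some do not.
text = "Città, café, Ω, 日本"

# Start from UTF-8 bytes, as in the question, and decode them to a string.
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("utf-8")

# errors="replace" substitutes "?" for each character cp1252 cannot represent.
cp1252_bytes = decoded.encode("cp1252", errors="replace")
lossy_text = cp1252_bytes.decode("cp1252")
print(lossy_text)

# errors="strict" (the default) raises instead of silently losing data.
try:
    decoded.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict conversion failed:", strict_failed)
```

The accented Latin letters survive because cp1252 contains them; the Greek and Japanese characters do not.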
An encoding is simply a mapping between numerical values and "characters".
US-ASCII maps the number 65 to the letter A, 32 to a space and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 does the same for these values! But there are other values which UTF-8 treats differently from ASCII. It is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume fewer bytes.
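A short Python sketch of both points: the ASCII values quoted above, and the variable byte length of UTF-8. The four sample characters are chosen only for illustration; they are written as escapes so the example does not depend on how this page itself is encoded.

```python
# ASCII code points mentioned above.
ascii_values = [ord("A"), ord(" "), ord("1")]
print(ascii_values)

# For pure ASCII text, the ASCII and UTF-8 byte sequences are identical.
same_bytes = "A 1".encode("ascii") == "A 1".encode("utf-8")
print(same_bytes)

# UTF-8 uses between 1 and 4 bytes per character.
samples = {
    "A": "A",                     # Latin capital letter A
    "e-acute": "\u00e9",          # Latin small letter e with acute
    "euro": "\u20ac",             # euro sign
    "emoji": "\U0001F600",        # grinning face
}
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
for name, ch in samples.items():
    print(name, lengths[name], ch.encode("utf-8").hex())
```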
Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rendering the information within these files on the screen. Usually they make use of library or OS functions.
If the software assumes a different encoding from the one used by the software that created the file, the wrong characters may be displayed!
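This is the case the original question is really about: the bytes stay the same, but the characters you see depend on which encoding the reader assumes. A small Python sketch, assuming code pages 1252 (Windows Western European) and 437 (original IBM PC) as the two candidates; those two are picked for illustration and are not named in the thread.

```python
# One byte, two code pages, two different characters.
raw = bytes([0xE9])
as_cp1252 = raw.decode("cp1252")   # Latin small letter e with acute
as_cp437 = raw.decode("cp437")     # Greek capital letter theta
print(as_cp1252, as_cp437)

# UTF-8 bytes read correctly, then misread as cp1252 (classic mojibake).
utf8_bytes = "\u00e9".encode("utf-8")      # two bytes: C3 A9
read_as_utf8 = utf8_bytes.decode("utf-8")
read_as_cp1252 = utf8_bytes.decode("cp1252")
print(read_as_utf8, read_as_cp1252)
```

If the bytes are decoded as UTF-8, no code page is involved in interpreting them, so switching between an Italian and a French code page does not change the result. It only matters when some software wrongly treats the UTF-8 bytes as code page data.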
Note that it is possible to convert between different encodings; however, if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
UTF-8 includes all characters from your French and Italian code pages, but the language-specific code pages do not include all of each other's characters.
So you can take input from each language and convert it to UTF-8 for storage, but you cannot be certain that you will get the right characters if you take Italian input and show it using the French code page.
Use UTF-8 all the way if you can.
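As a rough sketch of "convert to UTF-8 for storage": decode each input with the code page it was actually written in, then store everything as UTF-8. The two legacy inputs below are hypothetical, and cp1252 and cp850 are assumed stand-ins for whatever code pages your sources really use.

```python
# Hypothetical legacy inputs: the same word saved by two systems
# that use different code pages, so the raw bytes differ.
legacy_inputs = [
    (b"caf\xe9", "cp1252"),  # Windows Western European
    (b"caf\x82", "cp850"),   # DOS Western European
]

# Decode each input with its own code page, then store as UTF-8.
stored = [raw.decode(codec).encode("utf-8") for raw, codec in legacy_inputs]

# Once normalised, both inputs are byte-for-byte identical.
all_equal = len(set(stored)) == 1
print(stored, all_equal)

# Reading back needs no knowledge of the original code page.
restored = [data.decode("utf-8") for data in stored]
print(restored)
```

The key assumption is that you know which code page each input came from; the bytes themselves do not record it.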