What is better for PHP developers - Unicode or UTF-8?
I am going to create an international CMS. So I am going to have clients all over the world. They will speak all possible languages.
What encoding format is better for browser recognition and for DB data storage?
"Unicode" is not an encoding. You may mean UTF-8 versus UTF-16 (big-endian or little-endian). It really doesn't matter much for browser support. Any modern browser will support all three. You will probably find UTF-8 is the most space-efficient for your database.
UTF-8 is an encoding of Unicode, a way of representing an (abstract) sequence of Unicode characters as a (concrete) sequence of bytes. There are other encodings, such as UTF-16 (which has both big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.
UTF-8 is useful if most of your text is in Western languages since it represents ASCII characters in just one byte, but it needs three bytes each for many characters in "foreign" alphabets such as Chinese. UTF-16, on the other hand, uses exactly two bytes for all characters you're likely to ever encounter (though some very esoteric characters, those outside Unicode's "Basic Multilingual Plane", require four).
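To make the size trade-off concrete, here is a minimal sketch that counts the bytes the same text needs in each encoding. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the helper name byteSizes() and the sample strings are made up for illustration.

```php
<?php
// Compare how many bytes the same text needs in UTF-8 and in UTF-16.
// Assumptions: this file is saved as UTF-8 and the mbstring extension is loaded.

function byteSizes(string $utf8Text): array
{
    // Convert from UTF-8 to big-endian UTF-16 (no byte-order mark is added).
    $utf16 = mb_convert_encoding($utf8Text, 'UTF-16BE', 'UTF-8');

    return [
        'utf8'  => strlen($utf8Text), // strlen() counts bytes, not characters
        'utf16' => strlen($utf16),
    ];
}

$samples = [
    'ascii'   => 'hello',        // five ASCII letters
    'chinese' => '大学',          // two CJK characters inside the Basic Multilingual Plane
    'non-bmp' => "\u{1F600}",    // one character outside the Basic Multilingual Plane
];

foreach ($samples as $label => $text) {
    $sizes = byteSizes($text);
    printf("%-8s utf8=%d utf16=%d\n", $label, $sizes['utf8'], $sizes['utf16']);
}
```

With these samples the ASCII text is half the size in UTF-8, the Chinese text is a third smaller in UTF-16, and the character outside the Basic Multilingual Plane takes four bytes in both.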
I wouldn't recommend using PHP for developing international software, though, because it doesn't really support Unicode properly. It has some add-on functions for working with Unicode encodings (look at the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because that character takes up three bytes in UTF-8, even though it's only one character. Using string-splitting functions like substr() is precarious because if you split in the middle of a multi-byte character you corrupt the string.
Most other languages used for Web development, such as Java, C#, and Python, have built-in support for Unicode, so you can put arbitrary Unicode characters into a string without worrying about which encoding represents them in memory: from your point of view, a string contains characters, not bytes. This is a much safer, less error-prone way to work with Unicode text. For this and other reasons (PHP isn't really that great a language), I'd recommend using something else.
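The byte-versus-character problem is easy to reproduce. The sketch below is only an illustration, assuming a UTF-8 source file and the mbstring extension; the sample string and variable names are made up.

```php
<?php
// Byte-oriented core functions versus the multibyte (mb_*) functions.
// Assumptions: this file is saved as UTF-8 and the mbstring extension is loaded.

$text = '大家好'; // three characters, three bytes each in UTF-8

$byteLength = strlen($text);             // counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // counts characters

// substr() cuts at a byte offset, so this slices through the middle of "大".
$brokenPrefix = substr($text, 0, 2);

// mb_substr() cuts at a character offset and keeps each character whole.
$safePrefix = mb_substr($text, 0, 2, 'UTF-8');

var_dump($byteLength);                               // int(9)
var_dump($charLength);                               // int(3)
var_dump(mb_check_encoding($brokenPrefix, 'UTF-8')); // bool(false): no longer valid UTF-8
var_dump($safePrefix === '大家');                     // bool(true)
```

Passing the encoding explicitly to each mb_* call avoids depending on the configured internal encoding.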
(I've read that PHP 6 will have proper Unicode support, but that's not available yet.)
UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.
Microsoft recommends that
For database storage, use the encoding your RDBMS has better support for. Or, all else being equal, choose based on space efficiency. UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.
Unicode is a standard which defines a bunch of abstract characters (so-called code points) and their properties (is it a digit, is it uppercase etc.). It also defines certain encodings (methods to represent characters with bytes), UTF-8 being one of them. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Spolsky for more details.
I would certainly go with UTF-8. It is the standard everywhere these days and has some nice properties, such as leaving all 7-bit ASCII characters in place, which means that most HTML-related functions such as htmlspecialchars can be used directly on the UTF-8 representation, so you have less chance of leaving encoding-related security holes. Also, a lot of PHP functions explicitly expect UTF-8 strings, and UTF-8 has better text editor support than alternatives like UTF-16.
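As a rough illustration of the ASCII-preserving property, the sketch below escapes a string that mixes HTML metacharacters with a multi-byte character. It assumes a UTF-8 source file; the input string and variable names are made up.

```php
<?php
// htmlspecialchars() applied directly to UTF-8 text.
// Assumption: this file is saved as UTF-8.

$userInput = '<b>大</b> & "quotes"';

// Name the encoding explicitly so the result does not depend on configuration.
$escaped = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // &lt;b&gt;大&lt;/b&gt; &amp; &quot;quotes&quot;

// Every byte of a multi-byte UTF-8 character is 0x80 or above, so none of
// them can be mistaken for an ASCII character such as "<" (0x3C) or "&" (0x26).
echo bin2hex('大'), "\n"; // e5a4a7
```

Only the ASCII metacharacters are rewritten; the bytes of "大" pass through untouched.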
It is better to use UTF-8, because it covers the accented characters of every language in the world. UTF-8 also has room for characters that are not yet in use or not yet recognized. I prefer UTF-8 and its family, and always use it.