How to determine the "lowest" encoding possible?
Scenario
You have lots of XML files stored as UTF-16 in a database or on a server where space is not an issue. You need to send a large majority of these files to other systems as XML files, and there it is critical to use as little space as possible.
Issue
In reality, only about 10% of the files stored as UTF-16 need to be stored as UTF-16; the rest can safely be stored as UTF-8. If only the files that need UTF-16 use it and the rest use UTF-8, we can use about 40% less space on the file system.
We have tried compressing the data heavily, and this is useful, but we get the same compression ratio with UTF-8 as with UTF-16, and UTF-8 also compresses faster. So if as much of the data as possible is stored as UTF-8, we save space when the data is stored uncompressed, we still save space when it is compressed, and we even save time on the compression itself.
Goal
To figure out when an XML file contains Unicode characters that require UTF-16, so that we use UTF-16 only when we have to.
Some Details about XML File and Data
While we control the schema for the XML itself, we do not control what kinds of strings can go in the values from a Unicode perspective, because the source is free to provide any Unicode data. However, this is rare, so we would rather not use UTF-16 every time just to support something that is needed only 10% of the time.
Development Environment
We are using C# with the .NET Framework 4.0.
EDIT: Solution
The solution is just to use UTF-8.
The question was based on my misunderstanding of UTF and I appreciate everyone helping set me straight. Thank you!
Comments (5)
Edit: I didn’t realise that your question implies that you think that there are Unicode strings that cannot be safely encoded as UTF-8. This is not the case. The following answer assumes that what you really meant was that some strings will simply be longer (take more storage space) as UTF-8.
I would say even less than 10% of the files need to be stored as UTF-16. Even if your XML contains significant amounts of Chinese, Japanese, Korean, or another language that is larger in UTF-8 than UTF-16, it is still only an issue if there is more text in that language than there is XML syntax.
Therefore, my initial intuition is “use UTF-8 until it’s a problem”. It makes for consistency, too.
Only if you have serious reason to believe that a large proportion of the XML will be East Asian do you need to worry about it. In that case, I would apply a simple heuristic: go through the XML and count the characters from U+0800 to U+FFFF (three bytes in UTF-8, two in UTF-16), and use UTF-16 only if that count is greater than the number of characters below U+0080 (one byte in UTF-8, two in UTF-16).
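A minimal sketch of that counting heuristic follows. It is written in Python purely for illustration (the asker's environment is C#), the function name `pick_encoding` is made up for this example, and byte order marks are ignored. Characters from U+0080 to U+07FF and characters above U+FFFF cost the same in both encodings, so they are skipped.

```python
def pick_encoding(text: str) -> str:
    """Return 'utf-16' only when it would produce fewer bytes than UTF-8."""
    one_byte = 0    # below U+0080: 1 byte in UTF-8, 2 bytes in UTF-16
    three_byte = 0  # U+0800..U+FFFF: 3 bytes in UTF-8, 2 bytes in UTF-16
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            one_byte += 1
        elif 0x800 <= cp <= 0xFFFF:
            three_byte += 1
        # U+0080..U+07FF (2 bytes in both) and code points above U+FFFF
        # (4 bytes in both) do not change the comparison.
    return "utf-16" if three_byte > one_byte else "utf-8"


print(pick_encoding("<a>hello</a>"))   # markup and ASCII text
print(pick_encoding("<a>日本</a>"))     # a little CJK inside ASCII markup
print(pick_encoding("日本語のテキスト"))  # CJK text with no markup
```

Each ASCII character saves one byte in UTF-8 and each character in the U+0800 to U+FFFF range costs one extra byte, so comparing the two counts is the same as comparing the encoded sizes.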
Encode everything in UTF-8. UTF-8 can handle anything UTF-16 can, and is almost surely going to be smaller in the case of an XML document. UTF-8 is larger than UTF-16 only when the file is largely composed of characters from U+0800 to U+FFFF, which take three bytes in UTF-8 and two in UTF-16; characters beyond the BMP take four bytes in both. In the best case (pure ASCII, which includes every character you can type on a standard U.S. 104-key keyboard), a UTF-8 file is half the size of the UTF-16 one.
UTF-8 requires 2 bytes or fewer per character for all symbols at or below U+07FF, and one byte for any ASCII character (U+0000 to U+007F); the characters from U+0080 to U+00FF, often called "Extended ASCII", take two. That means UTF-8 will be no larger than UTF-16 (and probably far smaller) for almost any document in a modern-day language using the Latin, Greek, Cyrillic, Hebrew or Arabic alphabets, including the IPA. Most dedicated mathematical operators sit above U+07FF and take three bytes. The range up to U+07FF is the low end of the Basic Multilingual Plane (which itself runs from U+0000 to U+FFFF), and it covers the large majority of official national languages outside of Asia.
UTF-16, as a general rule, will give you a smaller file for documents written primarily in Devanagari (Hindi), Japanese, Chinese, or Hangul (Korean), or in a less common script in the upper part of the BMP (Cherokee or Inuit syllabics, anyone?), and MAY be smaller for documents that heavily use specialized mathematical, scientific, engineering or game symbols. If the XML you're working with holds localization files for India, China and Japan, you MAY get a smaller file size with UTF-16, but you will have to make your program smart enough to know that the localization file is encoded that way.
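To make the per-character costs concrete, here is a small Python sketch (illustrative only; the helper name `encoded_sizes` and the sample characters are my own choices) that reports the encoded size of one character from several ranges. In .NET, `Encoding.UTF8.GetByteCount` and `Encoding.Unicode.GetByteCount` should report the same numbers.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for one character, without a BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = [
    ("\u0041", "ASCII letter A, U+0041"),
    ("\u00e9", "Latin-1 letter e-acute, U+00E9"),
    ("\u044f", "Cyrillic letter ya, U+044F"),
    ("\u0905", "Devanagari letter a, U+0905"),
    ("\u4e2d", "CJK ideograph, U+4E2D"),
    ("\u2211", "summation sign, U+2211"),
    ("\U0001F600", "emoji outside the BMP, U+1F600"),
]

for ch, label in samples:
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

UTF-16 only wins for the three rows between U+0800 and U+FFFF, and then by one byte per character.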
You never 'need' to use UTF-16 instead of UTF-8 and the choice is not about 'safety'. Both encodings have the same encodable character repertoire.
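As a quick check of that claim, the following Python sketch (for illustration only; `utf16_to_utf8` is a hypothetical helper name) re-encodes UTF-16 bytes as UTF-8 and confirms that the text survives, including a character outside the BMP.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode BOM-prefixed UTF-16 bytes as UTF-8."""
    # The "utf-16" codec reads the byte order mark to pick the endianness.
    return data.decode("utf-16").encode("utf-8")


# Sample text mixing ASCII, Latin-1, CJK and a character outside the BMP.
original = "<note>caf\u00e9 \u4e2d\u6587 \U0001F600</note>"
as_utf16 = original.encode("utf-16")  # Python's "utf-16" codec writes a BOM first
as_utf8 = utf16_to_utf8(as_utf16)

print(as_utf8.decode("utf-8") == original)
```

For real XML files, an XML declaration that names `encoding="UTF-16"` would also need to be updated after transcoding, since the declared encoding has to match the actual one.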
There is no such thing as a document that has to be UTF-16. Any UTF-16 document can also be encoded as UTF-8. It is theoretically possible to have a document which is larger as UTF-8 than as UTF-16, but this is vanishingly unlikely, and not worth stressing over.
Just encode everything as UTF-8 and stop worrying about it.
There are no characters that require UTF-16 rather than UTF-8. Both UTF-8 and UTF-16 (and for that matter, UTF-32 along with some other non-recommended formats) can encode the entire UCS; that is what a UTF (Unicode, or UCS, Transformation Format) is for.
There are some streams that will be smaller in UTF-16 than in UTF-8. In practice, such streams will largely contain Asian ideographs, which are linguistically very concise. XML, however, requires some characters in the 0x20-0x7F range with specific meanings, and element and attribute names are quite often written in alphabet-based scripts.
Because of the aforementioned concision of these ideographs, the ratio of XML tags (including the element and attribute names along with the less-thans and greater-thans) to human-targeted text will be much higher than in languages that use alphabets and syllabaries. For this reason, even in cases where plain text in UTF-16 would be appreciably smaller than the same text in UTF-8, when it comes to XML either this difference will be less, or the UTF-8 will still be smaller.
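A small worked example of the markup effect, sketched in Python for illustration. The element names `customer`, `name` and `city` and the sample values are invented for this example, and sizes exclude any byte order mark.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for the text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


plain = "山田太郎東京"
record = "<customer><name>山田太郎</name><city>東京</city></customer>"

print("plain text:", sizes(plain))   # 6 ideographs only
print("as XML:    ", sizes(record))  # the same 6 ideographs plus 47 ASCII markup characters
```

The six ideographs alone take 18 bytes in UTF-8 against 12 in UTF-16, but once the 47 ASCII markup characters are added the record takes 65 bytes in UTF-8 against 106 in UTF-16.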
As a rule, use UTF-8 for transmission and storage.
Edit: Just noticed that you're compressing too. In which case, the balance is even less important, just use UTF-8 and be done with it.
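If you want to measure this on your own data, a rough Python sketch like the one below compares raw and compressed sizes for both encodings. The helper `compressed_sizes` and the sample document are made up for illustration, zlib stands in for whatever compressor you actually use, and the results will vary with the data, so treat it as a measuring aid and not as evidence.

```python
import zlib


def compressed_sizes(text: str, level: int = 9) -> dict:
    """Map each encoding to (raw bytes, zlib-compressed bytes), without a BOM."""
    result = {}
    for name, codec in (("utf-8", "utf-8"), ("utf-16", "utf-16-le")):
        raw = text.encode(codec)
        result[name] = (len(raw), len(zlib.compress(raw, level)))
    return result


# Hypothetical, highly repetitive sample; substitute a real document here.
sample = "<row><id>1</id><name>example</name></row>\n" * 200

for name, (raw, packed) in compressed_sizes(sample).items():
    print(f"{name}: {raw} bytes raw, {packed} bytes compressed")
```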