What multi-byte character set starts with 0x7F and is 4 bytes long?
I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese text in an encoding that starts every character with 0xE3 and is three bytes long. It displays properly on my computer if I choose the Japanese locale in Windows, but it doesn't display properly in our application. If any locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Does anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
Comments (5)
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
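One way to check the UTF-32LE and UTF-16 readings is to decode the same four bytes each way and compare the resulting code points. This is a minimal Python sketch; the sample bytes `7F 4E 00 00` and the helper name `interpretations` are made up for illustration and are not taken from the question's data.

```python
# Hypothetical sample: 0x7F followed by three bytes, the last two being zero.
sample = bytes([0x7F, 0x4E, 0x00, 0x00])


def interpretations(data: bytes) -> dict[str, str]:
    """Decode the same bytes under several candidate encodings."""
    results = {}
    for codec in ("utf-32-le", "utf-16-le", "utf-16-be"):
        try:
            text = data.decode(codec)
        except UnicodeDecodeError as exc:
            results[codec] = f"<invalid: {exc.reason}>"
        else:
            # Show each decoded character as a U+XXXX code point.
            results[codec] = " ".join(f"U+{ord(ch):04X}" for ch in text)
    return results


for codec, shown in interpretations(sample).items():
    print(f"{codec:10} {shown}")
```

If the UTF-32LE reading lands on a plausible CJK code point while the UTF-16 readings produce a stray U+0000, that points towards UTF-32LE.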
ETA:
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
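The range claim follows from the UTF-8 bit layout: a three-byte sequence carries the top four bits of the code point in the low nibble of its lead byte. A small Python sketch of that arithmetic (the helper name `three_byte_range` is illustrative, and it ignores the overlong and surrogate exclusions that apply to lead bytes 0xE0 and 0xED):

```python
def three_byte_range(lead: int) -> tuple[int, int]:
    """Code point range reachable from a 3-byte UTF-8 lead byte (1110xxxx)."""
    if not 0xE0 <= lead <= 0xEF:
        raise ValueError("not a 3-byte UTF-8 lead byte")
    low = (lead & 0x0F) << 12   # the two continuation bytes contribute all zeros
    high = low | 0x0FFF         # the two continuation bytes contribute all ones
    return low, high


low, high = three_byte_range(0xE3)
print(f"0xE3 covers U+{low:04X}..U+{high:04X}")

# Hiragana and katakana fall inside that range, so they encode with lead byte 0xE3.
for ch in "あア":
    print(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "))
```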
Then chances are your application is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces instead of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
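The code page limitation can be imitated outside Windows. In this rough Python sketch, the `cp1252` and `cp932` codecs stand in for a Western and a Japanese system code page; it does not call any Win32 API, and the filename is made up.

```python
def representable(text: str, codepage: str) -> bool:
    """True if every character of `text` exists in the given code page."""
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


filename = "日本語.txt"  # hypothetical filename
for codepage in ("cp1252", "cp932"):
    print(codepage, representable(filename, codepage))
```

An application that funnels all text through the current code page has nowhere to put the ideographs when that code page is a Western one, which matches the behaviour described in the question.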
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume that Unicode code points are being encoded.
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
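The scheme described in this answer can be sketched in Python as below. Several details are assumptions, because the exact formula and key string are not reproduced in the text: the key is taken to be one guessed character, the digits, the upper-case letters, the other guessed character, then the lower-case letters; which of ‘.’ and ‘_’ goes where is a guess; and A, B and C are taken to be the three bytes after 0x7F, combined big-endian at 6 bits each. The function names are illustrative.

```python
import string

# ASSUMPTION: placement of '.' and '_' is a guess; only the overall shape
# (extra char, digits, upper case, extra char, lower case) comes from the answer.
KEY = "." + string.digits + string.ascii_uppercase + "_" + string.ascii_lowercase
assert len(KEY) == 64


def decode_escape(escape: bytes) -> str:
    """Decode 0x7F + three key characters into one Unicode character."""
    if len(escape) != 4 or escape[0] != 0x7F:
        raise ValueError("expected 0x7F followed by three key characters")
    value = 0
    for byte in escape[1:]:
        # Big-endian: each key character contributes the next 6 bits.
        value = (value << 6) | KEY.index(chr(byte))
    return chr(value)


def encode_escape(char: str) -> bytes:
    """Inverse of decode_escape; three 6-bit digits hold at most 18 bits."""
    code = ord(char)
    if code >= 1 << 18:
        raise ValueError("code point does not fit in three base-64 digits")
    digits = [KEY[(code >> shift) & 0x3F] for shift in (12, 6, 0)]
    return b"\x7f" + "".join(digits).encode("ascii")


print(decode_escape(encode_escape("中")))
```

Joining the decoded characters gives a native Unicode string, which is the approach suggested in ETA(3). If real data uses a different key order, only `KEY` needs to change.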
You might want to look at the Chinese character encoding page on Wikipedia. The only encoding there that I can see that is always 4 bytes long is UTF-32.
GB 18030 is the current standard Chinese character set, but its characters are one, two, or four bytes long.
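The variable length is easy to see with Python's built-in `gb18030` codec. A short sketch with arbitrarily chosen sample characters:

```python
def gb18030_bytes(char: str) -> bytes:
    """Encode a single character with Python's built-in GB 18030 codec."""
    return char.encode("gb18030")


# ASCII letter, common Chinese ideograph, character outside the BMP.
for ch in ("A", "中", "😀"):
    encoded = gb18030_bytes(ch)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

In the multi-byte cases the first byte is 0x81 or higher, so a leading 0x7F does not fit GB 18030 either.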
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
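A minimal usage sketch, assuming the third-party `chardet` package is installed (`pip install chardet`). The wrapper name `guess_encoding` and the sample text are made up, and detection on short inputs can be unreliable, so treat the result as a hint.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(data: bytes):
    """Return (encoding, confidence) guessed by chardet, or None if unavailable."""
    if chardet is None:
        return None
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    return result["encoding"], result["confidence"]


sample = "こんにちは、世界。日本語のテキストです。".encode("utf-8")
print(guess_encoding(sample))
```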
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
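The distinction can be shown in a few lines of Python: one character has a single code point and name in Unicode, but a different byte sequence under each encoding. The helper name `describe` is illustrative.

```python
import unicodedata


def describe(char: str) -> dict[str, str]:
    """Show one character's Unicode identity next to several byte encodings."""
    info = {
        "code_point": f"U+{ord(char):04X}",
        "name": unicodedata.name(char),
    }
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        info[codec] = char.encode(codec).hex(" ")
    return info


for key, value in describe("中").items():
    print(f"{key:10} {value}")
```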
It might be a valid Unicode encoding, such as UTF-8 or a UTF-16 surrogate pair.
Yes, the Chinese one is UTF-8, an implementation (encoding) of Unicode.
UTF-8 uses 1 byte for ASCII characters and up to 4 bytes for other characters.
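In UTF-8 the length of a sequence is determined by its first byte, so a candidate lead byte can be checked directly. A small Python sketch (the function name `utf8_sequence_length` is illustrative):

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence that starts with `lead`."""
    if lead <= 0x7F:
        return 1            # ASCII, including 0x7F (DEL)
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


for lead in (0x41, 0x7F, 0xE3, 0xF0):
    print(f"0x{lead:02X} -> {utf8_sequence_length(lead)} byte(s)")
```

Under these rules 0xE3 starts a three-byte sequence, while 0x7F is a complete one-byte character on its own, which is worth keeping in mind when testing the 0x7F-prefixed data against UTF-8.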