What multi-byte character set starts with 0x7F and is 4 bytes long?

Posted 2024-07-15 22:28:41

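I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information on it? Thanks.

UPDATE: I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays properly on my computer if I choose the Japanese locale in Windows, but it doesn't display properly in our application. With any locale other than Japanese selected, I cannot even view the file names properly. So I'm guessing this encoding is not Unicode. Does anyone know what this is? Is it ANSI? Is it Shift JIS?

For the Chinese one, I've tested it with Unicode and UTF-8 characters and I get the same pattern: 0x7F followed by three bytes. Are Unicode and UTF-8 the same?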

放赐 2024-07-22 22:28:41

One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long

What are the other bytes? Do you have any Latin text in this encoding?

If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
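To make the UTF-32LE and UTF-16 readings concrete, here is a minimal Python sketch. The second byte (0x4E) is a made-up value chosen purely for illustration; the question never says what the bytes after 0x7F actually are.

```python
# Four hypothetical bytes: 0x7F, an arbitrary second byte, then two zero bytes.
sample = bytes([0x7F, 0x4E, 0x00, 0x00])

# Read as one UTF-32LE code unit (low byte first): a single code point, U+4E7F.
as_utf32 = sample.decode("utf-32-le")

# Read as two UTF-16LE code units: U+4E7F followed by U+0000.
as_utf16le = sample.decode("utf-16-le")

# Read as two UTF-16BE code units: U+7F4E followed by U+0000.
as_utf16be = sample.decode("utf-16-be")

for label, text in [("UTF-32LE", as_utf32),
                    ("UTF-16LE", as_utf16le),
                    ("UTF-16BE", as_utf16be)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

Seeing which of these readings yields sensible characters for real data is a quick way to rule the standard encodings in or out.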

Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.

ETA:

are there supposed to be Byte Order Marks?

There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).

I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.

That's surely UTF-8. A three-byte sequence 0xE3 0x... 0x... decodes to a character in the range U+3000 to U+3FFF, which is where the hiragana and katakana live.
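A short Python check of that range, as a sketch: it decodes the smallest and largest three-byte UTF-8 sequences with lead byte 0xE3, plus one hiragana letter picked as an example (the specific letter is my choice, not something from the question).

```python
# Smallest and largest three-byte UTF-8 sequences whose lead byte is 0xE3.
lowest = bytes([0xE3, 0x80, 0x80]).decode("utf-8")
highest = bytes([0xE3, 0xBF, 0xBF]).decode("utf-8")
print(f"U+{ord(lowest):04X} .. U+{ord(highest):04X}")

# One concrete hiragana letter: the bytes E3 81 82 decode to U+3042.
hiragana_a = bytes([0xE3, 0x81, 0x82]).decode("utf-8")
print(f"U+{ord(hiragana_a):04X}", hiragana_a)
```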

It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.

Then chances are your application is one of the regrettable horde of non-Unicode-compliant apps, still using the ‘A’(*) versions of the Win32 interfaces instead of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.

(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
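The code page dependence can be imitated outside Windows with Python's codecs. This is only an analogy for what the ‘A’ interfaces do, not a call into Win32; cp932 and cp1252 are assumed here as stand-ins for a Japanese and a Western system code page.

```python
text = "日本語"

# A Japanese code page (cp932, Shift JIS based) can represent the text.
japanese_bytes = text.encode("cp932")

# A Western code page (cp1252) has no slot for these characters at all.
try:
    text.encode("cp1252")
    western_ok = True
except UnicodeEncodeError:
    western_ok = False

# The same bytes interpreted under the wrong code page come out as mojibake.
misread = japanese_bytes.decode("cp1252", errors="replace")

print("representable in cp1252:", western_ok)
print("survives a cp1252 round trip:", misread == text)
```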

ETA(2):

OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher once you assume that Unicode code points are being encoded.

0x00-0x7E: plain ASCII
0x7F A B C: Unicode character

The code point encoded by an escape can be calculated by looking up each of A, B and C in a key string, taking their indexes in that string, and combining them:

A*0x1000 + B*0x40 + C

That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:

.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.

So, for example:

0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
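The worked example above can be reproduced with a few lines of Python. This is a sketch under the answer's own assumptions: the key string is the guessed one, the ‘.’ and ‘_’ entries are unconfirmed, and `decode_escape` is a name invented for this illustration.

```python
# Key string guessed in the answer; '.' and '_' are unconfirmed placeholders.
KEY = ".0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def decode_escape(abc: str) -> str:
    """Turn the three characters that follow 0x7F into one Unicode character."""
    if len(abc) != 3:
        raise ValueError("an escape carries exactly three key characters")
    # str.index raises ValueError if a character is not in the key string.
    a, b, c = (KEY.index(ch) for ch in abc)
    return chr(a * 0x1000 + b * 0x40 + c)

ch = decode_escape("3ug")
print(f"U+{ord(ch):04X}", ch)  # the worked example: U+4EAC
```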

ETA(3):

Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
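As a rough sketch of that approach in Python: walk the bytes, pass plain ASCII through, and turn each 0x7F escape into a character with `chr`. The key string is still the guessed one, `decode_legacy` is a hypothetical helper name, and rejecting bytes above 0x7F is my assumption, since the scheme as described only uses 0x00-0x7F.

```python
KEY = b".0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
ESCAPE = 0x7F

def decode_legacy(data: bytes) -> str:
    """Decode bytes that mix plain ASCII with 0x7F A B C escapes."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESCAPE:
            triple = data[i + 1:i + 4]
            if len(triple) != 3:
                raise ValueError(f"truncated escape at offset {i}")
            # bytes.index raises ValueError for a byte outside the key string.
            a, b, c = (KEY.index(k) for k in triple)
            out.append(chr(a * 0x1000 + b * 0x40 + c))
            i += 4
        elif byte < ESCAPE:
            out.append(chr(byte))  # 0x00-0x7E is plain ASCII
            i += 1
        else:
            # Assumption: bytes above 0x7F never occur in this encoding.
            raise ValueError(f"unexpected byte 0x{byte:02X} at offset {i}")
    return "".join(out)

print(decode_legacy(b"Hi \x7f3ug!"))
```

The result is an ordinary Python string, so it can be re-encoded to UTF-8 or UTF-16 with `str.encode` rather than by hand.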

I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order, as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding, as 6 bits per character is about as much as you can get out of readable ASCII.

Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
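The elimination process can be replayed in Python against the one sample given in this answer (`3ug`, which should come out as U+4EAC). The candidate names and the `value` helper are invented for this sketch, and ‘.’ and ‘_’ are the same arbitrary fillers as before.

```python
import string

DIGITS = string.digits
UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase

# Candidate key strings, in the order the reasoning above tries them.
candidates = {
    "standard Base64": UPPER + LOWER + DIGITS + "+/",
    "digits first": DIGITS + UPPER + LOWER,
    "one leading filler": "." + DIGITS + UPPER + LOWER,
    "leading and middle filler": "." + DIGITS + UPPER + "_" + LOWER,
}

def value(key: str, abc: str) -> int:
    """Combine the key-string indexes of three characters into a code point."""
    a, b, c = (key.index(ch) for ch in abc)
    return a * 0x1000 + b * 0x40 + c

TARGET = 0x4EAC  # the sample character from the worked example
results = {name: value(key, "3ug") for name, key in candidates.items()}
for name, v in results.items():
    print(f"{name:27} U+{v:04X}  target minus value: {TARGET - v:+d}")
```

Standard Base64 lands outside the Basic Multilingual Plane, the digits-first alphabet lands too low, one leading filler gets within 65 of the target, and the second filler closes the gap.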

It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
