Two Unicode encodings represent one Cyrillic letter
I have a string in the following two representations:
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
and
ЕÑли повезет то ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ Ñкину.
The desired output is "Если повезет то сегодня уже скину".
I have tried every encoding I could think of but still wasn't able to get it in complete Cyrillic form.
The best I got was
'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'
using Windows-1252.
I've also noticed that each Cyrillic letter in the desired string corresponds to two Unicode escapes, for example \u00d0\u0095 = 'Е'.
Does anyone know which encoding this is and how to use it to get a normal result?
You have a mis-decoded string where the UTF-8 bytes were decoded as ISO-8859-1 (also known as latin1). Ideally, re-download the data with the correct encoding. Failing that, in Python you can encode the string with the wrongly used encoding to regain the original byte stream, then decode those bytes with the right encoding (UTF-8).
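The answer's original Python snippet did not survive on this page, so the following is only a minimal sketch of the encode-then-decode round trip it describes. The variable names are illustrative, and only the first word of the question's string is used.

```python
# Mojibake: UTF-8 bytes that were wrongly decoded as latin1 (ISO-8859-1).
# This is the first word of the string from the question.
garbled = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8'

# Encoding with latin1 recovers the original byte stream.
raw_bytes = garbled.encode('latin1')

# Decoding those bytes as UTF-8 gives the intended text.
fixed = raw_bytes.decode('utf-8')
print(fixed)
```

If the string contains characters above U+00FF, the encode step raises UnicodeEncodeError, which usually means a different encoding was involved in the mis-decoding.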
You may also have a literal string of Unicode escape codes (backslash, u and four hex digits as actual text), which is a bit trickier. In this case, the string has to be converted to bytes, decoded as Unicode escapes, then encoded back to bytes and finally decoded as UTF-8.
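As a rough sketch of that four-step chain, assuming the text really contains backslash-u sequences as literal characters (here simulated with a raw string; the names are illustrative):

```python
# A raw string: the backslashes and hex digits are literal text,
# not escape sequences interpreted by Python.
literal = r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8'

# 1. Text -> bytes (the escapes are plain ASCII).
# 2. Interpret the escapes, giving the latin1-style mojibake string.
mojibake = literal.encode('ascii').decode('unicode_escape')

# 3. Mojibake -> original bytes via latin1.
# 4. Bytes -> text via UTF-8.
decoded = mojibake.encode('latin1').decode('utf-8')
print(decoded)
```

The ascii codec is used in step 1 on the assumption that the input holds nothing but escapes and ASCII characters; input with other characters would need different handling.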
This works because latin1 maps the first 256 Unicode code points 1:1 to the byte values 0-255, so encoding with that codec turns each code point back into the byte with the same value.

d0 95 d1 81 d0 bb d0 b8 is the correct UTF-8 octet stream for "Если". So you need to convert each character to a byte (an 8-bit octet) by dropping the most significant part of its code point (which is always 0 in your example anyway), then decode the resulting bytes as UTF-8.
Or better, go back to the source from which you got this, and make sure the stream of octets is not interpreted as a single-byte encoding.
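The character-to-octet conversion described in this answer could be sketched in Python as follows. This is one possible reading of "removing the most significant part", and the check that every code point fits in a byte is an added safeguard, not something the answer states.

```python
# First word of the question's string: each character stands for one octet.
chars = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8'

# Safeguard (an assumption): every code point must fit into 8 bits,
# otherwise dropping the high part would lose information.
assert all(ord(ch) < 256 for ch in chars)

# Keep only the low 8 bits of each code point to get the octet stream.
octets = bytes(ord(ch) & 0xFF for ch in chars)
print(octets.hex(' '))  # requires Python 3.8+ for the separator argument

# Decode the octets as UTF-8 to get the Cyrillic text.
text = octets.decode('utf-8')
print(text)
```

For input that satisfies the check, this produces the same bytes as encoding with latin1.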