UTF-8 encoding error, need help converting text


I've been working on a statistical translation system for Haiti (code.google.com/p/ccmts) that uses a C++ backend (http://www.statmt.org/moses/?n=Development.GetStarted), with Python driving the C++ engine/backend.

I've passed a UTF-8 Python string into a C++ std::string, done some processing, gotten a result back into Python and here is the string (when printed from C++ into a Linux terminal):

mwen bezwen ã ¨ d medikal

  1. What encoding is that? Is it a double encoded string?
  2. How do I "fix it" so it's renderable?
  3. Is that printed in that fashion because I'm missing a font or something?

The Python chardet library says:

{'confidence': 0.93812499999999999, 'encoding': 'utf-8'}

but when I run a string/unicode/codecs decode in Python, I get the old:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 30: ordinal not in range(128)

Oh, and Python prints that exact same string to standard output.

A repr() call prints the following: ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '


Comments (3)

风吹短裙飘 2024-08-30 01:29:25


It looks like a case of garbage in, garbage out. Here are a few clues on how to see what you've got in your data. repr() and unicodedata.name() are your friends.

>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> print repr(s.decode('utf8'))
u' mwen bezwen \xe3 \xa8 d medikal '
>>> import unicodedata
>>> unicodedata.name(u'\xe3')
'LATIN SMALL LETTER A WITH TILDE'
>>> unicodedata.name(u'\xa8')
'DIAERESIS'
>>>

Update:

If (as A. N. Other implies) you are letting the package choose the output language at random, and you suspect its choice is e.g. Korean: (a) tell us; (b) try to decode the output using a codec relevant to that language. Here are attempts with not only Korean but also two each of Chinese, Japanese, and Russian:

>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> for enc in 'euc-kr big5 gb2312 shift-jis euc-jp cp1251 koi8-r'.split():
    print enc, s.decode(enc)


euc-kr  mwen bezwen 찾 짢 d medikal 
big5  mwen bezwen 瓊 穡 d medikal 
gb2312  mwen bezwen 茫 篓 d medikal 
shift-jis  mwen bezwen テ」 ツィ d medikal 
euc-jp  mwen bezwen 達 即 d medikal 
cp1251  mwen bezwen ГЈ ВЁ d medikal 
koi8-r  mwen bezwen цё б╗ d medikal 
>>> 

None very plausible, really, especially the koi8-r. Further suggestions: inspect the documentation of the package you are interfacing with (URL, please!) ... what does it say about encoding? Between which two languages are you trying it? Does "mwen bezwen" make any sense in the expected output language? Try a much larger sample of text -- does chardet still indicate UTF-8? Does any of the larger output make sense in the expected output language? Try translating English into another language that uses only ASCII -- do you get meaningful ASCII output? Do you care to divulge your Python code and your SWIG interface code?

Update 2: The information flow is interesting: "a string processing app" -> "a statistical language translation system" -> "a machine translation system (opensource/freesoftware) to help out in haiti (crisiscommons.org)"

Please try to replace "unknown" by the facts in the following:

Input language: English (guess)
Output language: Haitian Creole
Operating system: linux
Python version: unknown
C++ package name: unknown
C++ package URL: unknown
C++ package output encoding: unknown

Test 1 input: unknown
Test 1 expected output: unknown
Test 1 actual output (utf8): ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
[Are all of those internal spaces really in the string?]

Test 2 input: 'I need medical aid.'
Test 2 expected output (utf8): 'Mwen bezwen \xc3\xa8d medikal.'
Test 2 actual output (utf8): unknown

Test 2 obtained from both Google Translate (alpha) and
Microsoft Translate (beta):
Mwen bezwen èd medikal.
The third word is LATIN SMALL LETTER E with GRAVE (U+00E8) followed by 'd'.
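
Incidentally, the actual bytes are consistent with double encoding. Here is a sketch that reproduces them from the expected e-grave, assuming the correct UTF-8 bytes were decoded as Latin-1, lowercased somewhere in the pipeline (an assumption on my part; lowercasing is a common MT preprocessing step), and re-encoded as UTF-8:

>>> good = u'\xe8'                    # LATIN SMALL LETTER E WITH GRAVE
>>> once = good.encode('utf-8')       # correct UTF-8: '\xc3\xa8'
>>> once.decode('latin-1')            # mis-decoded as Latin-1: A-tilde, diaeresis
u'\xc3\xa8'
>>> once.decode('latin-1').lower().encode('utf-8')
'\xc3\xa3\xc2\xa8'

That matches the reported '\xc3\xa3 \xc2\xa8' except for the inserted space.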

Update 3

You said """input: utf8 (maybe, i think a couple of my files might have improperly coded text in them) """

Assuming (you've never stated this explicitly) that all your files should be encoded in UTF-8:

The zip file of the aligned en-fr-ht corpus has several files that crash when one attempts to decode them as UTF-8.

Diagnosis of why this happens:

chardet is useless (in this case); it faffs about for a long time and comes back with a guess of ISO-8859-2 (Eastern Europe aka Latin2) with a confidence level of 80 to 90 pct.

Next step: chose the ht-en directory (ht uses fewer accented chars than fr therefore easier to see what is going on).

Expectation: e-grave is the most frequent non-ASCII character in presumed-good ht text (a web site, CMU files) ... about 3 times as many as the next one, o-grave. The 3rd most frequent one is lost in the noise.

Got counts of non-ASCII bytes in file hten.txt. Top 5:

8a 99164
95 27682
c3 8210
a8 6004
b2 2159
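
A count like this takes only a few lines of Python -- a sketch, assuming Python 2.7 (for collections.Counter) and the hten.txt file named above:

import collections

# read the raw bytes; in Python 2, iterating over a str yields 1-character strings
data = open('hten.txt', 'rb').read()
counts = collections.Counter(b for b in data if b >= '\x80')
for b, n in counts.most_common(5):
    print b.encode('hex'), n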

The last three rows are explained by

e-grave is c3 a8 in UTF-8
o-grave is c3 b2 in UTF-8
2159 + 6004 approx == 8210
6004 approx == 3 * 2159

The first two rows are explained by

e-grave is 8a in old Western Europe DOS encodings like cp850!!
o-grave is 95 in old Western Europe DOS encodings like cp850!!
99164 approx == 3 * 27682

Explanations that include latin1 or cp1252 don't hold water (8a is a control character in latin1; 8a is S-caron in cp1252).
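
These byte values are easy to confirm in the interpreter (Python 2, as in the sessions above):

>>> u'\xe8'.encode('utf-8')    # e-grave in UTF-8
'\xc3\xa8'
>>> u'\xe8'.encode('cp850')    # e-grave in cp850
'\x8a'
>>> u'\xf2'.encode('cp850')    # o-grave in cp850
'\x95'
>>> import unicodedata
>>> unicodedata.name('\x8a'.decode('cp1252'))
'LATIN CAPITAL LETTER S WITH CARON'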

Inspection of the contents reveals that the file is a conglomeration of multiple original files, some UTF-8, at least one cp850 (or similar). The culprit appears to be the Bible!!!

The mixture of encodings explains why chardet was struggling.

Suggestions:

(1) Implement checking of encoding on all input files. Ensure that they are converted to UTF-8 right up front, like at border control.

(2) Implement a script to check UTF-8 decodability before release; a minimal sketch follows this list.

(3) The orthography of the Bible text appears (at a glance) to be different to that of the websites (many more apostrophes). You may wish to discuss with your Creole experts whether your corpus is being distorted by a different orthography ... there is also the question of vocabulary; do you expect to get much use out of unleavened bread and sackcloth & ashes? Note that the cp850 stuff appears to be about 90% of the conglomeration; some Bible might be OK, but 90% seems over the top.

(4) Why is Moses not complaining about non-UTF-8 input? Possibilities: (1) it works on raw bytes, i.e. it doesn't convert to Unicode; (2) it attempts to convert to Unicode but silently ignores failures :-(
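
A minimal sketch of the check proposed in suggestions (1) and (2), in the same Python 2 style as the rest of this thread (the file list and report format are illustrative, not from the original post):

import sys

def check_utf8(paths):
    # report every file that is not valid UTF-8, with the offending byte and offset
    ok = True
    for path in paths:
        data = open(path, 'rb').read()
        try:
            data.decode('utf-8')
        except UnicodeDecodeError, e:
            print '%s: not UTF-8 (byte %02x at offset %d)' % (
                path, ord(e.object[e.start]), e.start)
            ok = False
    return ok

if __name__ == '__main__':
    sys.exit(0 if check_utf8(sys.argv[1:]) else 1)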

后知后觉 2024-08-30 01:29:25


Looks like your default encoding is ASCII.

You can either explicitly convert your unicode strings:

print u"Hellö, Wörld".encode("utf-8")

Or, if you want to change this globally in your script, replace sys.stdout with a wrapper that encodes it as utf-8:

import sys, codecs
# wrap stdout in a stream writer that UTF-8-encodes unicode strings on the way out
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
print u"Hellö, Wörld!"

Furthermore, you can change the default encoding once and for all (site-wide) via sys.setdefaultencoding, but this can only be done in sitecustomize.py. I wouldn't do this, however -- convenient as it may seem, it affects all Python scripts on your system and might have unintended side effects.

残月青衣踏尘吟 2024-08-30 01:29:25


Edit: Nevermind that junk I posted before; it was wrong.

As others have suggested, this will get you the correct unicode object in Python, assuming it's meant to be UTF-8:

>>> ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '.decode('utf-8')
u' mwen bezwen \xe3 \xa8 d medikal '
>>> print _
 mwen bezwen ã ¨ d medikal

It really does seem to be a case of your library giving you garbage, whether garbage went into it or not.
