Python - 'ascii' codec can't encode Unicode characters: error
:) I am trying to reverse the transliteration of an input file (currently in English) back to its original form (in Hindi).
A sample of the input file looks like this:
E-k- b-u-d-z*dhi-m-aan- p-ksii#
E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
U-s- k-ii p-t-z*t-o-ng s-e- l-d-ii shaakhaay-e-ng m-j-*zb-uut- b-aaj-u-O-ng k-ii t-r-h- pheil-ii h-u-II thiing#
w-n- h-NNs-o-ng k-aa E-k- jhu-nhz*D- I-s- p-e-dr p-r- n-i-w-aas- k-r-t-aa thaa#
w-e- s-b- y-h-aaNN s-u-r-ksi-t- the- AUr- b-dre- AAr-aam- s-e- r-h-t-e- the-#
U-n- m-e-ng s-e- E-k- p-ksii b-h-u-t- b-u-d-z*dhi-m-aan- thaa#
I-s- b-u-d-z*dhi-m-aan- p-ksii n-e- E-k- d-i-n- p-e-dr k-ii j-dr m-e-ng s-e- E-k- l-t-aa k-o- U-g-t-e- d-e-khaa#
I-s- k-e- b-aar-e- m-e-ng U-s-n-e- d-uus-r-e- p-ksi-y-o-ng s-e- b-aat- k-ii#
"k-z*y-aa t-u-m-z*h-e-ng w-h- l-t-aa d-i-khaaII d-e-t-ii h-ei", U-s- n-e- U-n- s-e- p-uuchaa "t-u-m-z*h-e-ng I-s-e- n-Shz*T- k-r- d-e-n-aa c-aah-i-E-"#
"I-s-e- k-z*y-o-ng n-Shz*T- k-r- d-e-n-aa c-aah-i-E-?" h-NNs-o-ng n-e- AAshz*c-*ry- s-e- p-uuchaa "y-h- t-o- I-t-n-ii cho-T-ii s-e- h-ei#
h-m-e-ng y-h- k-z*y-aa h-aan-i- p-h-u-NNc-aa s-k-t-ii h-ei"#
"m-e-r-e- m-i-tro-ng," b-u-d-z*dhi-m-aan- p-ksii n-e- U-t-z*t-r- d-i-y-aa "w-h- cho-T-ii s-ii l-t-aa j-l-z*d-ii h-ii b-drii h-o- j-aay-e-g-ii#
y-h- h-m-aar-e- p-e-dr p-r- c-Dh*z k-r- U-s- s-e- l-i-p-T-t-ii j-aay-e-g-ii AUr- phi-r- m-o-T-ii AUr- m-j-*zb-uut- h-o- j-aay-e-g-ii"#
"t-o- k-z*y-aa h-u-AA"#
Its equivalent meaning in English is:
A WISE OLD BIRD.
Deep in the forest stood a very tall tree.
Its leafy branches spread out like long arms.
This was the home of a flock of wild geese.
They were safe there.
One of the geese was a wild old bird.
One day this wise old bird noticed a small creeper growing at the foot of the tree.
He spoke to the other birds about it.
"Do you see that creeper ?" he said to them.
"You must destroy it."
"Why must we destroy it ?" asked the geese in surprise.
"It is so small.
What harm can it do?"
"My friends," replied the wise old bird, " that little creeper will soon grow.
My script looks like this:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
CODEC = 'utf-8'
input_file=sys.argv[1]
output_file=sys.argv[2]
list1=[]
f=open(input_file,'r')
f1 = open(output_file,'w')
english_hindi_dict={'A' : u'अ' , 'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
'UU' : u'ऊ' , 'r' : u'ऋ' , 'E' : u'ए' , 'ai' : u'ऐ' , 'O' : u'ओ' , 'AU' : u'औ' ,\
'k' : u'क' , 'kh' : u'ख' , 'g' : u'ग' , 'gh' : u'घ' , 'c' : u'च' , 'ch' : u'छ',\
'j': u'ज' , 'jh' : u'झ' , 'tr' : u'त्र' , 'T' : u'ट' , 'Th' : u'ठ' , 'D' : u'ड',\
'dr' : u'ड' , 'Dh' : u'ढ' , 'Na' : u'ण' , 'th' : u'त' , 'tha' : u'थ',\
'd' : u'द' , 'dh': u'ध' , 'n' : u'न' , 'p' : u'प' , 'ph' : u'फ' ,\
'b' : u'ब' , 'bh' : u'भ' , 'm' : u'म' , 'y' : u'य' , 'r' : u'र' , 'l' : u'ल' ,\
'w' : u'व' , 'sh' : u'श' , 'sha' : u'ष', 's' : u'स' , 'h' : u'ह' , 'ks' : u'क्ष' ,\
'i' : u'ि' , 'ii' : u'ी' , 'u' : u'ु' , 'uu' : u'ू' , 'e' : u'े' ,\
'aa' : u'ै' , 'o' : u'ो' , 'AU' : u'ौ' ,'H' : u'्' ,'mn' : u'ं' ,\
'NN' : u'ँ' , 'AW' : u'ॅ' , 'rr' : u'ृ' , '4' : u'४' , '6': u'६' , '8' : u'८',\
'2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}
for line in f:
    #line=line.strip() to remove a line from its newline character....
    #line=line.rstrip('.')
    line=line.replace('-','')
    line=line.replace('#','|') # i am using the or symbol for poornviram
    #line=line.replace('।','')
    #line = line.lower()
    for word in line:
        for ch in word:
            if (ch in english_hindi_dict) :
                translatedToken = english_hindi_dict[ch]
            else :
                translatedToken = ch
            #{ translatedToken = english_hindi_dict[ch] }
            #for ch in line:
            f1.write(translatedToken)
            #print translatedToken
    #line = line.replace( char,english_hindi_dict[char] )
    #list1.append(line)
f.close()
f1.write(' '.join(list1))
f1.close()
The error that I am getting is:
python transliterate_eh_nw.py Hstory.txt op1.txt
Traceback (most recent call last):
File "transliterate_eh_nw.py", line 43, in <module>
f1.write(translatedToken)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u092f' in position 0: ordinal not in range(128)
Could you please tell me how to deal with this error?
Thank you. :)
You have a few problems other than the one you asked about.
(1) A conceptual problem: "E-k- b-u-d-z*dhi-m-aan- p-ksii#" is not "english". It is Hindi written in ASCII using some romanization scheme. It looks like ITRANS, but ITRANS doesn't have AA and A, it has only aa and a. Does the scheme have a name? Can you supply a URL? Your objective is better described as "transliterate some Hindi text from the unnamed romanization to Devanagari script".
(2) Showing the result of translating your text from Hindi to English ("A WISE OLD BIRD" etc) is only moderately useful. Showing the expected Devanagari output would be a better idea.
(3) As remarked by @kaiser.se, the transliteration dictionary has multi-byte (up to 3 bytes!) keys, some of which are prefixes of others. Presumably AA must be recognised in priority to A, gh must be recognised before g, etc. Iterating over the items of a dictionary happens in an order that is predictable but for your purposes should be regarded as random. In the code that follows, I've given priority to longer "keys".
(4) Either the dictionary is missing some letter keys (a S t z) or the transliteration rules are more complicated than any of us has guessed so far.
(5) The meaning of the characters # * and - is not 100% obvious. It appears from your input text that z and * appear only in combination as z*.
(6) It would be a good idea if you explained the interpretation of e.g. shaakhaay-e-ng ... does it start with sh then aa, or does it start with sha then a? What are the rules?
The answer to the problem that you asked about is, of course, as several others have pointed out, that you need to encode your unicode output using an encoding that is supported by your display device, e.g. UTF-8.
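Point (3) above can be sketched roughly as follows. This is only an illustration of "longer keys first" matching, not a reconstruction of anyone's actual program: the function name transliterate and the small TABLE subset are assumptions made for the example, and the entries are limited to ones that are not in dispute in this thread.

```python
# Hypothetical subset of the asker's table (without the stray trailing spaces).
TABLE = {
    'A': 'अ', 'AA': 'आ',
    'E': 'ए', 'e': 'े',
    'k': 'क', 'kh': 'ख',
    'g': 'ग', 'gh': 'घ',
    'n': 'न',
}

def transliterate(text, table):
    """Replace romanized codes with table values, trying longer keys first."""
    # Sorting by length guarantees 'gh' is tried before 'g', 'AA' before 'A'.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return ''.join(out)

sample = 'E-k- ghn-e-'.replace('-', '')
print(transliterate(sample, TABLE))
```

Characters with no table entry are passed through untouched, which makes missing keys easy to spot in the output.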
Here's some code:
Output:
एक बुदz*धिमैन पक्षी
एक घने जनगगल मेनग एक बहुt ऊँचै पेड थa
उ स की पtztोनग से लदी षaखैयेनग मजzबूt बैजुओनग की tरह फेिली हुई तीनग
वन हँसोनग कै एक झुनहzड इस पेड पर निवैस करtै थa
वे सब यहैँ सुरक्षिt ते ौर बडे आ रैम से रहtे ते
उ न मेनग से एक पक्षी बहुt बुदzधिमैन थa
इस बुदzधिमैन पक्षी ने एक दिन पेड की जड मेनग से एक लtै को उ गtे देखै
इस के बैरे मेनग उ सने दूसरे पक्षियोनग से बैt की
"कzयै tुमzहेनग वह लtै दिखैई देtी हेि", उ स ने उ न से पूछै "tुमzहेनग इसे नSहzट कर देनै चैहिए"
"इसे कzयोनग नSहzट कर देनै चैहिए?" हँसोनग ने आ शzचरय से पूछै "यह tो इtनी छोटी से हेि
हमेनग यह कzयै हैनि पहुँचै सकtी हेि"
"मेरे मित्रोनग," बुदzधिमैन पक्षी ने उ tztर दियै "वह छोटी सी लtै जलzदी ही बडी हो जैयेगी
यह हमैरे पेड पर चढz कर उ स से लिपटtी जैयेगी ौर फिर मोटी ौर मजzबूt हो जैयेगी"
"tो कzयै हुआ "
which has only a few recognisable words when shoved through Google Translate.
Update after examining the transliteration table more closely:
Three of the entries (AA, II, and U) have a space after the Devanagari equivalent. Perhaps the spaces should be removed.
The general pattern for consonants appears to be:
DEVANAGARI LETTER XA is represented by x
DEVANAGARI LETTER XXA is represented by X
DEVANAGARI LETTER XHA is represented by xh
DEVANAGARI LETTER XXHA is represented by Xh
However 3 entries break the pattern:
SSA -> sha but pattern says S
TA -> th but pattern says t
THA -> tha but pattern says th
Note: changing the above 3 entries stopped my code from complaining that S and t were left unchanged when transliterating your sample text, and removed the seemingly-anomalous sha and tha entries.
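The consonant pattern described above can be checked against the Unicode character names. The sketch below is an assumption-laden illustration (the helper name consonant_for is made up here) that builds the character name from a key following the x / X / xh / Xh rule and looks it up with the standard unicodedata module.

```python
import unicodedata

def consonant_for(key):
    """Map a one- or two-letter key to a Devanagari consonant using the
    pattern x -> XA, X -> XXA, xh -> XHA, Xh -> XXHA."""
    head = key[0]
    # An upper-case head doubles the letter in the Unicode name (T -> TT).
    stem = head * 2 if head.isupper() else head.upper()
    # A trailing 'h' on a two-letter key marks the aspirated form.
    if len(key) == 2 and key.endswith('h'):
        stem += 'H'
    return unicodedata.lookup('DEVANAGARI LETTER ' + stem + 'A')

for key in ('t', 'T', 'th', 'Dh', 'S', 'sh'):
    char = consonant_for(key)
    print(key, char, unicodedata.name(char))
```

Under this rule t, th and S come out as TA, THA and SSA, which matches the three corrections suggested above.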
Entries (D and dr) are mapped to the same character, DEVANAGARI LETTER DDA. D is the expected entry for that character; perhaps dr should be mapped elsewhere.
There is no entry for DEVANAGARI LETTER NGA (U+0919); perhaps it should be encoded as ng -- there are a few words ending in ng in the sample text.
Are the uncatered-for "z*" occurrences in the sample text anything to do with DEVANAGARI LETTER ZA (U+095B)?
list1, at this point, contains Unicode strings. You can't write Unicode directly to a file; it's a byte interface. You should either encode it explicitly, (' '.join(list1).encode('utf-8')), or, as Ignacio suggests, use a codecs wrapper to implicitly encode the Unicode strings you send to it. At the moment you are defining a variable CODEC, but not doing anything with it.
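As a minimal sketch of the implicit-encoding route, assuming io.open is an acceptable stand-in for the codecs wrapper (it takes an encoding argument in the same way and exists on both Python 2.6+ and Python 3), the write could look like this. The helper name write_utf8 and the temporary output path are made up for the example.

```python
import io
import os
import tempfile

CODEC = 'utf-8'  # the constant the script defines but never uses

def write_utf8(path, pieces):
    """Write Unicode strings to a file, letting the file object encode them."""
    with io.open(path, 'w', encoding=CODEC) as out:
        out.write(u' '.join(pieces))

pieces = [u'\u090f\u0915', u'\u092a\u0915\u094d\u0937\u0940']  # एक पक्षी
path = os.path.join(tempfile.mkdtemp(), 'op1.txt')
write_utf8(path, pieces)

# Reading the raw bytes back shows that UTF-8 was written.
with io.open(path, 'rb') as raw:
    data = raw.read()
print(data.decode(CODEC))
```

The explicit alternative is to open the file in binary mode and call .encode(CODEC) on each string before writing it.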
Are you sure you want to remove all the hyphens(-)? Looking at your input file, it looks like all replacements are two- or three-character codes, such as u'I-':u'इ'. If this is so, you could do something like below, but make sure you're using Unicode strings for all your keys and values in the dictionary:
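A rough sketch of that idea, under the assumption that each code keeps its trailing hyphen as part of the dictionary key: the CODES subset and the function name replace_codes are hypothetical, chosen only to illustrate the approach with Unicode strings throughout.

```python
import re

# Hypothetical subset: keys include the trailing hyphen, all strings are Unicode.
CODES = {
    u'E-': u'ए',
    u'I-': u'इ',
    u'k-': u'क',
    u's-': u'स',
}

# Longer codes are listed first so that a three-character code would win
# over a two-character code sharing the same prefix.
PATTERN = re.compile(
    u'|'.join(re.escape(k) for k in sorted(CODES, key=len, reverse=True))
)

def replace_codes(text):
    """Replace every known code in text, leaving everything else as it is."""
    return PATTERN.sub(lambda m: CODES[m.group(0)], text)

print(replace_codes(u'E-k- I-s-'))
```

Anything that is not a key, such as z*, stays in the output unchanged, so the gaps in the dictionary show up directly.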
Following that theory, I got the following result, which looks like translations such as 'z*', 't-', 'ng', and 'ei' are missing from the dictionary. I don't read Hindi, but Google Translate came up with some of the English words in your translation, so I think I'm on the right track.