Python: Converting Chinese characters to pinyin with CJKLIB

Posted 2024-12-01 22:53:55

I'm trying to convert a bunch of Chinese characters into pinyin, reading the characters from one file and writing the pinyin into another. I'm working with the CJKLIB functions to do this.

Here's the code,

from cjklib.characterlookup import CharacterLookup

source_file = 'cities_test.txt'
dest_file = 'output.txt'

s = open(source_file, 'r')
d = open(dest_file, 'w')

cjk = CharacterLookup('T')

for line in s:
    p = line.split('\t')
    for p_shard in p:
        for c in p_shard:
            readings = cjk.getReadingForCharacter(c.encode('utf-8'), 'Pinyin')
            d.write(readings[0].encode('utf-8'))
        d.write('\t')
    d.write('\n')

s.close()
d.close()

My problem is that I keep running into Unicode-related errors; the error comes up when I call the getReadingForCharacter function. If I call it as written,

readings = cjk.getReadingForCharacter(c.encode('utf-8'), 'Pinyin')

I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128).

If I call it like this, without the .encode(),

readings = cjk.getReadingForCharacter(c, 'Pinyin')

I get an error thrown by sqlalchemy (the CJKLIB uses sqlalchemy and sqlite): You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings ... etc.

Can someone help me out? Thanks!

Oh also, is there a way for CJKLIB to return the pinyin without any tones? I think by default it's returning pinyin with these weird characters to represent the tones; I just want the letters without the tone marks.


香草可樂 2024-12-08 22:53:55

Your bug is that you are not decoding the input stream, and yet you are turning around and re-encoding it as though it were UTF-8. That’s going the wrong way.

You have two choices.

You can codecs.open the input file with an explicit encoding so you always get back regular Unicode strings whenever you read from it because the decoding is automatic. This is always my strong preference. There is no such thing as a text file anymore.
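
For example, here is a minimal sketch of what the script might look like that way, assuming Python 2 (as in the question) and UTF-8 files on both ends; the check on readings is only there so that characters without a Pinyin reading do not raise an IndexError:

import codecs
from cjklib.characterlookup import CharacterLookup

cjk = CharacterLookup('T')

s = codecs.open('cities_test.txt', 'r', encoding='utf-8')
d = codecs.open('output.txt', 'w', encoding='utf-8')

for line in s:                            # line is already a unicode string
    for p_shard in line.rstrip('\n').split('\t'):
        for c in p_shard:                 # c is a single unicode character
            readings = cjk.getReadingForCharacter(c, 'Pinyin')
            if readings:                  # skip characters with no reading
                d.write(readings[0])      # codecs encodes on write
        d.write('\t')
    d.write('\n')

s.close()
d.close()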

Your other choice is to manually decode your binary string before you pass it to the function. I hate this style, because it almost always indicates that you're doing something wrong, and even when it doesn't, it is clumsy as all get out.
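
If you go that route anyway, the minimal change to the question's loop looks something like this (reusing the plain open() file objects s and d and the cjk lookup from the question, and still assuming UTF-8 input):

for line in s:
    uline = line.decode('utf-8')                     # byte string -> unicode, by hand
    for p_shard in uline.rstrip('\n').split('\t'):
        for c in p_shard:
            readings = cjk.getReadingForCharacter(c, 'Pinyin')
            if readings:
                d.write(readings[0].encode('utf-8')) # and back to bytes for the plain file
        d.write('\t')
    d.write('\n')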

I would do the same thing for the output file. I just hate seeing manual .encode("utf-8") and .decode("utf-8") calls all over the place. Set the stream encoding and be done with it.
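
As for the follow-up question about tones: the readings come back with tone diacritics, and one cjklib-independent way to drop them is to decompose the strings and throw away the combining marks with the standard unicodedata module. This is only a rough sketch, not a cjklib feature, and it also strips the umlaut from ü (so lǜ becomes lu):

# -*- coding: utf-8 -*-
import unicodedata

def strip_tone_marks(pinyin):
    # Decompose e.g. u'ě' into 'e' plus a combining caron, then keep
    # only the characters that are not combining marks.
    decomposed = unicodedata.normalize('NFD', pinyin)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))

toneless = strip_tone_marks(u'běijīng')    # u'beijing'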
