How can I make this Python 2.6 function work with Unicode?

I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.

import nltk

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()          # Python 2: read() returns a byte string
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))  # unique, sorted vocabulary
    return nltkvocab

When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know it has to do with calling a function that converts the text into unicode strings. If so, it seems to me it might not happen inside that function definition at all, but here, where I prepare to write to the file:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)   # readmethod is the file mode, e.g. 'w'
    jottedf = '\n'.join(jotted)               # one item per line
    filemydata.write(jottedf)
    filemydata.close()
    return 0
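
To illustrate the clobbering: in Python 2, read() hands back a byte string, and per-character operations on UTF-8 bytes can land in the middle of a two-byte character. A minimal reproduction of what I mean (my own toy example, not from the book):

>>> raw = u'\xfcber'.encode('utf-8')    # u'über' stored as UTF-8 bytes
>>> raw
'\xc3\xbcber'
>>> len(raw), len(raw.decode('utf-8'))  # five bytes, but only four characters
(5, 4)
>>> raw[:1]                             # slicing the bytes splits the umlaut
'\xc3'
>>> raw.decode('utf-8')[:1]             # slicing the unicode keeps it whole
u'\xfc'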

I heard that what I had to do was decode the string into unicode after reading it from the file. I tried amending the function like so:

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')   # bytes -> unicode
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

But that brought this error when I used it on Hungarian. When I used it on German, I had no errors.

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
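
The traceback seems to point at map(str, tokens[:8]) inside nltk.Text; sure enough, calling str() on a unicode string that contains a non-ASCII character reproduces the error all by itself (maybe the first eight tokens of my German text happened to be plain ASCII, and that's why German passed?):

>>> str(u'\xe1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)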

I fixed the function that files the data like so:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

However, that brought this error when I tried to file the German:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)

...which is what you get when you try to write the u'\n'.join'ed data.

>>> jottedf = u'\n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)


Answer (橘味果▽酱, 2024-10-01 21:50:59):


For each string that you read from your file, you can convert it to unicode by calling rawness.decode('utf-8'), if you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's made of unicode objects and use u'\n'.join(jotted) instead.
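
For instance, with the UTF-8 byte sequences for ü ('\xc3\xbc') and ö ('\xc3\xb6'), the decode gives you back real characters (a quick illustration with made-up data):

>>> rawness = '\xc3\xbcber t\xc3\xb6ten'   # UTF-8 bytes, as read from a file
>>> unirawness = rawness.decode('utf-8')
>>> unirawness
u'\xfcber t\xf6ten'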

Update:

It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])

and this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))

but if jotted really is a list of UTF-8-encoded str objects, then you don't need the encode step, and this should be enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
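
Putting the pieces together, your reading function might end up looking like this (an untested sketch; note that .lower() on a UTF-8 byte string only lowercases ASCII letters, so I lowercase while the tokens are still unicode):

import nltk

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()                 # Python 2: a byte string
    fileopen.close()
    unirawness = rawness.decode('utf-8')      # bytes -> unicode for tokenizing
    tokens = nltk.wordpunct_tokenize(unirawness)
    lowered = [w.lower() for w in tokens]     # lowercase on the unicode side
    # hand nltk.Text UTF-8 encoded str objects so its internal str() call succeeds
    nltktext = nltk.Text([w.encode('utf-8') for w in lowered])
    nltkvocab = sorted(set(nltktext))
    return nltkvocab

And an aside on the writing side: Python 2's codecs.open returns a file object that encodes unicode objects for you on write, so another option is to keep unicode throughout and let the file object do the encoding (again a sketch, assuming you always want UTF-8 output):

import codecs

def jotindex(jotted, filename, readmethod):
    # the file object returned by codecs.open encodes unicode on write
    filemydata = codecs.open(filename, readmethod, encoding='utf-8')
    filemydata.write(u'\n'.join(jotted))
    filemydata.close()
    return 0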

By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least, in the demos). Better be careful and check that it has processed your tokens correctly. Also, check your encodings; a mismatch there may be why you get errors with the Hungarian text and not the German.
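
One crude way to check (my own sketch): attempt the decode and see whether, and where, it blows up:

def check_encoding(path, encoding='utf-8'):
    # try decoding the whole file; the exception names the first offending byte
    data = open(path).read()
    try:
        data.decode(encoding)
        print '%s decodes cleanly as %s' % (path, encoding)
    except UnicodeDecodeError as err:
        print '%s is not valid %s: %s' % (path, encoding, err)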
