Tokenizing unicode using nltk
I have text files in utf-8 encoding that contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer:
import nltk

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'; raw string so \t isn't read as a tab
text = f.read()
f.close()
items = text.decode('utf8')             # decode the raw bytes to a unicode string
a = nltk.word_tokenize(items)
Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']
The Punkt tokenizer seems to do better:
from nltk.tokenize import PunktWordTokenizer

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = PunktWordTokenizer().tokenize(items)
Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']
There is still a '\ufeff' before the first token that I can't figure out (not that I can't remove it). What am I doing wrong? Help greatly appreciated.
Comments (3)
It's more likely that the \uFEFF char is part of the content read from the file; I doubt it was inserted by the tokeniser. A \uFEFF at the beginning of a file is a deprecated form of the Byte Order Mark; if it appears anywhere else, it is treated as a zero width no-break space. Was the file written by Microsoft Notepad? As the codecs module docs note, try reading your file with codecs.open() instead, using the "utf-8-sig" encoding, which consumes the BOM. Experiment:
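A minimal sketch of that experiment, assuming the file path from the question and a Python 2 / NLTK 2.x setup:

import codecs
import nltk

# "utf-8-sig" strips a leading BOM (U+FEFF) while decoding, if one is present
f = codecs.open(r'C:\Python26\text.txt', encoding='utf-8-sig')
text = f.read()
f.close()

print nltk.word_tokenize(text)
# the first token should no longer start with u'\ufeff'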
You should make sure that you're passing unicode strings to nltk tokenizers. I get the following identical tokenizations of your string with both tokenizers on my end:
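For instance, a sketch of that check; the unicode literal below is the sample string from the question without the BOM, and PunktWordTokenizer is assumed to be available as in NLTK 2.x:

import nltk
from nltk.tokenize import PunktWordTokenizer

items = u'm\xfcsli p\xf6\xf6k r\xe4\xe4k'  # already a unicode string, no BOM

print nltk.word_tokenize(items)
print PunktWordTokenizer().tokenize(items)
# both calls should yield the same tokens, e.g. [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']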
The U+FEFF code is a "ZERO WIDTH NO-BREAK SPACE" character, and the re module does not consider it whitespace, so PunktWordTokenizer(), which uses the regex r'\w+|[^\w\s]+' with the unicode and dotall flags, treats this character as part of a word. If you don't want to remove the character manually, you could use the following tokenizer:
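A sketch of such a tokenizer, assuming nltk's RegexpTokenizer and simply adding \ufeff to the characters excluded by the pattern:

from nltk.tokenize import RegexpTokenizer

# same pattern idea as above, but \ufeff is excluded so the BOM is never captured
tokenizer = RegexpTokenizer(u'\w+|[^\w\s\ufeff]+')
print tokenizer.tokenize(u'\ufeffm\xfcsli p\xf6\xf6k r\xe4\xe4k')
# expected: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']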