Tokenizing unicode using nltk



I have text files that use utf-8 encoding and contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer:

import nltk

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = nltk.word_tokenize(items)

Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']

The Punkt tokenizer seems to do better:

from nltk.tokenize import PunktWordTokenizer

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = PunktWordTokenizer().tokenize(items)

Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

There is still a '\ufeff' before the first token that I can't figure out (not that I can't remove it). What am I doing wrong? Help greatly appreciated.


Comments (3)

樱&纷飞 2025-01-12 01:57:47


It's more likely that the \uFEFF char is part of the content read from the file. I doubt it was inserted by the tokeniser. \uFEFF at the beginning of a file is a deprecated form of Byte Order Mark. If it appears anywhere else, then it is treated as a zero width non-break space.

Was the file written by Microsoft Notepad? From the codecs module docs:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.

Try reading your file using codecs.open() instead. Note the "utf-8-sig" encoding which consumes the BOM.

import codecs
import nltk

f = codecs.open(r'C:\Python26\text.txt', 'r', 'utf-8-sig')
text = f.read()
a = nltk.word_tokenize(text)

Experiment:

>>> open("x.txt", "r").read().decode("utf-8")
u'\ufeffm\xfcsli'
>>> import codecs
>>> codecs.open("x.txt", "r", "utf-8-sig").read()
u'm\xfcsli'
>>> 
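A hypothetical variation on the same idea (not from the original answer): instead of codecs.open(), the raw bytes can be decoded with the same 'utf-8-sig' codec, which likewise drops a leading BOM.

import nltk

# Hypothetical sketch (Python 2): decode the raw bytes with 'utf-8-sig',
# which strips a leading UTF-8 BOM if one is present.
raw = open(r'C:\Python26\text.txt', 'rb').read()
text = raw.decode('utf-8-sig')      # no leading u'\ufeff' in the result
tokens = nltk.word_tokenize(text)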
假情假意假温柔 2025-01-12 01:57:47


You should make sure that you're passing unicode strings to nltk tokenizers. I get the following identical tokenizations of your string with both tokenizers on my end:

# -*- coding: utf-8 -*-
import nltk

nltk.wordpunct_tokenize('müsli pöök rääk'.decode('utf8'))
# output: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

nltk.word_tokenize('müsli pöök rääk'.decode('utf8'))
# output: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']
A君 2025-01-12 01:57:47


The U+FEFF code point is a "ZERO WIDTH NO-BREAK SPACE" character and is not considered a space by the re module, so the PunktWordTokenizer(), which uses the regex r'\w+|[^\w\s]+' with the unicode and dotall flags, recognizes this character as a word. If you don't want to remove the character manually, you could use the following tokenizer:

nltk.RegexpTokenizer(u'\w+|[^\w\s\ufeff]+')
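As a quick usage sketch (not part of the original answer), applying that tokenizer to the decoded sample string from the question should drop the BOM from the output:

import nltk

# Hypothetical usage of the RegexpTokenizer pattern from this answer (Python 2).
tokenizer = nltk.RegexpTokenizer(u'\w+|[^\w\s\ufeff]+')
tokens = tokenizer.tokenize(u'\ufeffm\xfcsli p\xf6\xf6k r\xe4\xe4k')
# expected: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k'] -- the BOM is skipped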