Tokenizing unicode using nltk
I have text files in utf-8 encoding that contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer:
import nltk

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'; raw string so \t isn't read as a tab
text = f.read()
f.close()
items = text.decode('utf8')             # decode the raw bytes to a unicode string
a = nltk.word_tokenize(items)
Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']
The Punkt tokenizer seems to do better:
from nltk.tokenize import PunktWordTokenizer

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = PunktWordTokenizer().tokenize(items)
Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']
There is still a '\ufeff' before the first token that I can't figure out (not that I can't remove it). What am I doing wrong? Help greatly appreciated.
Comments (3)
It's more likely that the \uFEFF char is part of the content read from the file; I doubt it was inserted by the tokeniser. A \uFEFF at the beginning of a file is a deprecated form of the Byte Order Mark; if it appears anywhere else, it is treated as a zero width no-break space. Was the file written by Microsoft Notepad? As the codecs module docs note, try reading your file with codecs.open() instead, using the "utf-8-sig" encoding, which consumes the BOM. Experiment:
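A minimal sketch of that experiment, assuming the file path from the question and a Python 2 / NLTK 2.x setup:

import codecs
import nltk

# "utf-8-sig" strips a leading BOM (U+FEFF) while decoding, if one is present
f = codecs.open(r'C:\Python26\text.txt', encoding='utf-8-sig')
text = f.read()
f.close()

print nltk.word_tokenize(text)
# the first token should no longer start with u'\ufeff'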
You should make sure that you're passing unicode strings to nltk tokenizers. I get the following identical tokenizations of your string with both tokenizers on my end:
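For instance, a sketch of that check; the unicode literal below is the sample string from the question without the BOM, and PunktWordTokenizer is assumed to be available as in NLTK 2.x:

import nltk
from nltk.tokenize import PunktWordTokenizer

items = u'm\xfcsli p\xf6\xf6k r\xe4\xe4k'  # already a unicode string, no BOM

print nltk.word_tokenize(items)
print PunktWordTokenizer().tokenize(items)
# both calls should yield the same tokens, e.g. [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']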
The U+FEFF code is a "ZERO WIDTH NO-BREAK SPACE" character, and the re module does not consider it whitespace, so PunktWordTokenizer(), which uses the regex r'\w+|[^\w\s]+' with the unicode and dotall flags, treats this character as part of a word. If you don't want to remove the character manually, you could use the following tokenizer:
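A sketch of such a tokenizer, assuming nltk's RegexpTokenizer and simply adding \ufeff to the characters excluded by the pattern:

from nltk.tokenize import RegexpTokenizer

# same pattern idea as above, but \ufeff is excluded so the BOM is never captured
tokenizer = RegexpTokenizer(u'\w+|[^\w\s\ufeff]+')
print tokenizer.tokenize(u'\ufeffm\xfcsli p\xf6\xf6k r\xe4\xe4k')
# expected: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']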