Strange tokenization error
I'm trying to tokenize tweets, but I'm running into a problem when looping through them.
For example, this works fine:
import tokenize, cStringIO

text = """This is an example Tweet!"""
# tokenize the sentence
print text
print list(token[1] for token in tokenize.generate_tokens(cStringIO.StringIO(text).readline) if token[1])
But, this does not:
for tweet in tweets:
    text = tweet['tweet']
    # tokenize the sentence
    print text
    print list(token[1] for token in tokenize.generate_tokens(cStringIO.StringIO(text).readline) if token[1])
I get this list back:
Sickipedia is hilarious xD
['S', '\x00', 'i', '\x00', 'c', '\x00', 'k', '\x00', 'i', '\x00', 'p', '\x00', 'e', '\x00', 'd', '\x00', 'i', '\x00', 'a', '\x00', ' ', '\x00', 'i', '\x00', 's', '\x00', ' ', '\x00', 'h', '\x00', 'i', '\x00', 'l', '\x00', 'a', '\x00', 'r', '\x00', 'i', '\x00', 'o', '\x00', 'u', '\x00', 's', '\x00', ' ', '\x00', 'x', '\x00', 'D', '\x00']
When it should read something like:
Sickipedia is hilarious xD
['Sickipedia', 'is', 'hilarious', 'xD']
Any ideas? By the way, I'm using Python with Mongo.
Thanks in advance
3 Answers
The tokenize package in NLTK is a good place to start. To handle some of the unique phenomena and strings found in Twitter data, you can customize this package to suit your needs.
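For instance, newer NLTK releases (3.x) ship a TweetTokenizer class in nltk.tokenize that already handles hashtags, handles, and emoticons. A minimal sketch, assuming NLTK is installed and the text has already been decoded to unicode (see the next answer):

import nltk
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
print tokenizer.tokenize(u"Sickipedia is hilarious xD")
# [u'Sickipedia', u'is', u'hilarious', u'xD']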
Your output suggests your text is encoded in UTF-16. Try printing the repr() of the text (which is a good idea in any case) and you should see the same '\x00's between each character that you see in the tokenized output. You should be able to tell which form of UTF-16 it is from the repr() (even if it doesn't start with the BOM \xff\xfe or \xfe\xff), and then decode it using the decode string method. You won't be able to feed unicode to cStringIO, so you'd have to write another function to replace cStringIO.StringIO(text).readline, or encode it back into a bytestring in a more appropriate encoding.
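For example, a minimal sketch under the assumption that the stored bytestrings are UTF-16-LE without a BOM (the exact form is something you'd confirm from the repr() output):

import tokenize
import cStringIO

raw = tweet['tweet']                 # the raw bytestring from Mongo
print repr(raw)                      # e.g. 'S\x00i\x00c\x00k\x00...'
text = raw.decode('utf-16-le')       # now a unicode object
text = text.encode('utf-8')          # cStringIO needs a bytestring, so re-encode
print list(token[1]
           for token in tokenize.generate_tokens(cStringIO.StringIO(text).readline)
           if token[1])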
Why would you want to tokenize tweets? As it says in the docs, "The tokenize module provides a lexical scanner for Python source code". It doesn't seem likely that many tweets consist of Python source code. If you're just trying to split the text into words, you should be using something like re.split.
But anyway, the reason for the strange results is that your tweet is encoded in UTF-16.
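A minimal sketch of the re.split approach, assuming the text is decoded from UTF-16 first (the 'utf-16-le' form and the split pattern are illustrative assumptions):

import re

# decode first; 'utf-16-le' is an assumption about the stored form
text = tweet['tweet'].decode('utf-16-le')
# split on runs of non-word characters and drop empty strings
words = [w for w in re.split(r'\W+', text, flags=re.UNICODE) if w]
print words  # [u'Sickipedia', u'is', u'hilarious', u'xD']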