Strange tokenization error
I'm trying to tokenize tweets, but I'm running into a problem when looping through them.
For example, this works fine:
import tokenize, cStringIO

text = """This is an example Tweet!"""
# tokenize the sentence
print text
print list(token[1] for token in tokenize.generate_tokens(cStringIO.StringIO(text).readline) if token[1])
But, this does not:
for tweet in tweets:
    text = tweet['tweet']
    # tokenize the sentence
    print text
    print list(token[1] for token in tokenize.generate_tokens(cStringIO.StringIO(text).readline) if token[1])
I get this list back:
Sickipedia is hilarious xD
['S', '\x00', 'i', '\x00', 'c', '\x00', 'k', '\x00', 'i', '\x00', 'p', '\x00', 'e', '\x00', 'd', '\x00', 'i', '\x00', 'a', '\x00', ' ', '\x00', 'i', '\x00', 's', '\x00', ' ', '\x00', 'h', '\x00', 'i', '\x00', 'l', '\x00', 'a', '\x00', 'r', '\x00', 'i', '\x00', 'o', '\x00', 'u', '\x00', 's', '\x00', ' ', '\x00', 'x', '\x00', 'D', '\x00']
When it should read something like:
Sickipedia is hilarious xD
['Sickipedia', 'is', 'hilarious', 'xD']
Any ideas? By the way, I'm using Python with Mongo.
Thanks in advance
3 Answers
The tokenize package in NLTK is a good place to start. To handle some of the unique phenomena and strings found in Twitter data, you can customize this package to suit your needs.
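For instance, newer NLTK releases (3.x) ship a TweetTokenizer class in nltk.tokenize that already handles hashtags, handles, and emoticons. A minimal sketch, assuming NLTK is installed and the text has already been decoded to unicode (see the next answer):

import nltk
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
print tokenizer.tokenize(u"Sickipedia is hilarious xD")
# [u'Sickipedia', u'is', u'hilarious', u'xD']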
Your output suggests your text is encoded in UTF-16. Try printing the repr() of the text (which is a good idea in any case) and you should see the same '\x00's between each character that you see in the tokenized output. You should be able to tell which form of UTF-16 it is from the repr() (even if it doesn't start with the BOM \xff\xfe or \xfe\xff), and then decode it using the decode string method. You won't be able to feed unicode to cStringIO, so you'd have to write another function to replace cStringIO.StringIO(text).readline, or encode it back into a bytestring in a more appropriate encoding.
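For example, a minimal sketch under the assumption that the stored bytestrings are UTF-16-LE without a BOM (the exact form is something you'd confirm from the repr() output):

import tokenize
import cStringIO

raw = tweet['tweet']                 # the raw bytestring from Mongo
print repr(raw)                      # e.g. 'S\x00i\x00c\x00k\x00...'
text = raw.decode('utf-16-le')       # now a unicode object
text = text.encode('utf-8')          # cStringIO needs a bytestring, so re-encode
print list(token[1]
           for token in tokenize.generate_tokens(cStringIO.StringIO(text).readline)
           if token[1])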
Why would you want to tokenize tweets? As it says in the docs, "The tokenize module provides a lexical scanner for Python source code". It doesn't seem likely that many tweets consist of Python source code. If you're just trying to split the text into words, you should be using something like re.split.
But anyway, the reason for the strange results is that your tweet is encoded in UTF-16.
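A minimal sketch of the re.split approach, assuming the text is decoded from UTF-16 first (the 'utf-16-le' form and the split pattern are illustrative assumptions):

import re

# decode first; 'utf-16-le' is an assumption about the stored form
text = tweet['tweet'].decode('utf-16-le')
# split on runs of non-word characters and drop empty strings
words = [w for w in re.split(r'\W+', text, flags=re.UNICODE) if w]
print words  # [u'Sickipedia', u'is', u'hilarious', u'xD']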