Encoding East Asian languages with Python
This may not really be a Python question so much as a question about text encoding in general. I'm mining tweets from Twitter, and there appears to be a large Japanese user community posting messages in Japanese. When I wrote the tweets to an XML file I encoded them as UTF-8, e.g. tweet = tweet.encode('utf-8'), and none of the Japanese tweets came out as they should have. How should I have encoded them? What was my mistake? And if I were to store the data in a CSV file instead, which encoding should I use?
Comments (2)
Normally you would ask the data source which encoding the data is in rather than guess. That said, Shift-JIS is quite a popular encoding for Japanese text.
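To illustrate why the source encoding matters, here is a minimal Python 3 sketch. The helper name try_decode and the sample string are made up for this example; it only shows that the same Japanese text is a different byte sequence in Shift-JIS and in UTF-8, so decoding with the wrong codec fails.

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


sample = "こんにちは"

# The same five characters become different byte sequences per encoding.
sjis_bytes = sample.encode("shift_jis")
utf8_bytes = sample.encode("utf-8")
print(len(sjis_bytes), len(utf8_bytes))

# Decoding succeeds only when the codec matches the bytes.
print(try_decode(sjis_bytes, "shift_jis") == sample)
print(try_decode(sjis_bytes, "utf-8"))
```

If the decode step returns None (or raises), the bytes are probably not in the encoding you assumed, which is a reason to check what the source reports instead of guessing.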
There should be a way to query the encoding of the tweets when you read them from Twitter. Decode them to Unicode as you read them into your program, then encode them again when you write them out to the XML file. Chinese text, for example, might arrive in the gbk encoding.
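A rough Python 3 sketch of that decode-on-input, encode-on-output pattern follows. It assumes the incoming bytes really are in a known source encoding (gbk here, purely as an example), and the function names decode_tweet, tweets_to_xml and tweets_to_csv are hypothetical helpers, not part of any Twitter library. UTF-8 is used for both outputs as one reasonable choice, since it can represent Japanese and Chinese text alike.

```python
import csv
import io
import xml.etree.ElementTree as ET


def decode_tweet(raw: bytes, source_encoding: str) -> str:
    """Decode raw bytes into a Unicode string using the reported encoding."""
    return raw.decode(source_encoding)


def tweets_to_xml(tweets) -> bytes:
    """Serialize Unicode tweet texts to UTF-8 encoded XML bytes."""
    root = ET.Element("tweets")
    for text in tweets:
        ET.SubElement(root, "tweet").text = text
    return ET.tostring(root, encoding="utf-8")


def tweets_to_csv(tweets) -> bytes:
    """Serialize Unicode tweet texts to UTF-8 encoded CSV bytes, one per row."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    for text in tweets:
        writer.writerow([text])
    return buf.getvalue().encode("utf-8")


# Example input: bytes that are assumed to be gbk-encoded.
raw = "你好世界".encode("gbk")
text = decode_tweet(raw, "gbk")  # bytes -> str happens once, at the boundary

xml_bytes = tweets_to_xml([text])  # str -> bytes happens once, on output
csv_bytes = tweets_to_csv([text])
print(ET.fromstring(xml_bytes)[0].text == text)
```

The point of the design is that the program only ever handles Unicode strings internally; encodings appear only where bytes enter or leave.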