Working out a file's encoding: I know the string and I know the character, so what is the encoding?
I'm adding data from a CSV file into a database. If I open the CSV file, some of the entries contain bullet points - I can see them. The file command says it is encoded as ISO-8859.
$ file data_clean.csv
data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators
I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.
row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
print row[4]
description = row[4].encode("UTF-8")
print description
This gives me the following:
'\xa5 Research and insight \n\xa5 Media and communications'
¥ Research and insight
¥ Media and communications
Why is the \xa5 bullet character converting as a yen symbol?
I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
You can easily write one in Python.
(Examples use 3.x syntax.)
So if you know that your bullet point is U+2022, then
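One way to sketch such a tool, which is not necessarily the code this answer originally showed: try every codec that ships with Python and keep the ones that turn the raw bytes into the character you expect. The function name `find_encodings` is made up for illustration, and the codec list is taken from the standard library's alias table, which is an assumption about what counts as "every" encoding.

```python
import encodings.aliases


def find_encodings(raw: bytes, expected: str) -> list:
    """Return the stdlib codec names that decode `raw` to exactly `expected`."""
    matches = []
    # The alias table maps alias -> codec name; its values cover the bundled codecs.
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except Exception:
            # Codecs fail in different ways here: not a text encoding,
            # not available on this platform, or the bytes are invalid.
            continue
    return matches


# Byte 0xA5 from the file, and the bullet (U+2022) it is known to represent.
print(find_encodings(b"\xa5", "\u2022"))
```

The result should include mac_roman. Several codecs can map the same byte to the same character, so a longer known sample narrows the list further.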
You could try
if you know it is indeed iso-latin-1
Although in iso-latin-1 \xA5 is indeed a ¥
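To make that concrete, here is a small Python 3 sketch that decodes the sample bytes from the question twice. It assumes the file is really Mac OS Roman, as another answer suggests; the variable names are illustrative only.

```python
import unicodedata

# Sample bytes as printed in the question.
raw = b"\xa5 Research and insight \n\xa5 Media and communications"

as_latin1 = raw.decode("iso-8859-1")  # what the question's code does
as_mac = raw.decode("mac_roman")      # assumption: the file is Mac OS Roman

print(unicodedata.name(as_latin1[0]))  # YEN SIGN
print(unicodedata.name(as_mac[0]))     # BULLET

# Re-encode the correctly decoded text as UTF-8 for the database.
description = as_mac.strip().encode("utf-8")
```

Both decodes succeed without error because every byte value is valid in both encodings, which is why the wrong choice goes unnoticed until the text is displayed.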
Edit: Actually, this seems to be a problem on Mac when using Word or similar with Arial (?) and then printing or converting to PDF: some issue with fonts and the like. Maybe you need to explicitly massage the file first. Sound familiar?