Working out a file's encoding: I know the string, I know the character, what is the encoding?

Posted 2024-09-14 09:13:56

I'm adding data from a csv file into a database. If I open the CSV file, some of the entries contain bullet points - I can see them. file says it is encoded as ISO-8859.

    $ file data_clean.csv
    data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators

I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.

    row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
    print row[4]    
    description = row[4].encode("UTF-8")
    print description

This gives me the following:

    '\xa5 Research and insight \n\xa5 Media and communications'
    ¥ Research and insight 
    ¥ Media and communications

Why is the \xa5 bullet character converting as a yen symbol?

I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.

More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?


策马西风 2024-09-21 09:13:56

I don't know of any general tool, but this Wikipedia page (linked from the page on code page 1252) shows that A5 is a bullet point in the Mac OS Roman code page.
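One way to check this against the data in the question is to decode the same bytes both ways in Python 3. This is only a sketch: the byte string is copied from the output shown in the question, and `mac_roman` is the name of Python's codec for Mac OS Roman.

```python
# Sample bytes copied from the repr shown in the question.
raw = b"\xa5 Research and insight \n\xa5 Media and communications"

# ISO-8859-1 maps byte 0xA5 to U+00A5 (yen sign).
as_latin1 = raw.decode("iso-8859-1")

# Mac OS Roman maps byte 0xA5 to U+2022 (bullet).
as_mac_roman = raw.decode("mac_roman")

# Re-encode the correctly decoded text for a UTF-8 database.
description = as_mac_roman.encode("utf-8")

print(as_latin1)
print(as_mac_roman)
```

If the second printout shows bullets where the first shows yen signs, the file was most likely saved as Mac OS Roman rather than ISO-8859-1.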


很酷又爱笑 2024-09-21 09:13:56

More generally, is there a tool where
you can specify (i) string (ii) known
character, and find out the encoding?

You can easily write one in Python.
(Examples use 3.x syntax.)

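    from encodings.aliases import aliases

    # Skip 'mbcs' (Windows-only) and 'tactis', which are not generally available.
    ENCODINGS = set(aliases.values()) - {'mbcs', 'tactis'}

    def _decode(data, encoding):
        """Return data decoded with encoding, or None if that fails."""
        try:
            return data.decode(encoding)
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes are invalid in this encoding.
            # LookupError: the codec is unavailable or is not a text encoding.
            return None

    def possible_encodings(encoded, decoded):
        """Return the encodings under which encoded decodes to decoded."""
        return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}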

So if you know that your bullet point is U+2022, then

    >>> possible_encodings(b'\xA5', '\u2022')
    {'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
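If you are not sure which character a byte was meant to be, a related trick is to list every distinct non-ASCII byte in a sample and see how a few candidate codecs render it. The following is a rough sketch and not part of the answer above; `render_high_bytes` and the candidate list are names made up for illustration, and decoding one byte at a time only makes sense for single-byte encodings.

```python
def render_high_bytes(data, candidates):
    """Map each distinct non-ASCII byte in data to its rendering per codec."""
    table = {}
    for value in sorted(set(data)):
        if value < 0x80:
            continue  # skip ASCII bytes; only high bytes are of interest here
        byte = bytes([value])
        row = {}
        for enc in candidates:
            try:
                row[enc] = byte.decode(enc)
            except (UnicodeError, LookupError):
                row[enc] = None  # byte is invalid here or the codec is unavailable
        table[value] = row
    return table


# Sample bytes copied from the question.
sample = b"\xa5 Research and insight \n\xa5 Media and communications"
for value, row in render_high_bytes(sample, ["latin_1", "cp1252", "mac_roman"]).items():
    print(hex(value), row)
```

Scanning the printed rows by eye for the rendering that matches what you see in the original document is often enough to pick the right codec.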


峩卟喜欢 2024-09-21 09:13:56

You could try

    iconv -f latin1 -t utf8 data_clean.csv

if you know the file really is ISO Latin-1. However, in ISO Latin-1, \xA5 is indeed ¥.

Edit: Actually, this seems to be a problem on the Mac when using Word or a similar application with Arial (?) and then printing or converting to PDF; it has something to do with fonts. You may need to fix up the file explicitly first. Does that sound familiar?
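For comparison, the same kind of conversion can be sketched in Python 3 with the source encoding as a parameter, so that a Mac encoding such as `mac_roman` can be tried instead of Latin-1. The function name `transcode` and the output file name in the usage note are hypothetical.

```python
def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Rewrite a text file in a different encoding."""
    # newline="" disables newline translation on read and write, so mixed
    # CRLF/LF line endings pass through unchanged.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

A call such as `transcode("data_clean.csv", "data_clean_utf8.csv", "mac_roman")` would then write a UTF-8 copy, leaving the original file untouched.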
