Python open("x", "r") function: how do I know or control which encoding the file is supposed to have?

Posted on 2024-11-04 13:33:06

If a python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?

Note that since I'm executing this script from my own program, if there is any way to control this through environment variables, then that is good enough for me.

This is Python 2.7 by the way.

The code in question comes from Mercurial, which can be given a list of files (to add to the repository, say) through a file on disk instead of on the command line.

So basically, instead of this:

hg add A B C

I can write out A, B and C to a file, with newlines between each, and then execute the following:

hg add listfile:input.txt

The code that ends up reading this file is this:

files = open(name, 'r').read().split(delimiter)

Hence my question. The answer I was given on IRC when I asked which encoding I should use was this:

it is the same encoding than the one you use on command line when passing a file argument

I take this to mean that it is the same encoding I "use" when I execute Mercurial (hg). I have no idea which encoding that is, since I just hand everything to the .NET Process object, so I am asking here.

追风人 2024-11-11 13:33:06

You can't. Reading a file is independent of its encoding; you'll need to know the encoding in advance in order to properly interpret the bytes you read in.

For example, if you know the file is encoded in UTF-8:

with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig')    # -sig deals with BOM, if present

Or if you know the file is ASCII only:

with open('filename', 'r') as f:
    contents = f.read()    # results in a str object (a byte string in Python 2)
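
To make this concrete, here is a minimal sketch (not part of the original answer) of how the same bytes turn into different text depending on the codec chosen, and of what the -sig variant does with a byte order mark. The sample bytes are made up for illustration, and the snippet is written so it should behave the same on Python 2.7 and Python 3.

```python
# Illustrative sketch: the bytes on disk carry no encoding label,
# so the reader has to pick the codec.
import codecs

# Hypothetical file contents: the word "cafe" with an accented e, as UTF-8.
raw = b'caf\xc3\xa9'

as_utf8 = raw.decode('utf-8')      # \xc3\xa9 is read as one character
as_latin1 = raw.decode('latin-1')  # \xc3\xa9 is read as two characters

# With a byte order mark in front, plain 'utf-8' keeps it as U+FEFF,
# while 'utf-8-sig' strips it.
with_bom = codecs.BOM_UTF8 + raw
kept = with_bom.decode('utf-8')
stripped = with_bom.decode('utf-8-sig')
```

Neither decode call fails here, which is the point: a wrong guess often produces garbage silently instead of an error.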

If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
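
chardet guesses statistically; as a much cruder stand-in that needs only the standard library, one could simply try a short list of candidate encodings in order. This is a sketch under assumptions: the function name guess_decode and the candidate list are hypothetical, not something from chardet or from the original answer.

```python
def guess_decode(raw, candidates=('utf-8-sig', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes raw.

    Decoding without an error does not prove the guess is right;
    'latin-1' in particular accepts every possible byte sequence.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, try the next one
    raise ValueError('none of the candidate encodings fit')


# Usage with made-up byte strings:
text_a, enc_a = guess_decode(b'caf\xc3\xa9')  # valid UTF-8
text_b, enc_b = guess_decode(b'caf\xe9')      # not valid UTF-8
```

The order of the candidates matters: strict codecs such as UTF-8 should come first, and permissive ones such as latin-1 last, since a permissive codec never reports a mismatch.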

UPDATE:

I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)

The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but unlikely I think). So you'll want to make a text file that contains only ASCII (codepoint < 128) characters, and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate considering that Mercurial deals with filenames, which can contain Unicode characters.
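
Under that assumption (ASCII only), a list file could be produced along these lines. This is a sketch, not Mercurial's own tooling: write_ascii_listfile is a hypothetical helper, it expects the names as text (unicode) strings, and it assumes the delimiter is a newline, as the question describes. It refuses non-ASCII names instead of writing bytes the reading side might misinterpret.

```python
def write_ascii_listfile(path, names):
    """Write one file name per line, failing if any name is not pure ASCII."""
    encoded = []
    for name in names:
        # .encode('ascii') raises UnicodeEncodeError for code points >= 128.
        encoded.append(name.encode('ascii'))
    # Binary mode avoids newline translation; no trailing newline is written,
    # on the assumption that an empty last entry is unwanted after split().
    with open(path, 'wb') as f:
        f.write(b'\n'.join(encoded))
```

A call such as write_ascii_listfile('input.txt', [u'A', u'B', u'C']) would then be followed by hg add listfile:input.txt.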
