Python: How to ignore #comment lines when reading in a file
In Python, I have just read a line from a text file and I'd like to know how to write code that ignores comments with a hash # at the beginning of the line.
I think it should be something like this:
for
if line !contain #
then ...process line
else end for loop
But I'm new to Python and I don't know the syntax.
You can use startswith() to check whether a line begins with '#' and skip it if so.
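A minimal sketch of that suggestion, assuming the lines are already available as a list (the sample lines and the name kept are illustrative, not taken from the original answer):

```python
# Sample input standing in for lines read from a text file (illustrative).
lines = ["# a comment\n", "first value\n", "#another comment\n", "second value\n"]

kept = []
for line in lines:
    # startswith() is True only when '#' is the very first character.
    if line.startswith("#"):
        continue  # skip this comment line and move on to the next one
    kept.append(line.rstrip("\n"))  # stand-in for "process line"

print(kept)
```

Note that the loop uses continue rather than ending the loop, so lines after a comment are still processed.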
I recommend you don't ignore the whole line when you see a # character; just ignore the rest of the line. You can do that easily with a string method called partition().
partition() returns a tuple: everything before the partition string, the partition string, and everything after the partition string. So, by indexing with [0] we take just the part before the partition string.
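As a small sketch of the partition() approach described above (the helper name strip_comment is an assumption made for illustration):

```python
def strip_comment(line):
    """Return the part of line before the first '#' (the whole line if none)."""
    # partition() returns (before, separator, after); [0] keeps 'before'.
    return line.partition("#")[0]

print(strip_comment("x = 1  # set x"))
print(strip_comment("no comment here"))
print("a#b".partition("#"))
```

When the line has no '#', partition() returns the whole string as the first element, so such lines pass through unchanged.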
If you are using a version of Python that doesn't have partition(), you can use split() instead: split the string on the '#' character, then keep everything before the split. Passing 1 as the second argument makes the .split() method stop after one split; since we are just grabbing the 0th substring (by indexing with [0]), you would get the same answer without the 1 argument, but this might be a little bit faster. (Simplified from my original code thanks to a comment from @gnr. My original code was messier for no good reason; thanks, @gnr.)
You could also just write your own version of partition(), for example as a function called part().
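The two fallbacks described above might look like the following. This is a sketch, not the answer's original code: the name before_hash is made up, and the body of part() is one plausible way to imitate partition().

```python
def before_hash(line):
    """Keep everything before the first '#', using split() with a limit of 1."""
    return line.split("#", 1)[0]

def part(s, sep):
    """Hand-written stand-in for str.partition(): always returns a 3-tuple."""
    i = s.find(sep)
    if i == -1:
        # Separator absent: mirror partition() and return the string plus two empties.
        return (s, "", "")
    return (s[:i], sep, s[i + len(sep):])

print(before_hash("a = 1 # first # second"))
print(part("a#b#c", "#"))
```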
@dalle noted that '#' can appear inside a string. It's not that easy to handle this case correctly, so I just ignored it, but I should have said something.
If your input file has simple enough rules for quoted strings, this isn't hard. It would be hard if you accepted any legal Python quoted string, because there are single-quoted, double-quoted, multiline quotes with a backslash escaping the end-of-line, triple quoted strings (using either single or double quotes), and even raw strings! The only possible way to correctly handle all that would be a complicated state machine.
But if we limit ourselves to just a simple quoted string, we can handle it with a simple state machine. We can even allow a backslash-quoted double quote inside the string.
I didn't really want to get this complicated in a question tagged "beginner" but this state machine is reasonably simple, and I hope it will be interesting.
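A simple state machine of the kind described could be sketched like this. It is an illustration under stated assumptions rather than the answer's original code: only double-quoted strings are recognized, a backslash inside a string quotes the next character, and the function name is made up.

```python
def strip_comment_quoted(line):
    """Drop a trailing '#' comment, ignoring '#' inside double-quoted strings."""
    in_string = False   # are we currently inside "..."?
    escaped = False     # was the previous character a backslash inside a string?
    for i, ch in enumerate(line):
        if in_string:
            if escaped:
                escaped = False      # this character was quoted by the backslash
            elif ch == "\\":
                escaped = True       # the next character is taken literally
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True         # opening quote
        elif ch == "#":
            return line[:i]          # comment starts here, outside any string
    return line                      # no comment found

print(strip_comment_quoted('name = "a#b"  # real comment'))
```

Single quotes, triple quotes and raw strings are deliberately not handled; each would need extra states.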
I'm coming at this late, but the problem of handling shell-style (or Python-style) # comments is a very common one. I've been using a small snippet of code almost every time I read a text file.
The problem is that it doesn't handle quoted or escaped comments properly. But it works for simple cases and is easy.
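A snippet of this simple kind might look like the sketch below. The function name and the exact split-then-strip steps are assumptions, since the answer's own code is not preserved here.

```python
def meaningful_lines(lines):
    """Yield each line with its '#' comment and surrounding whitespace removed."""
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # nothing left: blank line or comment-only line
        yield line

sample = ["# header\n", "\n", "alpha = 1  # note\n", "  beta\n"]
print(list(meaningful_lines(sample)))

# The limitation mentioned above: a '#' inside quotes is cut as well.
print(list(meaningful_lines(['url = "a#b"\n'])))
```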
A more robust solution is to use shlex. This approach not only handles quotes and escapes properly, it adds a lot of cool functionality (like the ability to have files source other files if you want). I haven't tested it for speed on large files, but it is zippy enough for small stuff.
The common case, when you're also splitting each input line into fields (on whitespace), is even simpler.
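For the field-splitting case, a sketch using the standard library's shlex.split() with comments=True could look like this (the function name and the sample lines are made up for illustration):

```python
import shlex

def fields_per_line(lines):
    """Yield the whitespace-separated fields of each line, minus '#' comments."""
    for line in lines:
        # comments=True makes shlex treat an unquoted '#' as a comment start.
        fields = shlex.split(line, comments=True)
        if not fields:
            continue  # blank line or comment-only line
        yield fields

sample = ['# a comment\n', '\n', 'host 10.0.0.1 "my # box"  # trailing\n']
print(list(fields_per_line(sample)))
```

Because shlex follows shell quoting rules, a '#' inside quotes is kept as data, and a line with an unbalanced quote raises ValueError.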
The shortest possible form loops directly over open(filename) and skips every line for which startswith('#') is true.
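A sketch of that shortest form follows. The temporary file and its contents exist only to make the example runnable and are not part of the original answer.

```python
import os
import tempfile

# Create a small throwaway file so the loop has something to read (illustrative).
fd, filename = tempfile.mkstemp(suffix=".txt")
tmp = os.fdopen(fd, "w")
tmp.write("# comment\nkeep me\n#also a comment\nkeep me too\n")
tmp.close()

kept = []
for line in open(filename):          # note: the file is never explicitly closed
    if line.startswith("#"):
        continue
    kept.append(line.rstrip("\n"))   # stand-in for real processing

os.remove(filename)
print(kept)
```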
The startswith() method on a string returns True if the string you call it on starts with the string you passed in.
While this is okay in some circumstances like shell scripts, it has two problems. First, it doesn't specify how to open the file. The default mode for opening a file is 'r', which means 'read the file in text mode'; 'rt' is the same mode spelled out in full, and 'rb' would select binary mode. Since you're expecting a text file, opening it with 'rt' states that intent explicitly. Under Python 2 the text/binary distinction is irrelevant on UNIX-like operating systems, but it's important on Windows (and on pre-OS X Macs).
The second problem is the open file handle. The open() function returns a file object, and it's considered good practice to close files when you're done with them. To do that, call the close() method on the object. Now, Python will probably do this for you, eventually; in CPython objects are reference-counted, and when an object's reference count goes to zero its destructor (a special method called __del__) is called and the object is freed. Note that I said probably: Python does not guarantee that destructors are called for objects that still exist when the program finishes. I guess it's in a hurry!
For short-lived programs like shell scripts, and particularly for file objects, this doesn't matter. Your operating system will automatically clean up any file handles left open when the program finishes. But if you opened the file, read the contents, then started a long computation without explicitly closing the file handle first, Python is likely to leave the file handle open during your computation. And that's bad practice.
This version will work in any 2.x version of Python, and fixes both the problems I discussed above: it opens the file explicitly in 'rt' mode and closes it when done.
This is the best general form for older versions of Python.
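A sketch of such a version is shown below. The try/finally wrapper is an addition made here so that close() runs even if processing fails; the function name and the throwaway demo file are likewise assumptions.

```python
import os
import tempfile

def noncomment_lines(filename):
    """Return the lines of filename that do not start with '#'."""
    kept = []
    f = open(filename, "rt")   # mode spelled out explicitly
    try:
        for line in f:
            if line.startswith("#"):
                continue
            kept.append(line)  # stand-in for real processing
    finally:
        f.close()              # runs even if the loop raises
    return kept, f.closed

# Demo on a throwaway file with made-up contents.
fd, path = tempfile.mkstemp(suffix=".txt")
tmp = os.fdopen(fd, "w")
tmp.write("# settings\nname = demo\n# end\n")
tmp.close()
result, was_closed = noncomment_lines(path)
os.remove(path)
print(result, was_closed)
```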
As suggested by steveha, using the "with" statement is now considered best practice. If you're using Python 2.6 or above, you should open the file in a "with" block.
The "with" statement will clean up the file handle for you.
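With the "with" statement, the same loop can be sketched as follows (the function name and the throwaway demo file are assumptions made for illustration):

```python
import os
import tempfile

def noncomment_lines(filename):
    """Return the lines of filename that do not start with '#'."""
    kept = []
    with open(filename, "rt") as f:   # f is closed when the block exits
        for line in f:
            if line.startswith("#"):
                continue
            kept.append(line)         # stand-in for real processing
    return kept, f.closed

# Demo on a throwaway file with made-up contents.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("# settings\nname = demo\n# end\n")
result, was_closed = noncomment_lines(path)
os.remove(path)
print(result, was_closed)
```

The file is closed as soon as the block is left, whether normally or through an exception.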
In your question you said "lines that start with #", so that's what I've shown you here. If you want to filter out lines that start with optional whitespace and then a '#', you should strip the whitespace before looking for the '#'. In that case, change the test from line.startswith('#') to line.lstrip().startswith('#').
In Python, strings are immutable, so this doesn't change the value of line. The lstrip() method returns a copy of the string with all its leading whitespace removed.
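A short sketch of the lstrip() variant, which also shows that the original string is left untouched (the sample lines are made up):

```python
lines = ["  # indented comment\n", "#comment\n", "  data\n"]

# Strip leading whitespace only for the test; the kept line is unchanged.
kept = [line for line in lines if not line.lstrip().startswith("#")]

line = "   # indented"
stripped = line.lstrip()   # a new string; 'line' itself is not modified

print(kept)
print(repr(line), repr(stripped))
```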
I've found recently that a generator function does a great job of this. I've used similar functions to skip comment lines, blank lines, etc.
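I define a generator function that takes the file and yields only the lines that are not comments.
That way, I can just loop over the generator instead of over the file itself.
This is reusable across all my code, and I can add any additional handling, logging, etc. that I need.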
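A generator of this kind might be sketched as follows. The name skip_comments and the choice to drop blank lines as well are assumptions, and io.StringIO stands in for an open file.

```python
import io

def skip_comments(lines):
    """Yield only the lines that are neither blank nor '#' comments."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        yield line

fake_file = io.StringIO("# c\n\nvalue 1\n   # indented\nvalue 2\n")
for line in skip_comments(fake_file):
    print(line, end="")
```

Any iterable of lines works as input, so the same function serves files, lists and other generators.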
I know that this is an old thread, but this is a generator function that I use for my own purposes. It strips comments no matter where they appear in the line, as well as stripping leading/trailing whitespace and blank lines. Given a source text such as a hosts file, it yields only the meaningful content of each line.
The normal use case will be to strip the comments from a file (e.g., a hosts file). If this is the case, the generator is fed the open file object rather than a demo string.
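Since the answer's own code and sample text are not preserved here, the following is only a sketch of a generator with the described behaviour. The function name, the marker parameter and the hosts-style sample data are all made up for illustration.

```python
def strip_comments(lines, marker="#"):
    """Yield each line minus its comment and surrounding whitespace, skipping empties."""
    for line in lines:
        text = line.partition(marker)[0].strip()
        if text:
            yield text

sample = """\
# Hosts file (made-up sample data)
127.0.0.1   localhost   # loopback

   10.0.0.5 buildbox
"""
cleaned = list(strip_comments(sample.splitlines()))
print(cleaned)
```

To process a real file, the same generator can be given the file object from a "with open(...)" block instead of sample.splitlines().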
A more compact version of the filtering puts an expression of the form (l for ... ) directly in the for statement. This is called a "generator expression"; it acts here as a wrapping iterator that filters out all unneeded lines from the file while iterating over it. Don't confuse it with the same thing in square brackets, [l for ... ], which is a "list comprehension" that will first read all the lines from the file into memory and only then start iterating over them.
Sometimes you might want to have it less one-liney and more readable, building up the filters one assignment at a time.
All the filters will be executed on the fly in one iteration.
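Both styles can be sketched as follows. io.StringIO stands in for an open file, and the particular filters chosen are illustrative assumptions.

```python
import io

source = io.StringIO("# header\nalpha\n\n# note\nbeta\n")  # stand-in for an open file

# One-liner: the generator expression sits inside the for statement.
compact = []
for line in (l for l in source if not l.startswith("#")):
    compact.append(line)

source.seek(0)  # rewind so the same data can be read again

# Step by step: each assignment lazily wraps the previous iterator.
lines = (l.strip() for l in source)                    # trim whitespace
lines = (l for l in lines if l)                        # drop blank lines
lines = (l for l in lines if not l.startswith("#"))    # drop comments
stepwise = list(lines)

print(compact)
print(stepwise)
```

Nothing is read from the source until the final loop or list() call pulls items through the chain.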
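Use a regex such as re.compile(r"^(?:\s+)*#|(?:\s+)") to skip blank lines and comments. Note that, as written, the second alternative matches any line that begins with whitespace, so indented data lines are skipped too; a pattern like r"^\s*(?:#|$)" matches only comment lines and blank lines.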
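A sketch of the regex approach, comparing the pattern as posted with a tighter alternative. The tighter pattern and all names here are assumptions for illustration, not part of the original answer.

```python
import re

posted = re.compile(r"^(?:\s+)*#|(?:\s+)")   # pattern as given in the answer
tighter = re.compile(r"^\s*(?:#|$)")         # comment lines and blank lines only

def keep(lines, pattern):
    """Return the lines that the pattern does not match at their start."""
    return [line for line in lines if not pattern.match(line)]

sample = ["# comment\n", "\n", "   # indented comment\n", "data\n", "  indented data\n"]
print(keep(sample, posted))
print(keep(sample, tighter))
```

The posted pattern drops the indented data line because its second alternative matches any leading whitespace.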
I tend to use a simple check that ignores the whole line, though the answer that uses partition() has my upvote, as it can keep any information from before the #.
A good way to get rid of comments is one that works both for inline comments and for comments that take up a whole line.