Python - Best way to read a file and split lines by a delimiter



What is the best way to read a file and break out the lines by a delimiter? The data returned should be a list of tuples.

Can this method be beaten? Can this be done faster or with less memory?

def readfile(filepath, delim):
    with open(filepath, 'r') as f:
        return [tuple(line.split(delim)) for line in f]
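For example, given a hypothetical two-line file data.csv containing a,b,c and 1,2,3, the call below shows what comes back. Note that str.split does not strip the line terminator, so the last field of each tuple keeps its trailing \n:

rows = readfile('data.csv', ',')
print(rows)
# [('a', 'b', 'c\n'), ('1', '2', '3\n')]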


2 Answers

无法言说的痛 2024-12-16 19:01:50


Your posted code reads the entire file and builds a copy of it in memory, as a single list of the file's contents split into tuples, one tuple per line. Since you're asking how to use less memory, you may only need a generator function:

def readfile(filepath, delim): 
    with open(filepath, 'r') as f: 
        for line in f:
            yield tuple(line.split(delim))

BUT! There is a major caveat! You can only iterate over the tuples returned by readfile once.

lines_as_tuples = readfile(mydata, ',')

for linedata in lines_as_tuples:
    # do something

This is okay so far; a generator and a list look the same here. But let's say your file contains lots of floating point numbers, and your iteration through the file computes the overall average of those numbers. You could use the "# do something" code to accumulate a running sum and count, and then compute the average. But now let's say you want to iterate again, this time to find each value's difference from the average. You'd think you could just add another for loop:

for linedata in lines_as_tuples:
    # do another thing
    # BUT - this loop never does anything because lines_as_tuples has been consumed!

BAM! This is a big difference between generators and lists. At this point in the code now, the generator has been completely consumed - but there is no special exception raised, the for loop simply does nothing and continues on, silently!
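To make the failure mode concrete, here is a minimal sketch of the two-pass scenario described above (the file name numbers.csv and its single-column layout are made up for illustration; readfile is the generator version):

lines_as_tuples = readfile('numbers.csv', ',')

# First pass: accumulate a running sum and count, then compute the average.
total = 0.0
count = 0
for linedata in lines_as_tuples:
    total += float(linedata[0])   # float() tolerates the trailing newline
    count += 1
average = total / count

# Second pass: silently does nothing, because the generator is exhausted.
diffs = [float(linedata[0]) - average for linedata in lines_as_tuples]
print(diffs)   # prints [] -- no error, just an empty result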

In many cases, the list that you would get back is only iterated over once, in which case a conversion of readfile to a generator would be fine. But if what you want is a more persistent list, which you will access multiple times, then just using a generator will give you problems, since you can only iterate over a generator once.

My suggestion? Make readfile a generator, so that in its own little view of the world, it just yields each incremental bit of the file, nice and memory-efficient. Put the burden of retaining the data onto the caller - if the caller needs to refer to the returned data multiple times, then the caller can simply build its own list from the generator - easily done in Python using list(readfile('file.dat', ',')).
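Under the same hypothetical assumptions as the sketch above, the caller-side fix is a one-liner:

rows = list(readfile('numbers.csv', ','))   # materialize the generator once

for row in rows:
    pass   # first pass
for row in rows:
    pass   # second pass now works, because rows is a real list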

影子是时光的心 2024-12-16 19:01:50


Memory use could be reduced by using a generator instead of a list and a list instead of a tuple, so you don't need to read the whole file into memory at once:

def readfile(path, delim):
    return (ln.split(delim) for ln in open(path, 'r'))

You'll have to rely on the garbage collector to close the file, though. As for returning tuples: don't do it if it isn't necessary, since lists are marginally faster, constructing the tuple has a small extra cost, and (importantly) your lines will be split into variable-size sequences, which are conceptually lists.
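If you want both laziness and deterministic closing, one option (a sketch, not part of this answer's original code) is to fall back to a generator function, as in the first answer, so the with block stays alive while values are yielded:

def readfile(path, delim):
    # The file is closed when the generator is exhausted or when the
    # caller invokes .close() on it, instead of waiting for the GC.
    with open(path, 'r') as f:
        for ln in f:
            yield ln.split(delim)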

Speed can be improved only by going down to the C/Cython level, I guess; str.split is hard to beat since it's written in C, and list comprehensions are AFAIK the fastest loop construct in Python.

More importantly, this is very clear and Pythonic code. I wouldn't try optimizing this apart from the generator bit.
