如何在Python中廉价地获取大文件的行数

发布于 2024-07-19 13:52:49 字数 177 浏览 7 评论 0 原文

如何以最节省内存和时间的方式获取大文件的行数?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

How do I get a line count of a large file in the most memory- and time-efficient manner?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(30

吃不饱 2024-07-26 13:52:49

一行,比OP的for循环更快(尽管不是最快)并且非常简洁:

num_lines = sum(1 for _ in open('myfile.txt'))

您还可以通过使用提高速度(和鲁棒性) rbU 模式并将其包含在 with 块中以关闭文件:

with open("myfile.txt", "rbU") as f:
    num_lines = sum(1 for _ in f)

注意 中的 U rbU 模式自 Python 3.3 及更高版本起已被弃用,因此我们应该使用 rb 而不是 rbU (它已在 Python 3.11)。

One line, faster than the for loop of the OP (although not the fastest) and very concise:

num_lines = sum(1 for _ in open('myfile.txt'))

You can also boost the speed (and robustness) by using rbU mode and include it in a with block to close the file:

with open("myfile.txt", "rbU") as f:
    num_lines = sum(1 for _ in f)

Note: The U in rbU mode is deprecated since Python 3.3 and above, so iwe should use rb instead of rbU (and it has been removed in Python 3.11).

懒猫 2024-07-26 13:52:49

没有比这更好的了。

毕竟,任何解决方案都必须读取整个文件,计算出有多少个 \n ,然后返回该结果。

您是否有更好的方法来做到这一点而不读取整个文件? 不确定...最好的解决方案总是受 I/O 限制,您能做的最好的事情就是确保不使用不必要的内存,但看起来您已经涵盖了这一点。

[2023 年 5 月编辑]

正如许多其他答案中所评论的那样,Python 3 中有更好的替代方案。 for 循环并不是最有效的。 例如,使用 mmap 或缓冲区更有效。

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.

[Edit May 2023]

As commented in many other answers, in Python 3 there are better alternatives. The for loop is not the most efficient. For example, using mmap or buffers is more efficient.

邮友 2024-07-26 13:52:49

我相信内存映射文件将是最快的解决方案。 我尝试了四个函数:OP发布的函数(opcount); 对文件中各行的简单迭代 (simplecount); 带有内存映射字段 (mmap) 的 readline (mapcount); 以及 Mykola Kharechko 提供的缓冲区读取解决方案 (bufcount)。

我运行每个函数五次,并计算了 120 万行文本文件的平均运行时间。

Windows XP、Python 2.5、2 GB RAM、2 GHz AMD 处理器

以下是我的结果:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

数字对于Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

所以缓冲区读取策略对于Windows/Python 2.6来说似乎是最快的

下面是代码:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))

I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped filed (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
救星 2024-07-26 13:52:49

所有这些解决方案都忽略了一种使其运行速度更快的方法,即使用无缓冲(原始)接口、使用字节数组以及进行自己的缓冲。 (这只适用于Python 3。在Python 2中,默认情况下可能会或可能不会使用原始接口,但在Python 3中,您将默认使用Unicode。)

使用计时工具的修改版本,我相信以下内容代码比提供的任何解决方案都更快(并且稍微更Pythonic):

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

使用单独的生成器函数,运行速度更快:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum(buf.count(b'\n') for buf in f_gen)

这可以使用itertools内联生成器表达式完全完成,但看起来很奇怪:

from itertools import (takewhile, repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum(buf.count(b'\n') for buf in bufgen)

这里我的时间安排是:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more Pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum(buf.count(b'\n') for buf in f_gen)

This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile, repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum(buf.count(b'\n') for buf in bufgen)

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46
水溶 2024-07-26 13:52:49

您可以执行子进程并运行 wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
通知家属抬走 2024-07-26 13:52:49

经过 perfplot 分析后,我们必须推荐缓冲读取解决方案,

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

它速度快且内存效率高。 大多数其他解决方案的速度要慢 20 倍左右。

输入图像描述这里


重现情节的代码:

import mmap
import subprocess
from functools import partial

import perfplot


def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname


def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1


def sum1(fname):
    return sum(1 for _ in open(fname))


def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)

    lines = 0
    while buf.readline():
        lines += 1
    return lines


def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines


def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines


def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])


def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count


def read_count(fname):
    return open(fname).read().count("\n")


b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()

After a perfplot analysis, one has to recommend the buffered read solution

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

It's fast and memory-efficient. Most other solutions are about 20 times slower.

enter image description here


Code to reproduce the plot:

import mmap
import subprocess
from functools import partial

import perfplot


def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname


def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1


def sum1(fname):
    return sum(1 for _ in open(fname))


def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)

    lines = 0
    while buf.readline():
        lines += 1
    return lines


def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines


def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines


def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])


def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count


def read_count(fname):
    return open(fname).read().count("\n")


b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()
嗼ふ静 2024-07-26 13:52:49

类似于这个答案的单行 Bash 解决方案,使用现代的 subprocess.check_output 函数:

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

A one-line Bash solution similar to this answer, using the modern subprocess.check_output function:

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
等你爱我 2024-07-26 13:52:49

这是一个使用多处理库在机器/核心之间分配行计数的 Python 程序。 我的测试使用 8 核 Windows 64 位服务器将 2000 万行文件的计数从 26 秒缩短到 7 秒。 注意:不使用内存映射会使速度变慢。

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel(logging.INFO)
    logger.handlers.append(logging.StreamHandler())
    logger.handlers[0].setFormatter(logging.Formatter(console_format, '%d/%m/%y %H:%M:%S'))

def getFileLineCount(queues, pid, processes, file1):
    init_logger(pid)
    logging.info('start')

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap(physical_file.fileno(), 0, access=mmap.ACCESS_READ)

    # Work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    # Get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    # Find where to start
    if pid > 0:
        m1.seek(seekStart)
        # Read next line
        l1 = m1.readline()  # Need to use readline with memory mapped files
        seekStart = m1.tell()

    # Tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put(seekStart)
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek(seekStart)
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info('done')
    # Add up the results
    if pid == 0:
        for p in range(1, processes):
            lines += queues[0].get()
        queues[0].put(lines) # The total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger('main')
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal('parameters required: file-name [processes]')
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues = [] # A queue for each process
    for pid in range(processes):
        queues.append(multiprocessing.Queue())
    jobs = []
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process(target = getFileLineCount, args=(queues, pid, processes, file_name,))
        p.start()
        jobs.append(p)

    jobs[0].join() # Wait for counting to finish
    lines = queues[0].get()

    logging.info('finished {} Lines:{}'.format( time.time() - t, lines))

Here is a Python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20 million line file from 26 seconds to 7 seconds using an 8-core Windows 64-bit server. Note: not using memory mapping makes things much slower.

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel(logging.INFO)
    logger.handlers.append(logging.StreamHandler())
    logger.handlers[0].setFormatter(logging.Formatter(console_format, '%d/%m/%y %H:%M:%S'))

def getFileLineCount(queues, pid, processes, file1):
    init_logger(pid)
    logging.info('start')

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap(physical_file.fileno(), 0, access=mmap.ACCESS_READ)

    # Work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    # Get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    # Find where to start
    if pid > 0:
        m1.seek(seekStart)
        # Read next line
        l1 = m1.readline()  # Need to use readline with memory mapped files
        seekStart = m1.tell()

    # Tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put(seekStart)
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek(seekStart)
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info('done')
    # Add up the results
    if pid == 0:
        for p in range(1, processes):
            lines += queues[0].get()
        queues[0].put(lines) # The total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger('main')
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal('parameters required: file-name [processes]')
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues = [] # A queue for each process
    for pid in range(processes):
        queues.append(multiprocessing.Queue())
    jobs = []
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process(target = getFileLineCount, args=(queues, pid, processes, file_name,))
        p.start()
        jobs.append(p)

    jobs[0].join() # Wait for counting to finish
    lines = queues[0].get()

    logging.info('finished {} Lines:{}'.format( time.time() - t, lines))
关于从前 2024-07-26 13:52:49

我将使用 Python 的文件对象方法 readlines,如下所示:

with open(input_file) as foo:
    lines = len(foo.readlines())

这将打开文件,在文件中创建行列表,计算列表的长度,将其保存到变量并再次关闭文件。

I would use Python's file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.

虫児飞 2024-07-26 13:52:49

这是我发现使用纯 Python 最快的事情。

您可以通过设置 buffer 使用任意数量的内存,尽管 2**16 似乎是我计算机上的最佳选择。

from functools import partial

buffer=2**16
with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

我在这里找到了答案 为什么要阅读C++ 中来自 stdin 的行比 Python 慢得多? 并对其进行了一点点调整。 尽管 wc -l 仍然比其他任何方法快大约 75%,但对于了解如何快速计算行数来说,这是一本非常好的读物。

This is the fastest thing I have found using pure Python.

You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

buffer=2**16
with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It’s a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.

还给你自由 2024-07-26 13:52:49
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines
野稚 2024-07-26 13:52:49

这是我使用的,它看起来很干净:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

这比使用纯 Python 稍微快一些,但以内存使用为代价。 子进程在执行命令时将分叉一个与父进程具有相同内存占用的新进程。

Here is what I use, and it seems pretty clean:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

This is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

︶葆Ⅱㄣ 2024-07-26 13:52:49

一行解决方案:

import os
os.system("wc -l  filename")  

我的代码片段:

>>> os.system('wc -l *.txt')

输出:

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total

One line solution:

import os
os.system("wc -l  filename")  

My snippet:

>>> os.system('wc -l *.txt')

Output:

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
别靠近我心 2024-07-26 13:52:49

凯尔的答案

num_lines = sum(1 for line in open('my_file.txt'))

可能是最好的。 另一种方法是:

num_lines =  len(open('my_file.txt').read().splitlines())

以下是两者的性能比较:

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop

Kyle's answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably best. An alternative for this is:

num_lines =  len(open('my_file.txt').read().splitlines())

Here is the comparison of performance of both:

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
终难遇 2024-07-26 13:52:49

我在这个版本中得到了一个小的(4-8%)改进,它重用了常量缓冲区,因此它应该避免任何内存或 GC 开销:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

您可以调整缓冲区大小,也许会看到一些改进。

I got a small (4-8%) improvement with this version which reuses a constant buffer, so it should avoid any memory or GC overhead:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

You can play around with the buffer size and maybe see a little improvement.

葵雨 2024-07-26 13:52:49

对我来说,这个变体将是最快的:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因:缓冲比逐行读取更快,并且 string.count 也非常快

As for me this variant will be the fastest:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

reasons: buffering faster than reading line by line and string.count is also very fast

用心笑 2024-07-26 13:52:49

这段代码更短、更清晰。 这可能是最好的方法:

num_lines = open('yourfile.ext').read().count('\n')

This code is shorter and clearer. It's probably the best way:

num_lines = open('yourfile.ext').read().count('\n')
满地尘埃落定 2024-07-26 13:52:49

只是为了完成之前答案中的方法,我尝试了 fileinput 模块的变体:

import fileinput as fi   

def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

并将一个 6000 万行文件传递给之前答案中的所有所述方法:

mapcount:    6.13
simplecount: 4.59
opcount:     4.43
filecount:  43.3
bufcount:    0.171

令我感到有点惊讶的是 fileinput 如此糟糕并且规模如此之大比所有其他方法更糟糕...

Just to complete the methods in previous answers, I tried a variant with the fileinput module:

import fileinput as fi   

def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

And passed a 60-million-lines file to all the stated methods in previous answers:

mapcount:    6.13
simplecount: 4.59
opcount:     4.43
filecount:  43.3
bufcount:    0.171

It's a little surprise to me that fileinput is that bad and scales far worse than all the other methods...

傲性难收 2024-07-26 13:52:49

我已经修改了这样的缓冲区情况:

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

现在也计算空文件和最后一行(没有 \n)。

I have modified the buffer case like this:

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

Now also empty files and the last line (without \n) are counted.

紧拥背影 2024-07-26 13:52:49

已经有很多答案具有很好的时序比较,但我相信他们只是查看行数来衡量性能(例如,伟大的图来自 Nico Schlömer)。

为了在衡量性能时准确,我们应该查看:

  • 行数 行
  • 的平均大小
  • ... 文件的最终总大小(这可能会影响内存)

首先,OP 的功能(带有一个for) 和函数 sum(1 for line in f) 的表现不太好...

好的竞争者是 mmap >缓冲区

总结一下:根据我的分析(带 SSD 的 Windows 上的 Python 3.9):

  1. 对于行相对较短(100 个字符以内)的大文件:使用带有缓冲区的函数 buf_count_newlines_gen

    def buf_count_newlines_gen(fname: str) ->   整数: 
          """计算文件中的行数""" 
          def _make_gen(阅读器): 
              b = 读取器(1024 * 1024) 
              而b: 
                  产量b 
                  b = 读取器(1024 * 1024) 
    
          将 open(fname, "rb") 作为 f: 
              count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read)) 
          返回计数 
    
      
  2. 对于行可能较长(最多 2000 个字符)的文件,忽略行数:使用函数使用 mmap:count_nb_lines_mmap

    def count_nb_lines_mmap(文件:路径) ->   整数: 
          """计算文件中的行数""" 
          打开(文件,模式=“rb”)作为f: 
              mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) 
              nb_行= 0 
              而 mm.readline(): 
                  nb_行 += 1 
              mm.close() 
              返回 nb_lines 
      
  3. 对于性能非常好的短代码(特别是对于大小大于的文件)至中等尺寸):

    def itercount(filename: str) ->;   整数: 
          """计算文件中的行数""" 
          将 open(文件名, 'rb') 作为 f: 
              返回 sum(1 for _ in f) 
      

以下是不同指标的摘要(7 次运行的 timeit 的平均时间,每次运行 10 个循环)

小文件,短行 小文件,长行 大文件,短行 大文件,长行 较大文件,短行
... 大小 ... 0.04 MB 1.16 MB 318 MB 17 MB 328 MB
... nb 行 ... 第915章 100 个字符 915 行 < 2000 个 字符 389000 行 < 100 个字符 389,000 行 < 2000 个字符 980 万行 < 100 个字符
count_nb_lines_blocks 0.183 毫秒 1.718 毫秒 36.799 毫秒 415.393 毫秒 517.920 毫秒
count_nb_lines_mmap 0.185 毫秒 0.582 毫秒 44.801 毫秒 185.461 毫秒 691.637 毫秒
buf_count_newlines_gen 0.665 毫秒 1.032 毫秒 15.620 毫秒 213.458 毫秒 318.939 毫秒
itercount 0.135 毫秒 0.817 ms 31.292 ms 223.120 ms 628.760 ms

注意:我还在 8 GB 的文件上比较了 count_nb_lines_mmapbuf_count_newlines_gen,其中 970 万行超过 800人物。 buf_count_newlines_gen 的平均时间为 5.39 秒,count_nb_lines_mmap 的平均时间为 4.2 秒,因此后一个函数对于行数较长的文件来说似乎确实更好。

这是我使用过的代码:

import mmap
from pathlib import Path

def count_nb_lines_blocks(file: Path) -> int:
    """Count the number of lines in a file"""

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open(file, encoding="utf-8", errors="ignore") as f:
        return sum(bl.count("\n") for bl in blocks(f))


def count_nb_lines_mmap(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        nb_lines = 0
        while mm.readline():
            nb_lines += 1
        mm.close()
        return nb_lines


def count_nb_lines_sum(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, "r", encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f)


def count_nb_lines_for(file: Path) -> int:
    """Count the number of lines in a file"""
    i = 0
    with open(file) as f:
        for i, _ in enumerate(f, start=1):
            pass
    return i


def buf_count_newlines_gen(fname: str) -> int:
    """Count the number of lines in a file"""
    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def itercount(filename: str) -> int:
    """Count the number of lines in a file"""
    with open(filename, 'rbU') as f:
        return sum(1 for _ in f)


files = [small_file, big_file, small_file_shorter, big_file_shorter, small_file_shorter_sim_size, big_file_shorter_sim_size]
for file in files:
    print(f"File: {file.name} (size: {file.stat().st_size / 1024 ** 2:.2f} MB)")
    for func in [
        count_nb_lines_blocks,
        count_nb_lines_mmap,
        count_nb_lines_sum,
        count_nb_lines_for,
        buf_count_newlines_gen,
        itercount,
    ]:
        result = func(file)
        time = Timer(lambda: func(file)).repeat(7, 10)
        print(f" * {func.__name__}: {result} lines in {mean(time) / 10 * 1000:.3f} ms")
    print()
File: small_file.ndjson (size: 1.16 MB)
 * count_nb_lines_blocks: 915 lines in 1.718 ms
 * count_nb_lines_mmap: 915 lines in 0.582 ms
 * count_nb_lines_sum: 915 lines in 1.993 ms
 * count_nb_lines_for: 915 lines in 3.876 ms
 * buf_count_newlines_gen: 915 lines in 1.032 ms
 * itercount: 915 lines in 0.817 ms

File: big_file.ndjson (size: 317.99 MB)
 * count_nb_lines_blocks: 389000 lines in 415.393 ms
 * count_nb_lines_mmap: 389000 lines in 185.461 ms
 * count_nb_lines_sum: 389000 lines in 485.370 ms
 * count_nb_lines_for: 389000 lines in 967.075 ms
 * buf_count_newlines_gen: 389000 lines in 213.458 ms
 * itercount: 389000 lines in 223.120 ms

File: small_file__shorter.ndjson (size: 0.04 MB)
 * count_nb_lines_blocks: 915 lines in 0.183 ms
 * count_nb_lines_mmap: 915 lines in 0.185 ms
 * count_nb_lines_sum: 915 lines in 0.251 ms
 * count_nb_lines_for: 915 lines in 0.244 ms
 * buf_count_newlines_gen: 915 lines in 0.665 ms
 * itercount: 915 lines in 0.135 ms

File: big_file__shorter.ndjson (size: 17.42 MB)
 * count_nb_lines_blocks: 389000 lines in 36.799 ms
 * count_nb_lines_mmap: 389000 lines in 44.801 ms
 * count_nb_lines_sum: 389000 lines in 59.068 ms
 * count_nb_lines_for: 389000 lines in 81.387 ms
 * buf_count_newlines_gen: 389000 lines in 15.620 ms
 * itercount: 389000 lines in 31.292 ms

File: small_file__shorter_sim_size.ndjson (size: 1.21 MB)
 * count_nb_lines_blocks: 36457 lines in 1.920 ms
 * count_nb_lines_mmap: 36457 lines in 2.615 ms
 * count_nb_lines_sum: 36457 lines in 3.993 ms
 * count_nb_lines_for: 36457 lines in 6.011 ms
 * buf_count_newlines_gen: 36457 lines in 1.363 ms
 * itercount: 36457 lines in 2.147 ms

File: big_file__shorter_sim_size.ndjson (size: 328.19 MB)
 * count_nb_lines_blocks: 9834248 lines in 517.920 ms
 * count_nb_lines_mmap: 9834248 lines in 691.637 ms
 * count_nb_lines_sum: 9834248 lines in 1109.669 ms
 * count_nb_lines_for: 9834248 lines in 1683.859 ms
 * buf_count_newlines_gen: 9834248 lines in 318.939 ms
 * itercount: 9834248 lines in 628.760 ms

There are already so many answers with great timing comparison, but I believe they are just looking at number of lines to measure performance (e.g., the great graph from Nico Schlömer).

To be accurate while measuring performance, we should look at:

  • the number of lines
  • the average size of the lines
  • ... the resulting total size of the file (which might impact memory)

First of all, the function of the OP (with a for) and the function sum(1 for line in f) are not performing that well...

Good contenders are with mmap or buffer.

To summarize: based on my analysis (Python 3.9 on Windows with SSD):

  1. For big files with relatively short lines (within 100 characters): use function with a buffer buf_count_newlines_gen

    def buf_count_newlines_gen(fname: str) -> int:
        """Count the number of lines in a file"""
        def _make_gen(reader):
            b = reader(1024 * 1024)
            while b:
                yield b
                b = reader(1024 * 1024)
    
        with open(fname, "rb") as f:
            count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
        return count
    
    
  2. For files with potentially longer lines (up to 2000 characters), disregarding the number of lines: use function with mmap: count_nb_lines_mmap

    def count_nb_lines_mmap(file: Path) -> int:
        """Count the number of lines in a file"""
        with open(file, mode="rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            nb_lines = 0
            while mm.readline():
                nb_lines += 1
            mm.close()
            return nb_lines
    
  3. For a short code with very good performance (especially for files of size up to medium size):

    def itercount(filename: str) -> int:
        """Count the number of lines in a file"""
        with open(filename, 'rb') as f:
            return sum(1 for _ in f)
    

Here is a summary of the different metrics (average time with timeit on 7 runs with 10 loops each):

Function Small file, short lines Small file, long lines Big file, short lines Big file, long lines Bigger file, short lines
... size ... 0.04 MB 1.16 MB 318 MB 17 MB 328 MB
... nb lines ... 915 lines < 100 chars 915 lines < 2000 chars 389000 lines < 100 chars 389,000 lines < 2000 chars 9.8 millions lines < 100 chars
count_nb_lines_blocks 0.183 ms 1.718 ms 36.799 ms 415.393 ms 517.920 ms
count_nb_lines_mmap 0.185 ms 0.582 ms 44.801 ms 185.461 ms 691.637 ms
buf_count_newlines_gen 0.665 ms 1.032 ms 15.620 ms 213.458 ms 318.939 ms
itercount 0.135 ms 0.817 ms 31.292 ms 223.120 ms 628.760 ms

Note: I have also compared count_nb_lines_mmap and buf_count_newlines_gen on a file of 8 GB, with 9.7 million lines of more than 800 characters. We got an average of 5.39 seconds for buf_count_newlines_gen vs. 4.2 seconds for count_nb_lines_mmap, so this latter function seems indeed better for files with longer lines.

Here is the code I have used:

import mmap
from pathlib import Path

def count_nb_lines_blocks(file: Path) -> int:
    """Count the number of lines in a file"""

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open(file, encoding="utf-8", errors="ignore") as f:
        return sum(bl.count("\n") for bl in blocks(f))


def count_nb_lines_mmap(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        nb_lines = 0
        while mm.readline():
            nb_lines += 1
        mm.close()
        return nb_lines


def count_nb_lines_sum(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, "r", encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f)


def count_nb_lines_for(file: Path) -> int:
    """Count the number of lines in a file"""
    i = 0
    with open(file) as f:
        for i, _ in enumerate(f, start=1):
            pass
    return i


def buf_count_newlines_gen(fname: str) -> int:
    """Count the number of lines in a file"""
    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def itercount(filename: str) -> int:
    """Count the number of lines in a file"""
    with open(filename, 'rbU') as f:
        return sum(1 for _ in f)


files = [small_file, big_file, small_file_shorter, big_file_shorter, small_file_shorter_sim_size, big_file_shorter_sim_size]
for file in files:
    print(f"File: {file.name} (size: {file.stat().st_size / 1024 ** 2:.2f} MB)")
    for func in [
        count_nb_lines_blocks,
        count_nb_lines_mmap,
        count_nb_lines_sum,
        count_nb_lines_for,
        buf_count_newlines_gen,
        itercount,
    ]:
        result = func(file)
        time = Timer(lambda: func(file)).repeat(7, 10)
        print(f" * {func.__name__}: {result} lines in {mean(time) / 10 * 1000:.3f} ms")
    print()
File: small_file.ndjson (size: 1.16 MB)
 * count_nb_lines_blocks: 915 lines in 1.718 ms
 * count_nb_lines_mmap: 915 lines in 0.582 ms
 * count_nb_lines_sum: 915 lines in 1.993 ms
 * count_nb_lines_for: 915 lines in 3.876 ms
 * buf_count_newlines_gen: 915 lines in 1.032 ms
 * itercount: 915 lines in 0.817 ms

File: big_file.ndjson (size: 317.99 MB)
 * count_nb_lines_blocks: 389000 lines in 415.393 ms
 * count_nb_lines_mmap: 389000 lines in 185.461 ms
 * count_nb_lines_sum: 389000 lines in 485.370 ms
 * count_nb_lines_for: 389000 lines in 967.075 ms
 * buf_count_newlines_gen: 389000 lines in 213.458 ms
 * itercount: 389000 lines in 223.120 ms

File: small_file__shorter.ndjson (size: 0.04 MB)
 * count_nb_lines_blocks: 915 lines in 0.183 ms
 * count_nb_lines_mmap: 915 lines in 0.185 ms
 * count_nb_lines_sum: 915 lines in 0.251 ms
 * count_nb_lines_for: 915 lines in 0.244 ms
 * buf_count_newlines_gen: 915 lines in 0.665 ms
 * itercount: 915 lines in 0.135 ms

File: big_file__shorter.ndjson (size: 17.42 MB)
 * count_nb_lines_blocks: 389000 lines in 36.799 ms
 * count_nb_lines_mmap: 389000 lines in 44.801 ms
 * count_nb_lines_sum: 389000 lines in 59.068 ms
 * count_nb_lines_for: 389000 lines in 81.387 ms
 * buf_count_newlines_gen: 389000 lines in 15.620 ms
 * itercount: 389000 lines in 31.292 ms

File: small_file__shorter_sim_size.ndjson (size: 1.21 MB)
 * count_nb_lines_blocks: 36457 lines in 1.920 ms
 * count_nb_lines_mmap: 36457 lines in 2.615 ms
 * count_nb_lines_sum: 36457 lines in 3.993 ms
 * count_nb_lines_for: 36457 lines in 6.011 ms
 * buf_count_newlines_gen: 36457 lines in 1.363 ms
 * itercount: 36457 lines in 2.147 ms

File: big_file__shorter_sim_size.ndjson (size: 328.19 MB)
 * count_nb_lines_blocks: 9834248 lines in 517.920 ms
 * count_nb_lines_mmap: 9834248 lines in 691.637 ms
 * count_nb_lines_sum: 9834248 lines in 1109.669 ms
 * count_nb_lines_for: 9834248 lines in 1683.859 ms
 * buf_count_newlines_gen: 9834248 lines in 318.939 ms
 * itercount: 9834248 lines in 628.760 ms
初见 2024-07-26 13:52:49

如果想在 Linux 中便宜地获取 Python 的行数,我推荐这种方法:

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path 可以是抽象文件路径,也可以是相对路径。 希望这会有所帮助。

If one wants to get the line count cheaply in Python in Linux, I recommend this method:

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path can be both abstract file path or relative path. Hope this may help.

掩耳倾听 2024-07-26 13:52:49

这是对其他一些答案的元评论。

  1. 行读取和缓冲 \n 计数技术不会为每个文件返回相同的答案,因为某些文本文件在最后一行末尾没有换行符。 您可以通过检查最后一个非空缓冲区的最后一个字节并在不是 b'\n' 时添加 1 来解决此问题。

  2. 在 Python 3 中,以文本模式和二进制模式打开文件可能会产生不同的结果,因为文本模式默认将 CR、LF 和 CRLF 识别为行结尾(将它们全部转换为 '\n'< /code>),而在二进制模式下,如果计算 b'\n',则仅计算 LF 和 CRLF。 无论您是按行读取还是固定大小的缓冲区读取,这都适用。 经典 Mac OS 使用 CR 作为行结尾; 我不知道这些文件现在有多常见。


  3. 缓冲区读取方法使用有限数量的 RAM,与文件大小无关,而行读取方法可以在最坏的情况下立即将整个文件读入 RAM(特别是如果文件使用 CR 行结尾)。 在最坏的情况下,由于动态调整行缓冲区大小以及(如果在文本模式下打开)Unicode 解码和存储的开销,它可能会使用比文件大小更多的 RAM。

  4. 您可以通过预分配字节数组并使用 readinto 而不是 read 来提高缓冲方法的内存使用率,甚至可能提高速度。 现有的答案之一(几乎没有投票)可以做到这一点,但它有错误(它重​​复计算了一些字节)。

  5. 顶部缓冲区读取答案使用大缓冲区 (1MiB)。 由于操作系统预读,使用较小的缓冲区实际上可以更快。 如果您一次读取 32K 或 64K,操作系统可能会在您请求之前开始将下一个 32K/64K 读入缓存,并且每次访问内核都会几乎立即返回。 如果您一次读取 1MiB,则操作系统不太可能推测读取整个兆字节。 它可能会预读较少的数据,但您仍然会花费大量时间坐在内核中等待磁盘返回其余数据。

This is a meta-comment on some of the other answers.

  1. The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'.

  2. In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days.

  3. The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage.

  4. You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes).

  5. The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.

魂归处 2024-07-26 13:52:49

已经有很多答案了,但不幸的是,它们中的大多数只是一个几乎无法优化的问题上的微小经济体......

我参与了几个项目,其中行数是软件的核心功能,并尽可能快地处理巨大的问题。文件数量至关重要。

行计数的主要瓶颈是 I/O 访问,因为您需要读取每一行才能检测行返回字符,根本没有办法解决。 第二个潜在瓶颈是内存管理:一次加载的越多,处理速度就越快,但与第一个瓶颈相比,这个瓶颈可以忽略不计。

因此,除了诸如禁用 GC收集和其他微观管理技巧:

  1. 硬件解决方案:主要且最明显的方法是非编程:购买非常快的SSD/闪存硬盘驾驶。 到目前为止,这是获得最大速度提升的方法。

  2. 数据预处理和行并行化:如果您生成或可以修改您处理的文件的生成方式,或者您可以预处理它们。 首先将行 return 转换为 Unix 样式 (\n ),因为与 Windows 相比,这将节省 1 个字符(节省的不多,但很容易获得),其次也是最重要的是,您可以编写固定长度的行。 如果您需要可变长度,如果长度变化不是那么大,您可以填充较小的线。 这样,您可以立即计算出文件总大小的行数,从而访问速度更快。 此外,通过固定长度的行,您通常不仅可以预先分配内存以加快处理速度,而且还可以并行处理行! 当然,并行化对于随机访问 I/O 比 HDD 快得多的闪存/SSD 磁盘效果更好。通常,问题的最佳解决方案是对其进行预处理,以便它更好地满足您的最终目的。

  3. 磁盘并行化+硬件解决方案:如果您可以购买多个硬盘(如果可能的话还可以购买SSD闪存盘),那么您甚至可以通过利用并行化、存储您的数据来超越一个磁盘的速度。在磁盘之间以平衡的方式(最简单的是按总大小平衡)文件,然后从所有这些磁盘并行读取。 然后,您可以期望获得与您拥有的磁盘数量成比例的乘数提升。 如果购买多个磁盘不适合您,那么并行化可能不会有帮助(除非您的磁盘像某些专业级磁盘一样具有多个读取头,但即使如此,磁盘的内部高速缓存和 PCB 电路也可能成为瓶颈并阻止您完全并行地使用所有磁头,此外,您还必须为您将使用的该硬盘设计特定的代码,因为您需要知道确切的簇映射,以便将文件存储在不同磁头下的簇上,等等之后你可以用不同的头脑来阅读它们)。 事实上,众所周知,顺序读取几乎总是比随机读取快,并且单个磁盘上的并行化将具有比顺序读取更类似于随机读取的性能(您可以使用CrystalDiskMark)。

如果这些都不是一个选择,那么您只能依靠微观管理技巧来将行计数功能的速度提高几个百分点,但不要指望有任何真正重要的效果。 相反,您可以预期,与您将看到的速度改进的回报相比,您花费在调整上的时间将是不成比例的。

There are a lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...

I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.

The main bottleneck with line count is I/O access, as you need to read each line in order to detect the line return character, there is simply no way around. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.

Hence, there are three major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling GC collection and other micro-managing tricks:

  1. Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.

  2. Data preprocessing and lines parallelization: if you generate or can modify how the files you process are generated, or if it's acceptable that you can preprocess them. First convert the line return to Unix style (\n) as this will save 1 character compared to Windows (not a big save, but it's an easy gain), and secondly and most importantly, you can potentially write lines of fixed length. If you need variable length, you can pad smaller lines if the length variability is not that big. This way, you can calculate instantly the number of lines from the total file size, which is much faster to access. Also, by having fixed length lines, not only can you generally pre-allocate memory which will speed up processing, but also you can process lines in parallel! Of course, parallelization works better with a flash/SSD disk that has much faster random access I/O than HDDs.. Often, the best solution to a problem is to preprocess it so that it better fits your end purpose.

  3. Disks parallelization + hardware solution: if you can buy multiple hard disks (and if possible SSD flash disks), then you can even go beyond the speed of one disk by leveraging parallelization, by storing your files in a balanced way (easiest is to balance by total size) among disks, and then read in parallel from all those disks. Then, you can expect to get a multiplier boost in proportion with the number of disks you have. If buying multiple disks is not an option for you, then parallelization likely won't help (except if your disk has multiple reading headers like some professional-grade disks, but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel, plus you have to devise a specific code for this hard drive you'll use because you need to know the exact cluster mapping so that you store your files on clusters under different heads, and so that you can read them with different heads after). Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallelization on a single disk will have a performance more similar to random reading than sequential reading (you can test your hard drive speed in both aspects using CrystalDiskMark for example).

If none of those are an option, then you can only rely on micromanaging tricks to improve by a few percents the speed of your line counting function, but don't expect anything really significant. Rather, you can expect the time you'll spend tweaking will be disproportionate compared to the returns in speed improvement you'll see.

缺⑴份安定 2024-07-26 13:52:49
print open('file.txt', 'r').read().count("\n") + 1
print open('file.txt', 'r').read().count("\n") + 1
吃不饱 2024-07-26 13:52:49

使用 Numba

我们可以使用 Numba 将我们的函数 JIT(及时)编译为机器代码。 def numbacountparallel(fname) 运行速度提高 2.8 倍
比问题中的 def file_len(fname)

注意:

在运行基准测试之前,操作系统已经将文件缓存到内存中,因为我在电脑上没有看到太多磁盘活动。
第一次读取文件时,时间会慢很多,使得使用 Numba 的时间优势变得微不足道。

第一次调用该函数时,JIT 编译需要额外的时间。

如果我们要做的不仅仅是计算行数,这将很有用。

Cython 是另一种选择。

结论

由于计算行数将受到 I/O 限制,因此请使用问题中的 def file_len(fname) ,除非您想做的不仅仅是计算行数。

import timeit

from numba import jit, prange
import numpy as np

from itertools import (takewhile,repeat)

FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file
CR = ord('\n')


# Copied from the question above. Used as a benchmark
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


# Copied from another answer. Used as a benchmark
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.read(1024*1024*10) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )


# Single thread
@jit(nopython=True)
def numbacountsingle_chunk(bs):

    c = 0
    for i in range(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountsingle(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountsingle_chunk(chunk)
        total += lines
        if not chunk:
            break

    return total


# Multi thread
@jit(nopython=True, parallel=True)
def numbacountparallel_chunk(bs):

    c = 0
    for i in prange(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountparallel(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8))
        total += lines
        if not chunk:
            break

    return total

print('numbacountparallel')
print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time.
print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100))

print('\nnumbacountsingle')
print(numbacountsingle(FILE))
print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100))

print('\nfile_len')
print(file_len(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

print('\nrawincount')
print(rawincount(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

每个函数 100 次调用的时间(以秒为单位)

numbacountparallel
371755
2.8007332000000003

numbacountsingle
371755
3.1508585999999994

file_len
371755
6.7945494

rawincount
371755
6.815438

Using Numba

We can use Numba to JIT (Just in time) compile our function to machine code. def numbacountparallel(fname) runs 2.8x faster
than def file_len(fname) from the question.

Notes:

The OS had already cached the file to memory before the benchmarks were run as I don't see much disk activity on my PC.
The time would be much slower when reading the file for the first time making the time advantage of using Numba insignificant.

The JIT compilation takes extra time the first time the function is called.

This would be useful if we were doing more than just counting lines.

Cython is another option.

Conclusion

As counting lines will be I/O bound, use the def file_len(fname) from the question unless you want to do more than just count lines.

import timeit

from numba import jit, prange
import numpy as np

from itertools import (takewhile,repeat)

FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file
CR = ord('\n')


# Copied from the question above. Used as a benchmark
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


# Copied from another answer. Used as a benchmark
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.read(1024*1024*10) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )


# Single thread
@jit(nopython=True)
def numbacountsingle_chunk(bs):

    c = 0
    for i in range(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountsingle(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountsingle_chunk(chunk)
        total += lines
        if not chunk:
            break

    return total


# Multi thread
@jit(nopython=True, parallel=True)
def numbacountparallel_chunk(bs):

    c = 0
    for i in prange(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountparallel(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8))
        total += lines
        if not chunk:
            break

    return total

print('numbacountparallel')
print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time.
print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100))

print('\nnumbacountsingle')
print(numbacountsingle(FILE))
print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100))

print('\nfile_len')
print(file_len(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

print('\nrawincount')
print(rawincount(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

Time in seconds for 100 calls to each function

numbacountparallel
371755
2.8007332000000003

numbacountsingle
371755
3.1508585999999994

file_len
371755
6.7945494

rawincount
371755
6.815438
秋日私语 2024-07-26 13:52:49

简单方法:

  1. 方法1

    >>>   f = len(open("myfile.txt").readlines()) 
      >>>>>   F 
      

    输出:

    430 
      
  2. 方法 2

    >>>   f = open("myfile.txt").read().count('\n') 
      >>>>>   F 
      

    输出:

    430 
      
  3. 方法 3

    num_lines = len(list(open('myfile.txt'))) 
      

Simple methods:

  1. Method 1

    >>> f = len(open("myfile.txt").readlines())
    >>> f
    

    Output:

    430
    
  2. Method 2

    >>> f = open("myfile.txt").read().count('\n')
    >>> f
    

    Output:

    430
    
  3. Method 3

    num_lines = len(list(open('myfile.txt')))
    
凑诗 2024-07-26 13:52:49
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
究竟谁懂我的在乎 2024-07-26 13:52:49

大文件的替代方法是使用 xreadlines():

count = 0
for line in open(thefilepath).xreadlines(  ): count += 1

对于 Python 3,请参阅:什么替代 xreadlines() Python 3?

An alternative for big files is using xreadlines():

count = 0
for line in open(thefilepath).xreadlines(  ): count += 1

For Python 3 please see: What substitutes xreadlines() in Python 3?

青丝拂面 2024-07-26 13:52:49

打开文件的结果是一个迭代器,它可以转换为一个序列,该序列具有长度:

with open(filename) as f:
   return len(list(f))

这比显式循环更简洁,并且避免了枚举。

The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

This is more concise than your explicit loop, and avoids the enumerate.

不必在意 2024-07-26 13:52:49

这可以工作:

import fileinput
import sys

counter = 0
for line in fileinput.input([sys.argv[1]]):
    counter += 1

fileinput.close()
print counter

This could work:

import fileinput
import sys

counter = 0
for line in fileinput.input([sys.argv[1]]):
    counter += 1

fileinput.close()
print counter
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文