Multithreaded reading of txt files in Python

Posted on 2024-12-10 00:43:57

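I'm trying to read a file in Python (scan its lines and look for terms) and write the results: say, a counter for each term. I need to do this for a large number of files (more than 3000). Can it be done with multiple threads? If so, how?

So, the scenario is like this:

- Read each file and scan its lines
- Write the counters for all the files I've read to the same output file

The second question is whether it improves the read/write speed.

Hope it's clear enough. Thanks,

Ron.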

送君千里 2024-12-17 00:43:57

I agree with @aix, multiprocessing is definitely the way to go. Regardless, you will be I/O bound: you can only read so fast, no matter how many parallel processes you have running. But some speedup is easy to get.

Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).

# Python 3 port of the original: os.path.walk() no longer exists, so os.walk() is used.
import os
import time
from multiprocessing import Pool

def process_file(name):
    ''' Process one file: count number of lines and words '''
    linecount = 0
    wordcount = 0
    # errors='replace' keeps a stray non-UTF-8 byte from aborting the whole run
    with open(name, 'r', encoding='utf-8', errors='replace') as inp:
        for line in inp:
            linecount += 1
            # split() with no argument ignores repeated spaces and the trailing newline
            wordcount += len(line.split())

    return name, linecount, wordcount

def list_files(top):
    ''' Collect the path of every file below top (directories are skipped) '''
    return [os.path.join(dirname, name)
            for dirname, _, names in os.walk(top)
            for name in names]

def process_files_parallel(paths):
    ''' Process each file in parallel via Pool.map() '''
    with Pool() as pool:
        return pool.map(process_file, paths)

def process_files(paths):
    ''' Process each file in sequence via map() '''
    return list(map(process_file, paths))

if __name__ == '__main__':
    paths = list_files('input/')

    start = time.time()
    process_files(paths)
    print("process_files()", time.time() - start)

    start = time.time()
    process_files_parallel(paths)
    print("process_files_parallel()", time.time() - start)

When I run this on my dual core machine there is a noticeable (but not 2x) speedup:

$ python process_files.py
process_files() 1.71218085289
process_files_parallel() 1.28905105591
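
To put a number on "noticeable (but not 2x)", here is a small sketch that turns the two timings into a speedup factor and a parallel efficiency. The helper names `speedup` and `efficiency` are made up for illustration, and the worker count of 2 is an assumption that `Pool()` started one worker per core on the dual-core machine.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run was than the serial one."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Fraction of the ideal `workers`-fold speedup that was achieved."""
    return speedup(serial_seconds, parallel_seconds) / workers


# Timings copied from the run above; 2 workers is an assumption (one per core).
serial, parallel = 1.71218085289, 1.28905105591
print(f"speedup:    {speedup(serial, parallel):.2f}x")
print(f"efficiency: {efficiency(serial, parallel, 2):.0%}")
```

That works out to roughly a 1.33x speedup, about two thirds of what two perfectly independent workers could deliver. The rest goes to disk reads and process start-up.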

If the files are small enough to fit in memory, and you have lots of processing to do that isn't I/O bound, then you should see an even better improvement.
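
The question also asks for per-term counters written to a single output file. One way to sketch that on top of the same Pool idea is to let the workers only count, and have the parent process do all the writing, so no two processes ever touch the output file at once. Everything named here is hypothetical and not from the original answer: the `TERMS` tuple, the functions `count_terms` and `write_counts`, and the `counts.tsv` file name. Matching is by plain lowercase substring, which is an assumption about what "looking for terms" means. The code targets Python 3.

```python
import os
from collections import Counter
from multiprocessing import Pool

# Hypothetical search terms; replace with the terms you actually need.
TERMS = ("whale", "ship", "sea")


def count_terms(path):
    """Return (path, Counter) with how often each term occurs in one file."""
    counts = Counter({term: 0 for term in TERMS})
    with open(path, "r", encoding="utf-8", errors="replace") as inp:
        for line in inp:
            lowered = line.lower()
            for term in TERMS:
                # Substring match: "sea" also matches inside "season".
                counts[term] += lowered.count(term)
    return path, counts


def write_counts(paths, out_path, processes=None):
    """Count terms in parallel; only the parent process writes the output."""
    totals = Counter()
    with Pool(processes) as pool, open(out_path, "w", encoding="utf-8") as out:
        # imap_unordered hands back each result as soon as a worker finishes it.
        for path, counts in pool.imap_unordered(count_terms, paths):
            row = "\t".join(str(counts[term]) for term in TERMS)
            out.write(f"{path}\t{row}\n")
            totals.update(counts)
    return totals


if __name__ == "__main__":
    paths = [os.path.join(dirname, name)
             for dirname, _, names in os.walk("input/")
             for name in names]
    print(write_counts(paths, "counts.tsv"))
```

Because results arrive in completion order, the rows in the output file are not sorted by file name. Swap `imap_unordered` for `imap` if the order matters more than getting results early.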
