Reading txt files with multiple threads in Python
I'm trying to read a file in Python (scan its lines and look for terms) and write the results, say, a counter for each term. I need to do that for a large number of files (more than 3000). Is it possible to do this with multiple threads? If so, how?
So, the scenario is like this:
- Read each file and scan its lines.
- Write the counters for all the files I've read to the same output file.
My second question is whether it improves read/write speed.
I hope that is clear enough. Thanks,
Ron.
Comments (2)
I agree with @aix: `multiprocessing` is definitely the way to go. Regardless, you will be I/O bound: you can only read so fast, no matter how many parallel processes you have running. But there can easily be some speedup. Consider the following (input/ is a directory that contains several .txt files from Project Gutenberg).
When I run this on my dual-core machine there is a noticeable (but not 2x) speedup.
If the files are small enough to fit in memory, and you have lots of processing to do that isn't I/O bound, then you should see an even better improvement.
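The approach described above can be sketched roughly as follows. This is only an illustration, not the answer's original listing: the names `count_terms` and `count_all`, the `TERMS` tuple, and the tab-separated output format are all assumptions. Each worker process scans one file, and only the parent process writes the single output file, so no locking is needed.

```python
import os
import sys
from collections import Counter
from multiprocessing import Pool

# Hypothetical search terms; replace with the terms you actually need.
TERMS = ("whale", "ship", "sea")


def count_terms(path):
    """Scan one file line by line and count case-insensitive substring matches."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lowered = line.lower()
            for term in TERMS:
                counts[term] += lowered.count(term)
    return path, counts


def count_all(input_dir, output_path, processes=None):
    """Count terms in every .txt file of input_dir in parallel, then write one file."""
    paths = sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.endswith(".txt")
    )
    # processes=None lets Pool use one worker per available CPU.
    with Pool(processes) as pool:
        results = pool.map(count_terms, paths)
    # Only the parent writes, so the workers never contend for the output file.
    with open(output_path, "w", encoding="utf-8") as out:
        for path, counts in results:
            row = "\t".join(f"{term}={counts[term]}" for term in TERMS)
            out.write(f"{path}\t{row}\n")
    return results


if __name__ == "__main__":
    # Assumed usage: python count_terms.py input/ counts.tsv
    if len(sys.argv) == 3:
        count_all(sys.argv[1], sys.argv[2])
```

Timing this against a plain loop over `count_terms` on your own files is the only reliable way to see how much of the run is I/O bound.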
Yes, it should be possible to do this in parallel.
However, in Python it's hard to achieve parallelism with multiple threads: in standard CPython builds, the global interpreter lock lets only one thread execute Python bytecode at a time. For this reason `multiprocessing` is the better default choice for doing things in parallel. It is hard to say what kind of speedup you can expect. It depends on what fraction of the workload can be done in parallel (the more the better) and what fraction has to be done serially (the less the better).
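That dependence on the parallel fraction is usually expressed as Amdahl's law. A small sketch, where the function name `amdahl_speedup` and the example fractions are illustrative assumptions rather than measurements of this workload:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is split across workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Half the work parallelisable on 2 workers gives about 1.33x, not 2x.
print(round(amdahl_speedup(0.5, 2), 2))
```

If most of the run time is spent waiting on the disk, the parallel fraction is small and extra processes help little.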