Does Python automatically parallelize IO and CPU- or memory-bound parts?

Posted 2024-07-20 06:39:10


This is a follow-up question to a previous one.

Consider this code, which is less toyish than the one in the previous question (but still much simpler than my real one)

import sys
data=[]

for line in open(sys.argv[1]):
    data.append(line[-1])

print data[-1]

Now, I was expecting a longer run time (my benchmark file is 65150224 lines long), possibly much longer. That was not the case: it runs in ~2 minutes on the same hardware as before!

Is data.append() very lightweight? I don't believe so, thus I wrote this fake code to test it:

data=[]
counter=0
string="a\n"

for counter in xrange(65150224):
    data.append(string[-1])

print data[-1]

This runs in 1.5 to 3 minutes (there is strong variability among runs).

Why don't I get 3.5 to 5 minutes in the former program? Obviously data.append() is happening in parallel with the IO.

This is good news!

But how does it work? Is it a documented feature? Is there any requirement on my code that I should follow to make it work as well as possible (besides load-balancing IO and memory/CPU activities)? Or is it just plain buffering/caching in action?

Again, I tagged this question "linux", because I'm interested only in Linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you think it's worth doing.


Comments (5)

我做我的改变 2024-07-27 06:39:10


Obviously data.append() is happening in parallel with the IO.

I'm afraid not. It is possible to parallelize IO and computation in Python, but it doesn't happen magically.
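For instance, one explicit way to get that overlap is to read lines in a background thread and hand them to the consumer through a queue. This is a sketch in modern Python 3 (the name `parallel_lines` and the queue size are my own choices, not anything from the original post); it can help because the reader thread releases the GIL while blocked in IO:

```python
import threading
import queue

def parallel_lines(path, maxsize=10_000):
    """Yield lines from `path`, reading them in a background thread.

    The reader thread blocks in IO while the consumer does CPU work,
    so the two phases can genuinely overlap.
    """
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()  # unique end-of-stream marker

    def reader():
        with open(path) as f:
            for line in f:
                q.put(line)
        q.put(sentinel)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

With this, the original loop becomes `data = [line[-1] for line in parallel_lines(sys.argv[1])]`. Whether it actually wins depends on how much CPU work is done per line; for trivial per-line work the queue overhead can easily cost more than it saves.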

One thing you could do is use posix_fadvise(2) to give the OS a hint that you plan to read the file sequentially (POSIX_FADV_SEQUENTIAL).

In some rough tests doing "wc -l" on a 600 meg file (an ISO) the performance increased by about 20%. Each test was done immediately after clearing the disk cache.

For a Python interface to fadvise see python-fadvise.
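(These days a third-party module is no longer needed for this: since Python 3.3 the standard library exposes the call as `os.posix_fadvise` on POSIX systems. A minimal sketch, using a line-counting helper of my own as the example workload:)

```python
import os

def count_lines_sequential(path):
    # Hint the kernel that we will read this file front to back, so it
    # can read ahead more aggressively (POSIX_FADV_SEQUENTIAL).
    # Requires Linux/POSIX and Python >= 3.3.
    with open(path, "rb") as f:
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return sum(1 for _ in f)
```

The advice is purely a hint; the call is cheap and a no-op semantically, so it is safe to issue unconditionally on platforms that have it.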

别靠近我心 2024-07-27 06:39:10


How big are the lines in your file? If they're not very long (anything under about 1K probably qualifies) then you're likely seeing performance gains because of input buffering.
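That buffering is easy to observe: wrap a fake raw stream that counts how often Python actually asks it for data, then iterate it line by line. (A sketch; `CountingRaw` is a toy class of my own, not part of the `io` module.)

```python
import io

class CountingRaw(io.RawIOBase):
    """A raw 'file' of n short lines that counts readinto() calls."""
    def __init__(self, n):
        self.data = b"a\n" * n
        self.pos = 0
        self.reads = 0
    def readable(self):
        return True
    def readinto(self, b):
        self.reads += 1
        chunk = self.data[self.pos:self.pos + len(b)]
        b[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

raw = CountingRaw(100_000)  # 200,000 bytes of "a\n"
nlines = sum(1 for _ in io.BufferedReader(raw))
print(nlines, "lines served by", raw.reads, "raw reads")
```

With the default buffer size (`io.DEFAULT_BUFFER_SIZE`, typically 8 KiB), thousands of short lines are served per underlying read, so per-line IO cost is tiny.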

本宫微胖 2024-07-27 06:39:10


Why do you think list.append() would be a slower operation? It is extremely fast: the internal pointer array a list uses to hold references to its objects is allocated in increasingly large blocks, so most appends do not actually reallocate the array and can simply increment the length counter, set a pointer, and incref.
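You can watch that over-allocation happen with `sys.getsizeof` (the exact growth pattern is a CPython implementation detail, not a language guarantee, so the precise jump points vary by version):

```python
import sys

lst, sizes = [], []
for _ in range(64):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# The allocation only grows at a handful of points; every other
# append just bumps the length counter and stores a pointer.
reallocs = sum(1 for a, b in zip(sizes, sizes[1:]) if b > a)
print(reallocs, "size jumps over 64 appends")
```

On CPython the count of size jumps is far smaller than the number of appends, which is why append is amortized O(1).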

疏忽 2024-07-27 06:39:10


I don't see any evidence that "data.append() is happening in parallel with the IO." Like Benji, I don't think this is automatic in the way you think. You showed that doing data.append(line[-1]) takes about the same amount of time as lc = lc + 1 (essentially no time at all, compared to the IO and line splitting). It's not really surprising that data.append(line[-1]) is very fast. One would expect the whole line to be in a fast cache, and as noted append prepares buffers ahead of time and only rarely has to reallocate. Moreover, line[-1] will always be '\n', except possibly for the last line of the file (no idea if Python optimizes for this).

The only part I'm a little surprised about is that the xrange is so variable. I would expect it to always be faster, since there's no IO, and you're not actually using the counter.
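A quick way to sanity-check the claim that the append costs about as much as a bare increment is `timeit` (a sketch in modern Python 3; absolute timings vary by machine, so no numbers are promised, only that both are on the order of a fraction of a microsecond per iteration):

```python
import timeit

n = 1_000_000
append_t = timeit.timeit("data.append(s[-1])",
                         setup="data = []; s = 'a\\n'", number=n)
incr_t = timeit.timeit("lc = lc + 1", setup="lc = 0", number=n)
print(f"append: {append_t:.3f}s  increment: {incr_t:.3f}s over {n} iterations")
```

Both loops do essentially no work per iteration, which supports the point above: the dominant cost in the original program has to be the IO and line splitting, not the append.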

盗梦空间 2024-07-27 06:39:10


If your run times are varying by that amount for the second example, I'd suspect your method of timing or outside influences (other processes / system load) to be skewing the times to the point where they don't give any reliable information.
