Why is reading lines from stdin much slower in C++ than in Python?
I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not yet an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something.
(TLDR answer: include the statement cin.sync_with_stdio(false); or just use fgets instead.
TLDR results: scroll all the way down to the bottom of my question and look at the table.)
C++ code:
#include <iostream>
#include <time.h>
using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}
// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp
Python Equivalent:
#!/usr/bin/env python
import time
import sys

count = 0
start_time = time.time()
for line in sys.stdin:
    count += 1
delta_sec = int(time.time() - start_time)
if delta_sec > 0:
    lines_per_sec = int(round(count / delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
                                                           lines_per_sec))
Here are my results:
$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889
$ cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000
I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.
$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP: Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Tiny benchmark addendum and recap
For completeness, I thought I'd update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here's the comparison, with several solutions/approaches:
Implementation | Lines per second
---|---
python (default) | 3,571,428
cin (default/naive) | 819,672
cin (no sync) | 12,500,000
fgets | 14,285,714
wc (not fair comparison) | 54,644,808
tl;dr: Because of different default settings in C++ requiring more system calls.
By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

cin.sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE*-based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example, if more input were read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time, as needed, using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you know what you are doing, so they provided the sync_with_stdio method.
Just out of curiosity, I've taken a look at what happens under the hood, and I've used dtruss/strace on each test.

C++ syscalls:

sudo dtruss -c ./a.out < in

Python syscalls:

sudo dtruss -c ./a.py < in
I'm a few years behind here, but:
In 'Edit 4/5/6' of the original post, you are using the construction:

This is wrong in a couple of different ways:

1. You're actually timing the execution of cat, not your benchmark. The 'user' and 'sys' CPU usage displayed by time are those of cat, not your benchmarked program. Even worse, the 'real' time is also not necessarily accurate. Depending on the implementation of cat and of pipelines in your local OS, it is possible that cat writes a final giant buffer and exits long before the reader process finishes its work.

2. Use of cat is unnecessary and in fact counterproductive; you're adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and, in certain generations of computers, I/O faster than CPU), the mere fact that cat was running could substantially color the results. You are also subject to whatever input and output buffering and other processing cat may do. (This would likely earn you a 'Useless Use Of Cat' award if I were Randal Schwartz.)

A better construction would be:

In this statement it is the shell which opens big_file, passing it to your program (well, actually to time, which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you're trying to benchmark. This gets you a real reading of its performance without spurious complications.

I will mention two possible, but actually wrong, 'fixes' which could also be considered (but I 'number' them differently as these are not things which were wrong in the original post):

A. You could 'fix' this by timing only your program:

B. Or by timing the entire pipeline:

These are wrong for the same reasons as #2: they're still using cat unnecessarily. I mention them for a few reasons:

- They're more 'natural' for people who aren't entirely comfortable with the I/O redirection facilities of the POSIX shell.
- There may be cases where cat is needed (e.g. the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output).
- In practice, on modern machines, the added cat in the pipeline is probably of no real consequence.

But I say that last thing with some hesitation. If we examine the last result in 'Edit 5' -- this claims that cat consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

would have taken only the remaining 0.49 seconds! Probably not: cat here had to pay for the read() system calls (or equivalent) which transferred the file from 'disk' (actually buffer cache), as well as the pipe writes to deliver them to wc. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

Still, I predict you would be able to measure the difference between cat file | wc -l and wc -l < file and find a noticeable (two-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time, which would however amount to a smaller fraction of its larger total time.

In fact I did some quick tests with a 1.5 gigabyte file of garbage on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually 'best of 3' results, after priming the cache, of course):

Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I'm using the shell (bash) built-in 'time' command, which is cognizant of the pipeline, and I'm on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime, showing that it can only time the single pipeline element passed to it on its command line. Also, the shell's output gives milliseconds while /usr/bin/time only gives hundredths of a second.

So at the efficiency level of wc -l, the cat makes a huge difference: 409 / 283 = 1.453, or 45.3% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! On my random it-was-there-at-the-time test box.

I should add that there is at least one other significant difference between these styles of testing, and I can't say whether it is a benefit or a fault; you have to decide this yourself:

When you run cat big_file | /usr/bin/time my_program, your program is receiving input from a pipe, at precisely the pace sent by cat, and in chunks no larger than written by cat.

When you run /usr/bin/time my_program < big_file, your program receives an open file descriptor to the actual file. Your program -- or in many cases the I/O libraries of the language in which it was written -- may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the cat binary.

Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the cat result by some small factor to 'forgive' the cost of running cat itself.
I reproduced the original result on my computer using g++ on a Mac.
Adding the following statements to the C++ version just before the while loop brings it in line with the Python version:

sync_with_stdio improved the speed to 2 seconds, and setting a larger buffer brought it down to 1 second.
getline, stream operators, and scanf can be convenient if you don't care about file loading time or if you are loading small text files. But if performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

Here's an example:

If you want, you can wrap a stream around that buffer for more convenient access, like this:

Also, if you are in control of the file, consider using a flat binary data format instead of text. It's more reliable to read and write, because you don't have to deal with all the ambiguities of whitespace. It's also smaller and much faster to parse.
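The example code from this answer did not survive extraction; here is a minimal sketch of the approach it describes, assuming the file fits in memory (the function name slurp is mine):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file into one string in a single bulk transfer.
std::string slurp(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::ostringstream contents;
    contents << file.rdbuf();  // one streambuf-to-streambuf copy
    return contents.str();
}
```

Wrapping the result in an std::istringstream then gives the usual stream interface over the in-memory copy.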
The following code was faster for me than the other code posted here so far:

(Visual Studio 2013, 64-bit, 500 MB file with line lengths uniformly in [0, 1000).)

It beats all my Python attempts by more than a factor of 2.
By the way, the reason the line count for the C++ version is one greater than the count for the Python version is that the eof flag only gets set when an attempt is made to read beyond eof. So the correct loop would be:
In your second example (with scanf()), the reason it is still slower might be that scanf("%s") parses the string and looks for any space character (space, tab, newline).

Also, yes, CPython does some caching to avoid hard-disk reads.
A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf as in the below, but it is still two times slower than Python.
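The scanf code alluded to above was lost in extraction; as a hypothetical illustration (my own, and note that %s reads whitespace-delimited tokens rather than whole lines):

```cpp
#include <cstdio>

// Hypothetical sketch, not the answer's original (lost) code.
// fscanf with %s skips whitespace and reads one token at a time,
// so this counts whitespace-separated tokens, not lines.
long count_tokens(std::FILE* f) {
    char buf[4096];
    long count = 0;
    while (std::fscanf(f, "%4095s", buf) == 1)
        ++count;
    return count;
}
```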
Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you would see another boost in performance: fgets is the fastest C++ function for string input.

BTW, I didn't know about that sync thing; nice. But you should still try fgets.
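As a sketch of the fgets approach recommended above (my own illustration, not code from the answer):

```cpp
#include <cstdio>

// Sketch of fgets-based line counting. fgets reads at most one line
// per call; a line longer than the buffer is split across calls and
// would be counted more than once, so the buffer is sized generously.
long count_lines_fgets(std::FILE* f) {
    char buf[65536];
    long count = 0;
    while (std::fgets(buf, sizeof(buf), f) != nullptr)
        ++count;
    return count;
}
```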
I am way late to this game, but I thought I'd put my two cents in:

The Python line:

does NOT read data from the stream. It merely counts the number of lines that the stream encounters -- nothing more.

Peter Mortensen was on to something with his dtruss analysis: https://stackoverflow.com/a/9657502/1043530

Note the

that is not done with the Python. A Python interpreter is just like any other program: if it does I/O, it will show up via strace or dtruss.

Redo this 'super fast' Python program with actual reads and I believe you'll see some changes in the dtruss output.

Yeah, I know I'm late... but this one caught my eye because of how folks are talking about disabling this and that, fgets vs. whatever... If program 'A' does not do disk I/O but program 'B' does, in many/most/all cases that explains why program 'A' is faster.

TLDR: The false equivalence between the programs, as proven by Peter Mortensen's dtruss run, is why his post should be marked as the answer.

-Mark