Why is splitting a string slower in C++ than in Python?
I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).
Python Code:
#!/usr/bin/env python
from __future__ import print_function
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count / delta_sec)
    print(" Crunch Speed: {0}".format(lps))
else:
    print('')
C++ Code:
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>
using namespace std;
void split1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}
void split2(vector<string> &tokens, const string &str, char delim = ' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while (getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}
int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);
    cin.sync_with_stdio(false); //disable synchronous IO

    while (cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
        // split1(spline, input_line);
        split2(spline, input_line);

        count++;
    }
    count--; //subtract for final over-read

    sec = (int) time(NULL) - start;
    cerr << "C++ : Saw " << count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = count / sec;
        cerr << " Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp
Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple delimiters as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimiter character (that one was posted by several StackOverflow users in answers to string-splitting questions).
I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"
Results:
$ /usr/bin/time cat test_lines_double | ./split.py
15.61 real 0.01 user 0.38 sys
Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
23.50 real 0.01 user 0.46 sys
C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
44.69 real 0.02 user 0.62 sys
C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444
What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?
Edit 1 / Partial Solution?:
I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:
for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1
The performance of python is now about the same as the split1 C++ implementation.
/usr/bin/time cat test_lines_double | ./split5.py
22.61 real 0.01 user 0.40 sys
Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090
I'm still surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), these C++ implementations are not faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string-splitting speed.)
Thanks to all for your help.
Final Edit/Solution:
Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):
$ /usr/bin/time cat test_lines_double | ./split6
15.09 real 0.01 user 0.45 sys
C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333
My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.
One of the lessons from this issue and yesterday's stdin line-reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.
Thanks again to all for your suggestions!
As a guess, Python strings are reference-counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity. If the goal is fast splitting, then one would use constant-time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code (and on my machine this is of course faster than Python, since Python's string handling is implemented in C, which is a subset of C++ (he he)):

Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, only checked the speed. But I think, even if there is a bug or two, correcting it won't significantly affect the speed.
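The accepted answer's code block did not survive here. As a minimal sketch of the zero-copy idea the answer describes (not Alf's exact code), tokens can be represented as (pointer, length) views into the original line, so splitting makes no per-token std::string allocations; the StringRef name and interface below are this sketch's own invention:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A token is a view into the source string: no copy is made at split time.
// Caution: a StringRef is only valid while the source string is alive.
struct StringRef {
    const char* begin;
    std::size_t size;
    StringRef(const char* b, std::size_t s) : begin(b), size(s) {}
    std::string str() const { return std::string(begin, size); } // copy on demand
};

// Split on runs of a single delimiter character (Python-split-like merging).
std::vector<StringRef> split_refs(const std::string& str, char delim = ' ') {
    std::vector<StringRef> result;
    const char* p = str.data();
    const char* end = p + str.size();
    while (p != end) {
        while (p != end && *p == delim) ++p;   // skip a run of delimiters
        const char* token_start = p;
        while (p != end && *p != delim) ++p;   // scan to end of token
        if (p != token_start)
            result.push_back(StringRef(token_start, p - token_start));
    }
    return result;
}
```

The per-line cost is a single linear scan plus vector pushes; converting a token to std::string is deferred until (and unless) the caller actually needs an owned copy.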
I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.

Using strtok_r (the reentrant variant of strtok):

Additionally using character strings for parameters, and fgets for input:

And, in some cases, where destroying the input string is acceptable:

The timings for these are as follows (including my results for the other variants from the question and the accepted answer):

As we can see, the solution from the accepted answer is still the fastest.

For anyone who would want to do further tests, I also put up a GitHub repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.
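This answer's code listings were also lost in extraction. A rough sketch of the strtok_r variant it mentions (not tobbez's exact code) could look like the following; note that strtok_r is a POSIX function, and since it mutates its input buffer, the sketch works on a writable copy of the line:

```cpp
#include <cassert>
#include <cstring>   // strtok_r (POSIX)
#include <string>
#include <vector>

// Split using strtok_r; adjacent delimiters are merged, like Python's split().
// strtok_r writes NUL bytes into its argument, so we copy the input first.
std::vector<std::string> split_strtok(const std::string& str,
                                      const char* delims = " ") {
    std::vector<std::string> tokens;
    std::vector<char> buf(str.begin(), str.end());
    buf.push_back('\0');                      // strtok_r needs a C string
    char* saveptr = nullptr;                  // reentrancy state
    for (char* tok = strtok_r(buf.data(), delims, &saveptr);
         tok != nullptr;
         tok = strtok_r(nullptr, delims, &saveptr)) {
        tokens.push_back(tok);
    }
    return tokens;
}
```

Unlike plain strtok, strtok_r keeps its scan position in the caller-supplied saveptr, which is what makes it thread-safe.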
I suspect that this is because of the way std::vector gets resized during a push_back() call. If you try using std::list, or std::vector::reserve() to reserve enough space for the sentences, you should get much better performance. Or you could use a combination of both, like below for split1():

EDIT: The other obvious thing I see is that the Python variable dummy gets assigned each time but not modified. So it's not a fair comparison against C++. You should try modifying your Python code to be dummy = [] to initialize it and then do dummy += line.split(). Can you report the runtime after this?

EDIT2: To make it even more fair, can you modify the while loop in the C++ code to be:
I think the following code is better, using some C++17 and C++14 features:

The choice of container:

std::vector. Assuming the initial size of the allocated internal array is 1 and the ultimate size is N, you will allocate and deallocate log2(N) times, and you will copy elements (2 ^ (log2(N) + 1) - 1) = (2N - 1) times. As pointed out in "Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?", this can perform poorly when the size of the vector is unpredictable and could be very large. But if you can estimate its size, this is less of a problem.

std::list. For every push_back, the time consumed is a constant, but an individual push_back will probably take more time than on std::vector. Using a per-thread memory pool and a custom allocator can ease this problem.

std::forward_list. Same as std::list, but occupies less memory per element. Requires a wrapper class to work, due to the lack of a push_back API.

std::array. If you know the limit of growth, you can use std::array. Of course, you can't use it directly, since it doesn't have a push_back API. But you can define a wrapper, and I think it's the fastest way here and can save some memory if your estimate is quite accurate.

std::deque. This option allows you to trade memory for performance. There will be no (2 ^ (N + 1) - 1) copies of elements, just N allocations and no deallocations. Also, you'll have constant random-access time and the ability to add new elements at both ends.

According to std::deque - cppreference

Or you can use a combination of these:

std::vector< std::array<T, 2 ^ M> >. This is similar to std::deque; the difference is that this container doesn't support adding elements at the front. But it is still faster, because it won't copy the underlying std::array (2 ^ (N + 1) - 1) times; it will only copy the pointer array (2 ^ (N - M + 1) - 1) times, and it allocates a new array only when the current one is full, without needing to deallocate anything. By the way, you get constant random-access time.

std::list< std::array<T, ...> >. Greatly eases the pressure of memory fragmentation. It will only allocate a new array when the current one is full, and does not need to copy anything. You still have to pay the price of an additional pointer compared to combo 1.

std::forward_list< std::array<T, ...> >. Same as combo 2, but costs the same memory as combo 1.
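The code this answer opens with ("the following code") is not preserved above. As a hedged sketch of what a C++17-flavored splitter might look like (this is a reconstruction, not the answer's actual code), std::string_view lets tokens be non-owning slices of the input:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>   // C++17
#include <vector>

// Tokens are string_view slices of the input: no per-token allocation.
// Caution: the views are only valid while the underlying string is alive.
std::vector<std::string_view> split_sv(std::string_view str, char delim = ' ') {
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < str.size()) {
        while (pos < str.size() && str[pos] == delim) ++pos;  // merge delimiter runs
        std::size_t start = pos;
        while (pos < str.size() && str[pos] != delim) ++pos;
        if (pos > start) tokens.push_back(str.substr(start, pos - start));
    }
    return tokens;
}
```

Combined with any of the container choices discussed above (e.g. reserving the vector, or a deque), this keeps the splitting itself at one linear scan with no string copies.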
You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?
If you take the split1 implementation and change the signature to more closely match that of split2, by changing this:

to this:

you get a more dramatic difference between split1 and split2, and a fairer comparison:
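The before/after snippets of this answer were lost in extraction. A hypothetical reconstruction of the kind of change it describes (this is a guess at the missing diff, not the answer's actual code): give split1 a single-char delimiter parameter like split2's, so the find_first_of / find_first_not_of calls search for one character instead of scanning a delimiter string:

```cpp
#include <cassert>
#include <string>
#include <vector>

// split1 with `char delim` instead of `const string &delimiters`, so the
// std::string searches use the single-character overloads.
void split1_char(std::vector<std::string>& tokens, const std::string& str,
                 char delim = ' ') {
    std::string::size_type lastPos = str.find_first_not_of(delim, 0);
    std::string::size_type pos = str.find_first_of(delim, lastPos);
    while (pos != std::string::npos || lastPos != std::string::npos) {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delim, pos);
        pos = str.find_first_of(delim, lastPos);
    }
}
```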
I suspect that this is related to buffering on sys.stdin in Python, but no buffering in the C++ implementation.
See this post for details on how to change the buffer size, then try the comparison again:
Setting smaller buffer size for sys.stdin?