C++ text file reading performance
I'm trying to migrate a C# program to C++.
The C# program reads a 1 to 5 GB text file line by line and does some analysis on each line.
The C# code looks like this:
using (var f = File.OpenRead(fname))
using (var reader = new StreamReader(f))
    while (!reader.EndOfStream) {
        var line = reader.ReadLine();
        // do some analysis
    }
For a given 1.6 GB file with 7 million lines, this code takes about 18 seconds.
The first C++ code I wrote for the migration looks like this:
ifstream f(fname);
string line;
while (getline(f, line)) {
    // do some analysis
}
The C++ code above takes about 420 seconds. The second C++ version I wrote looks like this:
ifstream f(fname);
char line[2000];
while (f.getline(line, 2000)) {
    // do some analysis
}
The C++ code above takes about 85 seconds.
The last version I tried is C code, like this:
FILE *file = fopen(fname, "r");
char line[2000];
while (fgets(line, 2000, file) != NULL) {
    // do some analysis
}
fclose(file);
The C code above takes about 33 seconds.
Both of the last two versions, which read each line into a char[] instead of a string, need about 30 more seconds to convert the char[] to a string.
Is there a way to improve the performance of C/C++ code that reads a text file line by line so that it matches the C# performance?
(Added: I'm using Windows 7 64-bit with VC++ 10.0, x64.)
Comments (3)
One of the best ways to increase file reading performance is to use memory-mapped files (mmap() on Unix, CreateFileMapping() etc. on Windows). Then your file appears in memory as one flat chunk of bytes, and you can read it much faster than with buffered I/O. For a file larger than a gigabyte or so, you will want to be using a 64-bit OS (with a 64-bit process). I've done this to process a 30 GB file in Python with excellent results.
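The memory-mapping idea above might look roughly like the following on a POSIX system. This is only a sketch under stated assumptions: the helper name for_each_line_mmap and its callback signature are made up for illustration, and on Windows the same structure would use CreateFileMapping() together with MapViewOfFile() instead of mmap().

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not from the original answer): maps the whole file
// read-only and calls on_line(ptr, len) once per line, without copying.
// Returns the number of lines seen. POSIX only.
template <typename LineFn>
std::size_t for_each_line_mmap(const char* path, LineFn on_line) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        throw std::runtime_error("fstat failed");
    }
    std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) {  // mmap rejects zero-length mappings
        close(fd);
        return 0;
    }

    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) throw std::runtime_error("mmap failed");

    const char* p = static_cast<const char*>(map);
    const char* end = p + size;
    std::size_t lines = 0;
    while (p < end) {
        // Find the end of the current line inside the mapped bytes.
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        const char* line_end = nl ? nl : end;
        on_line(p, static_cast<std::size_t>(line_end - p));
        ++lines;
        p = nl ? nl + 1 : end;
    }

    // Note: if on_line throws, this sketch leaks the mapping.
    munmap(map, size);
    return lines;
}
```

The callback receives a pointer and a length rather than a std::string, so the analysis can run directly on the mapped bytes and the char[] to string conversion mentioned in the question is avoided unless a line really needs to be copied.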
I suggest two things:
Use f.rdbuf()->pubsetbuf(...) to set a bigger read buffer. I've noticed some really significant increases in fstream performance when using larger buffer sizes.
Instead of getline(...), use read(...) to read larger blocks of data and parse them manually.

Compile with optimizations. C++ has quite some theoretical overhead that the optimizer will remove. E.g. many simple string methods will be inlined. That's probably why your char[2000] version is faster.
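The two suggestions above about a larger stream buffer and block reads with manual parsing could be combined along these lines. This is a minimal sketch, not the answerer's code: the helper name for_each_line_blocked, the callback signature, and the 1 MiB default block size are all assumptions. The effect of pubsetbuf() is implementation-defined; this sketch calls it before open(), which is what libstdc++ expects, and other standard libraries may need it at a different point or may ignore it.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): reads the file in
// large blocks with read() and splits lines by hand. Calls
// on_line(ptr, len) once per line and returns the number of lines.
template <typename LineFn>
std::size_t for_each_line_blocked(const char* path, LineFn on_line,
                                  std::size_t block_size = 1 << 20) {
    std::vector<char> io_buf(block_size);  // must outlive the stream
    std::ifstream f;
    // Assumption: setting the buffer before open() takes effect here;
    // this is implementation-defined behaviour.
    f.rdbuf()->pubsetbuf(io_buf.data(),
                         static_cast<std::streamsize>(io_buf.size()));
    f.open(path, std::ios::binary);
    if (!f) return 0;

    std::vector<char> block(block_size);
    std::string carry;  // partial line left over from the previous block
    std::size_t lines = 0;

    while (f.read(block.data(), static_cast<std::streamsize>(block.size())) ||
           f.gcount() > 0) {
        const char* p = block.data();
        const char* end = p + f.gcount();
        while (p < end) {
            const char* nl = static_cast<const char*>(
                std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
            if (!nl) {  // line continues in the next block
                carry.append(p, end);
                break;
            }
            if (carry.empty()) {
                on_line(p, static_cast<std::size_t>(nl - p));
            } else {  // finish a line that started in an earlier block
                carry.append(p, nl);
                on_line(carry.data(), carry.size());
                carry.clear();
            }
            ++lines;
            p = nl + 1;
        }
    }
    if (!carry.empty()) {  // last line without a trailing newline
        on_line(carry.data(), carry.size());
        ++lines;
    }
    return lines;
}
```

Because the file is opened in binary mode, lines from a Windows text file will still end with '\r', so the callback may need to drop that character. Any timing comparison against the versions in the question should use an optimized build, in line with the last comment above.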