How can I speed up loading 15M integers from a file stream?

Posted on 2024-09-13 04:41:50

I have an array of precomputed integers with a fixed size of 15M values. I need to load these values at program start. Currently loading takes up to 2 minutes, and the file size is ~130MB. Is there any way to speed up the loading? I'm free to change the save process as well.

std::array<int, 15000000> keys;

std::string config = "config.dat";

// how array is saved
std::ofstream out(config.c_str());
std::copy(keys.cbegin(), keys.cend(),
  std::ostream_iterator<int>(out, "\n"));

// load of array
std::ifstream in(config.c_str());
std::copy(std::istream_iterator<int>(in),
  std::istream_iterator<int>(), keys.begin());
in.close();

Thanks in advance.

SOLVED. Used the approach proposed in the accepted answer. Now it takes just a blink.

Thanks all for your insights.

Comments (8)

挽袖吟 2024-09-20 04:41:50

You have two issues regarding the speed of your write and read operations.

First, std::copy cannot do a block-copy optimization when writing to an output_iterator because it doesn't have direct access to the underlying target.

Second, you're writing the integers out as ASCII rather than binary, so for each iteration of the write the output_iterator creates an ASCII representation of your int, and on read the text has to be parsed back into integers. I believe this is the brunt of your performance issue.

The raw storage of your array (assuming a 4-byte int) should only be 60MB, but since each character of an integer in ASCII is 1 byte, any int with more than 4 characters will be larger than its binary storage, hence your 130MB file.

There is no easy way to solve your speed problem portably (so that the file can be read on machines with a different endianness or int size) or while using std::copy. The easiest way is to just dump the whole array to disk and read it all back with fstream write and read, just remember that it's not strictly portable.

To write:

std::ofstream out(config.c_str(), std::ios::binary);
out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );

And to read:

std::ifstream in(config.c_str(), std::ios::binary);
in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );

----Update----

If you are really concerned about portability, you could easily ship a portable format (like your initial ASCII version) in your distribution artifacts; then, when the program first runs, it can convert that portable format into a locally optimized version for use during subsequent executions.

Something like this perhaps:

std::array<int, 15000000> keys;

// data.txt holds the ascii values and data.bin is the binary version;
// file_exists() is a placeholder for your own existence check
if(!file_exists("data.bin")) {
    std::ifstream in("data.txt");
    std::copy(std::istream_iterator<int>(in),
         std::istream_iterator<int>(), keys.begin());
    in.close();

    std::ofstream out("data.bin", std::ios::binary);
    out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
} else {
    std::ifstream in("data.bin", std::ios::binary);
    in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
}

If you have an install process, this preprocessing could also be done at that time...

朮生 2024-09-20 04:41:50

Attention. Reality check ahead:

Reading integers from a large text file is an IO-bound operation unless you're doing something completely wrong (like using C++ streams for this). Loading 15M integers from a text file takes less than 2 seconds on an AMD64@3GHz when the file is already buffered (and only a bit longer if it has to be fetched from a sufficiently fast disk). Here's a quick & dirty routine to prove my point (which is why I don't check for all possible errors in the integer format, nor close my file at the end, since I exit() anyway).

$ wc nums.txt
 15000000  15000000 156979060 nums.txt

$ head -n 5 nums.txt
730547560
-226810937
607950954
640895092
884005970

$ g++ -O2 read.cc
$ time ./a.out <nums.txt
=>1752547657

real    0m1.781s
user    0m1.651s
sys     0m0.114s

$ cat read.cc 
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <vector>

int main()
{
        int c;          // must be int, not char: getchar() returns EOF as an int
        int num=0;
        int pos=1;
        int line=1;
        std::vector<int> res;
        while(c=getchar(),c!=EOF)
        {
                if (c>='0' && c<='9')
                        num=num*10+c-'0';
                else if (c=='-') 
                        pos=0;
                else if (c=='\n')
                {
                        res.push_back(pos?num:-num);
                        num=0;
                        pos=1;
                        line++;
                }
                else
                {
                        printf("I've got a problem with this file at line %d\n",line);
                        exit(1);
                }
        }
        // make sure the optimizer does not throw the vector away; also a check.
        unsigned sum=0;
        for (size_t i=0;i<res.size();i++)
        {
                sum=sum+(unsigned)res[i];
        }
        printf("=>%u\n",sum);
}

UPDATE: and here's my result when reading the text file (not binary) using mmap:

$ g++ -O2 mread.cc
$ time ./a.out nums.txt
=>1752547657

real    0m0.559s
user    0m0.478s
sys     0m0.081s

The code is on pastebin:
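
As a rough idea, a minimal mmap-based text parser might look like the sketch below (POSIX-only, with the same quick & dirty error handling as read.cc above; this is not necessarily the original pastebin code):

// mread.cc -- parse newline-separated integers out of an mmap'ed file.
// POSIX-only sketch; error handling kept minimal like read.cc above.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <vector>

int main(int argc, char **argv)
{
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        const char *p = (const char *)mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        const char *end = p + st.st_size;

        std::vector<int> res;
        res.reserve(15000000);
        while (p < end)
        {
                int num = 0;
                int pos = 1;
                if (*p == '-') { pos = 0; ++p; }
                while (p < end && *p >= '0' && *p <= '9')
                        num = num * 10 + (*p++ - '0');
                res.push_back(pos ? num : -num);
                ++p;    // skip the '\n'
        }

        // same checksum as read.cc so the two results can be compared
        unsigned sum = 0;
        for (size_t i = 0; i < res.size(); i++)
                sum += (unsigned)res[i];
        printf("=>%u\n", sum);
}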

What do I suggest

1-2 seconds is a realistic lower bound for loading this data on a typical desktop machine. 2 minutes sounds more like a 60 MHz microcontroller reading from a cheap SD card. So either you have an undetected/unmentioned hardware condition, or your C++ stream implementation is somehow broken or unusable. I suggest establishing a lower bound for this task on your machine by running my sample code.

调妓 2024-09-20 04:41:50

If the integers are saved in binary format and you're not concerned with endianness problems, try reading the entire file into memory at once (fread) and casting the pointer to int *.
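
A minimal sketch of that idea, assuming the file was written as raw int values on the same platform (the file name is just an example carried over from the question):

// Read the whole binary file with a single fread() and view it as ints.
#include <stdio.h>
#include <stdlib.h>

int main()
{
        FILE *f = fopen("config.dat", "rb");
        if (!f) { perror("fopen"); return 1; }

        fseek(f, 0, SEEK_END);
        long size = ftell(f);           // file size in bytes
        rewind(f);

        char *buf = (char *)malloc(size);
        fread(buf, 1, size, f);         // one bulk read
        fclose(f);

        int *keys = (int *)buf;         // reinterpret the raw bytes
        size_t count = size / sizeof(int);
        printf("loaded %zu ints, first = %d\n", count, count ? keys[0] : 0);
        free(buf);
        return 0;
}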

命比纸薄 2024-09-20 04:41:50

You could precompile the array into a .o file, which wouldn't need to be recompiled unless the data changes.

thedata.hpp:

static const int NUM_ENTRIES = 5;
extern int thedata[NUM_ENTRIES];

thedata.cpp:

#include "thedata.hpp"
int thedata[NUM_ENTRIES] = {
10
,200
,3000
,40000
,500000
};

To compile this:

# make thedata.o

Then your main application would look something like:

#include <iostream>
#include "thedata.hpp"
using namespace std;
int main() {
  for (int i=0; i<NUM_ENTRIES; i++) {
    cout << thedata[i] << endl;
  }
}

Assuming the data doesn't change often and you can process the data to create thedata.cpp, this is effectively instant load time. I don't know whether the compiler would choke on such a large literal array, though!

转瞬即逝 2024-09-20 04:41:50

Save the file in a binary format.

Write the file by taking a pointer to the start of your int array and converting it to a char pointer. Then write the 15000000*sizeof(int) chars to the file.

And when you read the file, do the same in reverse: read the file as a sequence of chars, take a pointer to the beginning of the sequence, and convert it to an int*.

Of course, this assumes that endianness isn't an issue.

For actually reading and writing the file, memory mapping is probably the most sensible approach.

素染倾城色 2024-09-20 04:41:50

If the numbers never change, preprocess the file into a C++ source and compile it into the application.

If the numbers can change, and you thus have to keep them in a separate file that is loaded on startup, then avoid reading them number by number with C++ IO streams. C++ IO streams are a nice abstraction, but there is too much of it for such a simple task as loading a bunch of numbers fast. In my experience, a huge part of the run time is spent parsing the numbers, and another part in accessing the file char by char.

(Assuming your file is more than a single long line.) Read the file line by line using std::getline(), and parse the numbers out of each line using std::strtol() rather than streams. This avoids a huge part of the overhead. You can get more speed out of the streams by crafting your own variant of std::getline() that reads the input ahead (using istream::read()); the standard std::getline() also reads input char by char.
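
A minimal sketch of the std::getline() + std::strtol() approach (the file name is an assumption carried over from the question):

// Read the text file line by line and parse each line with strtol()
// instead of going through the stream extraction operators.
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main()
{
        std::ifstream in("config.dat");
        std::vector<int> keys;
        keys.reserve(15000000);

        std::string line;
        while (std::getline(in, line))
                keys.push_back((int)std::strtol(line.c_str(), nullptr, 10));
        return 0;
}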

阳光①夏 2024-09-20 04:41:50

Use a buffer of 1000 integers (or even 15M; you can modify this size as you please) instead of reading integer after integer. Not using a buffer is clearly the problem, in my opinion.
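
A minimal sketch of chunked reading with a fixed-size buffer; this assumes the file was saved in binary (the answer doesn't specify a format), and both the buffer size and file name are arbitrary:

// Pull the binary file through a 1000-int buffer instead of
// extracting one integer at a time.
#include <fstream>
#include <vector>

int main()
{
        std::ifstream in("config.dat", std::ios::binary);
        std::vector<int> keys;
        keys.reserve(15000000);

        const std::size_t BUF = 1000;   // ints per chunk; tune as you please
        std::vector<int> buf(BUF);
        for (;;)
        {
                in.read(reinterpret_cast<char *>(buf.data()), BUF * sizeof(int));
                std::size_t got = in.gcount() / sizeof(int);
                if (got == 0) break;    // nothing left to read
                keys.insert(keys.end(), buf.begin(), buf.begin() + got);
                if (!in) break;         // short final chunk: done
        }
        return 0;
}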

故事和酒 2024-09-20 04:41:50

If the data in the file is binary and you don't have to worry about endianness, and you're on a system that supports it, use the mmap system call. See this article on IBM's website:

High-performance network programming, Part 2: Speed up processing at both the client and server

Also see this SO post:

When should I use mmap for file access?
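
For a binary dump of the array, a minimal POSIX sketch could look like this (the file name is assumed from the question); the kernel pages the data in on demand, so nothing is copied into a user-space buffer up front:

// Map a binary int dump directly into the address space.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>

int main()
{
        int fd = open("config.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        void *p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        const int *keys = (const int *)p;
        size_t count = st.st_size / sizeof(int);
        printf("mapped %zu ints, first = %d\n", count, count ? keys[0] : 0);

        munmap(p, st.st_size);
        close(fd);
        return 0;
}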
