Reading Large Spreadsheet Files Efficiently in C++



I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64 MB file with around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.

Several things to note:

  • I did reserve enough capacity for the vector, but it did not help much.
  • I also need to create an instance of a certain class for each row, but even when the code just reads in the data without creating any instances, it takes a long time.
  • The file is tab-delimited rather than comma-delimited, but I don't think that matters.

Since some columns in that file are not useful data, I changed the method to keep a private string member that stores a whole row, and then find the positions of the (n-1)th and nth delimiters to extract the useful fields (of course there are many useful columns). By doing so, I avoid some push_back operations and cut the time to a little more than 2 minutes. However, that still seems too long to me.
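
A minimal sketch of that delimiter-scanning idea (the extractField helper, its name, and its signature are illustrative assumptions, not the original code):

#include <string>

// Hypothetical helper: return the nth tab-separated field of a row by
// locating the (n-1)th and nth delimiters, with no per-field push_back.
std::string extractField( const std::string& row, std::size_t n, char delim = '\t' )
{
    std::size_t start = 0;
    for ( std::size_t i = 0; i < n; ++i ) {
        start = row.find( delim, start );
        if ( start == std::string::npos )
            return {};          // fewer than n delimiters in this row
        ++start;                // skip past the delimiter itself
    }
    const std::size_t end = row.find( delim, start );
    return row.substr( start, end == std::string::npos ? std::string::npos : end - start );
}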

Here are my questions:

  1. Is there a way to read such a spreadsheet file more efficiently?

  2. Shall I read the file by buffer instead of line by line? If so, how do I read by buffer and still use the CSVRow class?

  3. I haven't tried boost tokenizer; is that more efficient?

Thank you for your help!


Comments (3)

半葬歌 2024-09-14 01:33:22


It looks like you're being bottlenecked by IO. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse the block you read for records and determine whether the end of the block is a partial record. If it is, copy that partial record out of the block and prepend it to the next block. Repeat until the file is all read. This way, for a 64 MB file you only make 8 IO requests. You can experiment with the block size to determine what gives the best trade-off between performance and memory usage.
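
A minimal sketch of that block-reading loop (the readInBlocks name, the record callback, and the 8 MB default are illustrative assumptions):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read the file in large blocks, carrying any partial trailing record
// over to the front of the next block.
template <typename RecordHandler>
void readInBlocks( const char* path, RecordHandler handleRecord,
                   std::size_t blockSize = 8 * 1024 * 1024 )
{
    std::ifstream in( path, std::ios::binary );
    std::vector<char> block( blockSize );
    std::string carry;                 // partial record from the previous block

    while ( in ) {
        in.read( block.data(), static_cast<std::streamsize>( block.size() ) );
        const std::size_t got = static_cast<std::size_t>( in.gcount() );
        if ( got == 0 ) break;

        std::string chunk = carry;
        chunk.append( block.data(), got );

        std::size_t start = 0, nl;
        while ( ( nl = chunk.find( '\n', start ) ) != std::string::npos ) {
            handleRecord( chunk.substr( start, nl - start ) ); // one complete record
            start = nl + 1;
        }
        carry = chunk.substr( start ); // partial record, if any
    }
    if ( !carry.empty() ) handleRecord( carry ); // file may not end in a newline
}

Usage would be something like readInBlocks( "table.txt", []( const std::string& row ) { /* feed row to the parser */ } );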

み青杉依旧 2024-09-14 01:33:22


If reading the whole data into memory is acceptable (and apparently it is), then I'd do this:

  1. Read the whole file into a std::vector<char> (call it data)
  2. Populate a vector<vector<vector<char>::size_type> > which contains the start positions of all newlines and cells in the data. These positions denote the start/end of each cell

Some code sketch to demonstrate the idea:

vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i );                  // start of the first cell in this row
    while ( i < data.size() && data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i + 1 );      // start of the next cell
        }
        ++i; // advance, or this loop never terminates
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline

Thus, you know the positions of all newlines and all commas, and you have the complete CSV data available as one contiguous memory block. So you can easily extract a cell's text like this:

// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
    // XXX Needs handling for the last cell of a line
    const vector<char>::size_type start = rows[row][col];
    const vector<char>::size_type end = rows[row][col + 1] - 1; // stop before the delimiter
    return string( data.begin() + start, data.begin() + end );
}
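
For completeness, a minimal sketch of step 1, reading the whole file into the data vector (the loadFile name and the file name in the usage line are illustrative assumptions):

#include <fstream>
#include <iterator>
#include <vector>

std::vector<char> data;

void loadFile( const char* path )
{
    std::ifstream in( path, std::ios::binary );
    data.assign( std::istreambuf_iterator<char>( in ),
                 std::istreambuf_iterator<char>() );
}

Usage: loadFile( "table.csv" ); then build rows once and call getCellText( row, col ) as needed.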
眼前雾蒙蒙 2024-09-14 01:33:22


This article should be helpful.

In short:
1. Either use memory-mapped files or read the file in 4 KB blocks to access the data. Memory-mapped files will be faster.
2. Try to avoid using push_back, std::string operations (like +), and similar routines from the STL within the parsing loop. They are nice, but they all use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently allocated dynamically will make your program slower. Try to preallocate all buffers before parsing. Counting all the tokens in order to preallocate memory for them shouldn't be difficult.
3. Use a profiler to identify what causes the slowdown.
4. You may want to avoid iostream's << and >> operators and parse the file yourself.

In general, an efficient C/C++ parser implementation should be able to parse a 20 MB text file within 3 seconds.
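
A minimal sketch of the memory-mapping option from point 1, assuming a POSIX system (on Windows the equivalents are CreateFileMapping/MapViewOfFile):

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close
#include <cstddef>

// Map the whole file read-only; the parser can then scan it as one
// contiguous char buffer without any read() calls or copies.
const char* mapFile( const char* path, std::size_t& length )
{
    int fd = open( path, O_RDONLY );
    if ( fd < 0 ) return nullptr;

    struct stat st;
    if ( fstat( fd, &st ) != 0 ) { close( fd ); return nullptr; }
    length = static_cast<std::size_t>( st.st_size );

    void* p = mmap( nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd ); // the mapping remains valid after closing the descriptor
    return p == MAP_FAILED ? nullptr : static_cast<const char*>( p );
}

// When finished: munmap( const_cast<char*>( ptr ), length );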
