Is it possible to efficiently fetch a subset of rows from a large fixed-width CSV file?
I have an extremely large fixed-width CSV file (1.3 million rows and 80K columns). It's about 230 GB in size. I need to be able to fetch a subset of those rows. I have a vector of row indices that I need. However, I need to now figure out how to traverse such a massive file to get them.
The way I understand it, C++ will go through the file line by line until it hits the newline (or a given delimiter), at which point it'll clear the buffer and then move on to the next line. I have also heard of a seek() function that can go to a given position in a stream. So is it possible to use this function somehow to get the pointer to the correct line number quickly?
I figured that since the program doesn't have to basically run billions of if statements to check for newlines, it might improve the speed if I simply tell the program where to go in the fixed-width file. But I have no idea how to do that.
Let's say that my file has a width of n characters and my line numbers are {l_1, l_2, l_3, ..., l_m} (where l_1 < l_2 < ... < l_m). In that case, I can simply tell the file pointer to go to (l_1 - 1) * n, right? But then for the next line, do I calculate the next jump from the end of line l_1 or from the beginning of the next line? And should I include the newlines when calculating the jumps?
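To make it concrete, here is a rough sketch of what I have in mind (the 24-character width matches the sample lines below; whether I should add 1 for the newline is exactly what I'm unsure about):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

int main() {
    const std::streamoff n = 24;          // fixed line width in characters, excluding the newline
    const std::streamoff stride = n + 1;  // + 1 for '\n' -- is this the right way to count?

    std::ifstream file("data.csv", std::ios::binary);

    std::vector<std::uint64_t> lines = {1, 3, 4};  // the 1-based line numbers I need (l_1 < l_2 < ...)
    std::string record(n, '\0');

    for (std::uint64_t l : lines) {
        file.seekg(static_cast<std::streamoff>(l - 1) * stride);  // absolute offset from the start
        file.read(&record[0], n);                                 // read one fixed-width line
        // ... parse `record` ...
    }
}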
Will this even help improve speed, or am I just misunderstanding something here?
Thanks for taking the time to help
EDIT: The file will look like this:
id0000001,AB,AB,AA,--,BB
id0000002,AA,--,AB,--,BB
id0000003,AA,AA,--,--,BB
id0000004,AB,AB,AA,AB,BB
2 Answers
As I proposed in the comment, you can compress each data field to two bits:
That cuts your file size by a factor of 12, so it'll be ~20 GB. Considering that your processing is likely I/O-bound, you may speed up processing by the same factor of 12.
The resulting file will have a record length of 20,000 bytes, so it will be easy to calculate an offset to any given record. No newline symbols to consider :)
Here is how I build that binary file:
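A sketch of the idea (the particular 2-bit assignment AA=0, AB=1, BB=2, "--"=3, the file names, and the skipping of the id column are placeholders, not necessarily the exact scheme):

#include <fstream>
#include <string>
#include <cstdint>

// Map one two-character field to a 2-bit code. The assignment is arbitrary;
// it only has to be applied consistently when encoding and decoding.
static std::uint8_t encode(const std::string& field) {
    if (field == "AA") return 0;
    if (field == "AB") return 1;
    if (field == "BB") return 2;
    return 3;  // "--" (missing)
}

int main() {
    std::ifstream src("src.csv");                    // the original fixed-width CSV
    std::ofstream dst("dst.bin", std::ios::binary);  // packed output: 4 fields per byte

    std::string line;
    while (std::getline(src, line)) {
        std::uint8_t packed = 0;
        int count = 0;
        // Skip the id column, then walk the ",XX" fields (each 3 characters wide).
        for (std::string::size_type pos = line.find(','); pos != std::string::npos;
             pos = line.find(',', pos + 1)) {
            packed = static_cast<std::uint8_t>((packed << 2) | encode(line.substr(pos + 1, 2)));
            if (++count == 4) {                      // 80,000 fields / 4 = 20,000 bytes per record
                dst.put(static_cast<char>(packed));
                packed = 0;
                count = 0;
            }
        }
        if (count != 0)                              // flush a partial final byte, left-aligned
            dst.put(static_cast<char>(packed << (2 * (4 - count))));
    }
}

Note that the record index in the packed file implicitly equals the line number, so the id column does not need to be stored.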
This can encode about 1,600 records in a second, so the whole file will take ~15 minutes. How long does it take you now to process it?
UPDATE: Added an example of how to read individual records from src. I only managed to make seekg() work in binary mode.
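Something along these lines (a sketch; it assumes the packed file built above, a 20,000-byte record length for 80,000 two-bit fields, and zero-based record indices):

#include <fstream>
#include <vector>
#include <cstdint>

int main() {
    const std::streamoff record_len = 20000;         // 80,000 fields * 2 bits = 20,000 bytes
    std::ifstream src("dst.bin", std::ios::binary);  // binary mode so seekg() works on raw byte offsets

    std::vector<std::uint64_t> wanted = {0, 42, 1299999};  // 0-based record indices to fetch
    std::vector<char> record(record_len);

    for (std::uint64_t r : wanted) {
        src.seekg(static_cast<std::streamoff>(r) * record_len);  // jump straight to the record
        src.read(record.data(), record_len);
        // Unpack field i with:
        //   (static_cast<unsigned char>(record[i / 4]) >> (6 - 2 * (i % 4))) & 0x3
    }
}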
The seek family of functions in the <iostream> classes is generally byte-oriented. You can use them if, and only if, you are absolutely confident that your records (lines, in this case) have a fixed count of bytes; in that case, instead of getline, you can open the file in binary mode and use .read, which reads the specified number of bytes into a byte array of sufficient capacity. But, because the file is storing text after all, if even a single record has a different size, you'll fall out of alignment; if the id field is guaranteed to equal the line number, or at least to be an increasing mapping of it, an educated guess followed by trial and error can help. You should switch to some better database management soon; even a 10 GB single binary file is too large and prone to corruption. You may consider chopping it into much smaller slices (on the order of 100 MB, maybe) so as to minimize the chance of damage propagation. Plus, you will need some redundancy mechanism for recovery/correction.
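For illustration, a minimal sketch of that binary-mode seek-and-read approach, including the kind of id-based alignment check suggested above (the 24-byte line width matches the sample lines in the question, and a lone '\n' terminator is assumed):

#include <cstdio>
#include <fstream>
#include <string>

int main() {
    const std::streamoff width = 24;          // bytes per line in the question's sample, excluding '\n'
    const std::streamoff stride = width + 1;  // assuming a single '\n' terminator, no "\r\n"

    std::ifstream file("data.csv", std::ios::binary);
    std::string record(width, '\0');

    long long wanted = 3;                     // 1-based line to fetch (arbitrary example)
    file.seekg((wanted - 1) * stride);
    file.read(&record[0], width);

    // Sanity check: the id field ("id0000003") should encode the line number.
    // If it does not, some earlier record had a different size and we are out of alignment.
    long long id = std::stoll(record.substr(2, 7));
    if (id != wanted)
        std::fprintf(stderr, "misaligned: expected line %lld, got id %lld\n", wanted, id);
}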