Bucket sorting/counting a very large file

Posted 2025-01-18 04:30:18

I have a very large file (10s of TB) that I want to sort in a coarse manner. Basically, I hash one of the fields in this file, take the last 4 digits of that hash, and append it as a column. This gives me a 4-digit base-16 number associated with each line, which means each line can fit into one of 65536 buckets. I then want to distribute this file across 65536 files, each representing one of these buckets.
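
For concreteness, the key derivation looks roughly like this (MD5 is just an example here; the exact hash doesn't matter for the bucketing, and bucket_key is only a placeholder name):

import hashlib

def bucket_key(field: bytes) -> str:
    # Hash the field and keep the last 4 hex digits of the digest,
    # which yields one of 16**4 = 65536 possible bucket labels.
    return hashlib.md5(field).hexdigest()[-4:]

# bucket_key(b"some field value") -> a 4-character lowercase hex string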

I don't think GNU sort is smart enough to speed this operation up: I can't specify that there are only 65536 possible keys, so my assumption is that it will approach this like any other sort operation.

My current strategy is to open 65536 file handles and go through the file line by line, writing each line to the appropriate file. This exceeds the per-user ulimit on open files, which I know can be raised, but I'm unsure whether this is a good strategy. Has anyone done something like this before?
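
For what it's worth, the soft limit can be raised from inside the script itself, up to the hard limit; roughly like this with Python's resource module (70000 is just an arbitrary target above 65536):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 70000  # anything comfortably above the 65536 bucket files
# Raising the soft limit beyond the hard limit needs privileges or a change
# to the system limits configuration, so cap it at the hard limit.
new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))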

Right now I have a Python script that looks like this:

import time

bucketfilemap = { ... }  # 65536 open file objects, keyed by the 4-hex-digit bucket key
# infile, delim and keyloc are defined elsewhere: the input path, the field
# delimiter, and the index of the column that holds the bucket key.

s = time.time()
with open(infile, 'rb') as inf:
    for line in inf:
        tokens = line.split(delim)
        bucketkey = tokens[keyloc]
        bucketfilemap[bucketkey].write(line)
e = time.time()
print("time total:", (e - s))

In my testing on smaller files it has been slower than I'd like, though it does scale linearly with the size of the file, which is what I want.

Comments (1)

栖竹 2025-01-25 04:30:18

I wrote a C program to do this in place of the Python script. I made an odd discovery along the way, but ultimately it is much faster.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 65536   // number of buckets
#define N2 4      // number of hex digits in the bucket key

// Convert the first N2 lowercase hex digits of str to a bucket index.
// Digits are taken least-significant-nibble first; this is not the usual
// hex reading order, but it is a consistent 1-to-1 map onto 0..N-1,
// which is all the bucketing needs.
int hextoint(const char* str) {
    int t = 0;
    for (int i = 0; i < N2; i++) {
        char c = str[i];
        if (c < 58) {
            t |= ((int)(c - 48)) << (i << 2); // '0'-'9'; each hex digit is 4 bits
        } else {
            t |= ((int)(c - 87)) << (i << 2); // 'a'-'f'
        }
    }
    return t;
}

int bucketsort(char* infilename, char** outfilenames) {
    FILE* buckets[N];
    FILE* infile;
    char lbuf[512];
    struct timespec t0, t1, t2;

    for (int i = 0; i < N; i++) {
        buckets[i] = fopen(outfilenames[i], "wb");
    }

    infile = fopen(infilename, "rb");

    // Time the write phase and the close phase separately.
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (fgets(lbuf, sizeof(lbuf), infile) != NULL) {
        size_t len = strlen(lbuf);
        // The file is formatted so that the hex key always comes at the
        // beginning of the line.
        int hashidx = hextoint(lbuf);
        fwrite(lbuf, 1, len, buckets[hashidx]);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    for (int i = 0; i < N; i++) {
        fclose(buckets[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);

    fclose(infile);

    printf("%f seconds elapsed on write\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    printf("%f seconds elapsed on close\n",
           (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9);

    return 0;
}

int main(int argc, char** argv) {
    char* infilename = argv[1];       // the large input file
    char* outfilesfilename = argv[2]; // file listing the N bucket filenames, one per line
    FILE* outfilefd = fopen(outfilesfilename, "r");
    char lbuf[512];
    char** outfilenames = malloc(sizeof(char*) * N);
    for (int i = 0; i < N; i++) {
        outfilenames[i] = malloc(sizeof(lbuf));
    }

    int i = 0;
    while (i < N && fgets(lbuf, sizeof(lbuf), outfilefd)) {
        size_t len = strlen(lbuf);
        if (len > 0 && lbuf[len - 1] == '\n') {
            len--;                       // strip the trailing newline
        }
        memcpy(outfilenames[i], lbuf, len);
        outfilenames[i][len] = '\0';     // terminate the copied filename
        i++;
    }

    bucketsort(infilename, outfilenames);

    return 0;
}

The odd discovery I made is that fclose can be very slow in C, and the slowdown seems to depend on the number of open file descriptors. Opening the files is quick, and writing to them is also quick. However, when I have 65536 files open, performing 65536 fcloses takes 30-50 seconds. When I change N to 256 (and N2 to 2), it takes barely a tenth of a second.

640MB file
N = 256
1.970000 seconds elapsed on write
0.110000 seconds elapsed on close

N = 65536
4.550000 seconds elapsed on write
36.869999 seconds elapsed on close

It does not matter whether I'm writing 500KB or 640MB; it always takes 30-50 seconds to close the files, whereas Python takes ~0.24 seconds to write out 500KB to 65536 files. This is still better, since the close time appears to be roughly a constant cost rather than one that scales with the size of the file. For example, with the 500KB file:

500KB file
0.440000 seconds elapsed on write
45.889999 seconds elapsed on close