比较多个非常大的 csv 文件

发布于 2024-12-07 12:17:00 字数 930 浏览 1 评论 0原文

我有 n 个 csv 文件，我需要将它们相互比较并随后修改它们。问题是每个 csv 文件大约有 800.000 行。

为了读取 csv 文件，我使用 fgetcsv 并且效果很好。获得一些内存，但最终它足够快。但如果我尝试将数组相互比较，则需要很长时间。

另一个问题是，由于文件数量有 n 个，我必须使用 foreach 来通过 fgetcsv 获取 csv 数据。我最终得到一个超大数组，无法将其与 array_diff 进行比较。所以我需要将它与嵌套的 foreach 循环进行比较，这需要很长时间。

为了更好地理解代码片段：

foreach( $files as $value ) {
    $data[] = $csv->read( $value['path'] );
}

我的 csv 类使用 fgetcsv 将输出添加到数组：

fgetcsv( $this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure )

所有 csv 文件的每个数据都存储在 $data 数组中。这可能是仅使用一个数组的第一个大错误，但我不知道如何在不使用 foreach 的情况下保持文件的灵活性。我尝试使用灵活的变量名称，但我也卡在那里:)

现在我有了这个大数组。通常，如果我尝试将这些值相互比较并查明文件一中的数据是否存在于文件二中，依此类推，我会使用 array_diff 或 array_intersect。但在这种情况下我只有这一个大数组。正如我所说，运行 foreach 需要很长时间。

此外，仅在 3 个文件之后，我就有了一个包含 3 * 800.000 个条目的数组。我猜最近 10 个文件之后我的记忆力就会爆炸。

那么有没有更好的方法使用PHP来比较n个非常大的csv文件呢？

原文

I have n csv files which I need to compare against each other and modify them afterwards.
The Problem is that each csv file has around 800.000 lines.

To read the csv file I use fgetcsv and it works good. Get some memory pikes but in the end it is fast enough. But if I try to compare the array against each other it takes ages.

One other Problem is that I have to use a foreach to get the csv data with fgetcsv because of the n amount of files. I end up with one ultra big array and can't compare it with array_diff. So i need to compare it with nested foreach loops and that take ages.

a code snippet for better understanding:

foreach( $files as $value ) {
    $data[] = $csv->read( $value['path'] );
}

my csv class use fgetcsv to add the output to the array:

fgetcsv( $this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure )

Every data of all the csv files are stored in the $data array. This is probably the first big mistake to use only one array, but I have no clue how to stay flexible with the files without to use an foreach. I tried to use flexible variable names but I stucked there as well :)

Now I have this big array. Normally if I try to compare the values against each other and to find out if the data from file one exists in file two and so on, I use array_diff or array_intersect. But in this case I have only this one big array. And as I said, to run an foreach over it takes ages.

Also after only 3 files I have an array with 3 * 800.000 entries. I guess latest after 10 files my memory will explode.

So is there any better way to use PHP to compare n amount of very large csv files?

分享到QQ

分享到微博