Comparing multiple very large csv files
I have n csv files which I need to compare against each other and modify afterwards. The problem is that each csv file has around 800,000 lines.
To read the csv files I use fgetcsv and it works well. There are some memory spikes, but in the end it is fast enough. But if I try to compare the arrays against each other, it takes ages.
Another problem is that, because there are n files, I have to use a foreach to get the csv data with fgetcsv. I end up with one ultra-big array that I can't compare with array_diff, so I need to compare it with nested foreach loops, and that takes ages.
A code snippet for better understanding:
foreach ($files as $value) {
    $data[] = $csv->read($value['path']); // each read() returns the parsed rows of one file
}
My csv class uses fgetcsv to add the output to the array:
fgetcsv( $this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure )
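For context, a hypothetical reconstruction of what such a reader class might look like. The real class was not shown, only its fgetcsv call; the class name, the read() signature, and the default property values are guesses:

// Hypothetical reconstruction for illustration only.
class CsvReader {
    private $_fh;
    private $_lengthToRead = 0;   // 0 = no line-length limit for fgetcsv
    private $_delimiter    = ',';
    private $_enclosure    = '"';

    public function read($path) {
        $this->_fh = fopen($path, 'r');
        $rows = [];
        while (($row = fgetcsv($this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure)) !== false) {
            $rows[] = $row; // every parsed line is appended, so the whole file ends up in memory
        }
        fclose($this->_fh);
        return $rows;
    }
}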
The data from all the csv files is stored in the $data array. This is probably the first big mistake, using only one array, but I have no clue how to stay flexible with the files without using a foreach. I tried to use flexible variable names, but I got stuck there as well :)
Now I have this big array. Normally, if I want to compare values against each other and find out whether the data from file one exists in file two and so on, I use array_diff or array_intersect. But in this case I only have this one big array, and as I said, running a foreach over it takes ages.
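To illustrate the scale of the problem, the nested comparison presumably looks something like the following. This is a reconstruction, not the original code; $data[0] and $data[1] are assumed to hold the rows of the first two files:

// Find rows of file one that also appear in file two.
$matches = [];
foreach ($data[0] as $rowA) {          // up to 800,000 iterations ...
    foreach ($data[1] as $rowB) {      // ... times up to 800,000 each
        if ($rowA === $rowB) {
            $matches[] = $rowA;
            break;
        }
    }
}
// Worst case: 800,000 * 800,000 = 640 billion row comparisons per file pair.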
Also, after only 3 files I have an array with 3 * 800,000 entries. I guess after 10 files at the latest, my memory will explode.
So is there any better way to use PHP to compare n very large csv files?
1 Answer
Use SQL
You did not describe how you compare the n files, and there are several ways to do so. If you just want to find the lines that are in A1 but not in A2, ..., An, then you just have to add a boolean column diff to your table. If you want to know in which files a line is repeated, you'll need a text column, or a new table if a line can be in several files.
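For illustration, such a table might look like this in MySQL. The columns a and b are placeholders for whatever columns the CSV actually has; the index on them matters for the multiple-table update further down:

CREATE TABLE t1 (
    a VARCHAR(255) NOT NULL,
    b VARCHAR(255) NOT NULL,
    diff TINYINT(1) NOT NULL DEFAULT 0,  -- set to 1 when the row is found in another file
    KEY idx_ab (a, b)                    -- speeds up the JOIN in the UPDATE below
);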
Edit: a few words on performance if you're using MySQL (I don't know much about other RDBMSs).
Inserting lines one by one would be too slow. You probably can't use LOAD DATA unless you can put the CSV files directly onto the DB server's filesystem. So I guess the best solution is to read a few hundred lines from the CSV and then send a multiple-row insert query: INSERT INTO mytable VALUES (..1..), (..2..)
You can't issue a SELECT for each line you read in your other files, so you'd better put them in another table as well. Then issue a multiple-table update to mark the rows that are identical in tables t1 and t2: UPDATE t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b) SET t1.diff = 1
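Spelled out with the diff column from above, comparing two files then comes down to two statements (still assuming the placeholder columns a and b):

-- Flag every row of t1 that also appears in t2 ...
UPDATE t1
JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)
SET t1.diff = 1;

-- ... after which the unflagged rows are the ones unique to file one.
SELECT a, b FROM t1 WHERE diff = 0;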
Maybe you could try using SQLite. There are no concurrency problems here, and it could be faster than the client/server model of MySQL. And you don't need to set up much to use SQLite.
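If you try that route, the PHP side barely changes; only the connection string differs. A minimal sketch, assuming the pdo_sqlite extension is enabled and a throwaway database path:

// Single-file database: nothing to install or configure server-side.
$pdo = new PDO('sqlite:/tmp/csvcompare.db'); // hypothetical path
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE IF NOT EXISTS t1 (a TEXT, b TEXT, diff INTEGER NOT NULL DEFAULT 0)');
// Wrapping the batched inserts in a single transaction makes SQLite dramatically faster.
$pdo->beginTransaction();
// ... run the insertBatch() calls from the sketch above ...
$pdo->commit();

One caveat: SQLite does not understand MySQL's UPDATE ... JOIN syntax, so the cross-table update would need to be rewritten as UPDATE ... FROM (supported since SQLite 3.33) or as a correlated subquery.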