Processing a very large CSV file without timeouts or memory errors
At the moment I'm writing an import script for a very big CSV file. The problem is that most of the time it stops after a while because of a timeout, or it throws a memory error.

My idea now was to parse the CSV file in steps of 100 lines and, after every 100 lines, call the script again automatically. I tried to achieve this with header("Location: ...") and passing the current line via GET, but it didn't work out the way I wanted.

Is there a better way to do this, or does someone have an idea how to get rid of the memory error and the timeout?
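A rough sketch of the chunked redirect approach described above (the import.php script name and the offset GET parameter are placeholders, not from the original):

    <?php
    // Process a fixed chunk of rows per request, then redirect to the same
    // script with the next offset.
    $offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
    $chunkSize = 100;

    $handle = fopen('big.csv', 'r');

    // Skip the rows that were already handled in earlier requests.
    for ($i = 0; $i < $offset; $i++) {
        if (fgetcsv($handle) === false) {
            break;
        }
    }

    $processed = 0;
    while ($processed < $chunkSize && ($row = fgetcsv($handle)) !== false) {
        // ... import $row into the database ...
        $processed++;
    }

    fclose($handle);

    if ($processed === $chunkSize) {
        // More rows may remain: hand off to the next request.
        header('Location: import.php?offset=' . ($offset + $processed));
        exit;
    }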
Comments (5)
I've used fgetcsv to read a 120 MB CSV in a stream-wise manner (is that correct English?). It reads line by line, and I then inserted every line into a database. That way only one line is held in memory on each iteration. The script still needed 20 minutes to run. Maybe I'll try Python next time... Don't try to load a huge CSV file into an array, that really would consume a lot of memory.
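A minimal sketch of that streaming approach, assuming PDO and a made-up three-column table:

    <?php
    // Read the CSV one row at a time and insert each row right away,
    // so only the current row is held in memory.
    $handle = fopen('big.csv', 'r');

    $pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO import_table (col1, col2, col3) VALUES (?, ?, ?)');

    while (($row = fgetcsv($handle)) !== false) {
        $stmt->execute($row); // assumes every CSV row has exactly three fields
    }

    fclose($handle);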
I find uploading the file and inserting it using MySQL's LOAD DATA LOCAL query a fast solution, e.g.:
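Something along these lines (table name, columns and file path are placeholders, and local_infile has to be enabled on both the server and the client):

    <?php
    // Let MySQL parse and load the file in one statement instead of
    // inserting row by row from PHP.
    $pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    ]);

    $pdo->exec("
        LOAD DATA LOCAL INFILE '/tmp/big.csv'
        INTO TABLE import_table
        FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
    ");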
If you don't care about how long it takes and how much memory it needs, you can simply increase the values for this script. Just add the following lines to the top of your script:
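A sketch of the kind of settings meant here; the exact values are just examples:

    <?php
    // Raise (or remove) the limits for this one script only.
    ini_set('memory_limit', '512M');   // or '-1' for no memory limit at all
    set_time_limit(0);                 // 0 = no execution time limit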
With the function memory_get_usage() you can find out how much memory your script needs to find a good value for the memory_limit.
You might also want to have a look at fgets() which allows you to read a file line by line. I am not sure if that takes less memory, but I really think this will work. But even in this case you have to increase the max_execution_time to a higher value.
There seems to be an enormous difference between fgetcsv() and fgets() when it comes to memory consumption.

A simple CSV with only one column exceeded my 512M memory limit at just 50,000 records with fgetcsv(), and it took 8 minutes to report that.

With fgets() it took only 3 minutes to successfully process 649,175 records, and my local server wasn't even gasping for additional air.

So my advice is to use fgets() if the number of columns in your CSV is limited. In my case fgets() directly returned the string inside column 1. For more than one column, you might use explode() into a disposable array which you unset() after each record operation.

Thumbed up answer 3. @ndkauboy
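A rough sketch of that fgets()/explode() variant; note that, unlike fgetcsv(), it does not handle quoted fields containing the delimiter:

    <?php
    // Read raw lines with fgets() and split them manually; throw the
    // temporary array away after each record to keep memory usage flat.
    $handle = fopen('big.csv', 'r');

    while (($line = fgets($handle)) !== false) {
        $fields = explode(',', rtrim($line, "\r\n"));
        // ... process / insert $fields here ...
        unset($fields);
    }

    fclose($handle);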
Oh. Just run this script from the CLI instead of via the silly web interface. That way, no execution time limit will affect it.

And do not keep the parsed results around forever; write them out immediately. That way you won't be affected by the memory limit either.
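A small sketch of that, with a hypothetical script name, run as php import.php big.csv; the CLI SAPI has no execution time limit by default:

    <?php
    // Refuse to run through the web server; expect the CSV path as a CLI argument.
    if (PHP_SAPI !== 'cli') {
        exit("Run this from the command line, e.g.: php import.php big.csv\n");
    }

    if ($argc < 2 || !is_readable($argv[1])) {
        fwrite(STDERR, "Usage: php import.php <csv-file>\n");
        exit(1);
    }

    $handle = fopen($argv[1], 'r');
    while (($row = fgetcsv($handle)) !== false) {
        // Write each row out (e.g. INSERT it) immediately instead of
        // collecting everything in an array first.
    }
    fclose($handle);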