Node.js runs out of memory while reading a large file bit by bit
I'm attempting to write a bit of JS that will read a file and write it out to a stream. The deal is that the file is extremely large, and so I have to read it bit by bit. It seems that I shouldn't be running out of memory, but I do. Here's the code:
var size = fs.statSync("tmpfile.tmp").size;
var fp = fs.openSync("tmpfile.tmp", "r");
for(var pos = 0; pos < size; pos += 50000){
    var buf = new Buffer(50000),
        len = fs.readSync(fp, buf, 0, 50000, (function(){
            console.log(pos);
            return pos;
        })());
    data_output.write(buf.toString("utf8", 0, len));
    delete buf;
}
data_output.end();
For some reason it hits 264900000 and then throws FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory. I'd figure that the data_output.write() call would force it to write the data out to data_output, and then discard it from memory, but I could be wrong. Something is causing the data to stay in memory, and I've no idea what it would be. Any help would be greatly appreciated.
4 Answers
I had a very similar problem. I was reading in a very large csv file with 10M lines, and writing out its json equivalent. I saw in the windows task manager that my process was using > 2GB of memory. Eventually I figured out that the output stream was probably slower than the input stream, and that the outstream was buffering a huge amount of data. I was able to fix this by pausing the instream every 100 writes to the outstream, and waiting for the outstream to empty. This gives time for the outstream to catch up with the instream. I don't think it matters for the sake of this discussion, but I was using 'readline' to process the csv file one line at a time.
I also figured out along the way that, instead of writing every line to the outstream, concatenating 100 or so lines and writing them out together also improved the memory situation and made for faster operation.
In the end, I found that I could do the file transfer (csv -> json) using just 70M of memory.
Here's a code snippet for my write function:
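(The snippet itself did not survive extraction; below is a minimal sketch of the pattern described above, assuming a readline-based CSV reader. The stream names, the 100-line batch size, and the placeholder JSON conversion are illustrative, not the original code.)

var fs = require('fs');
var readline = require('readline');

var instream = fs.createReadStream('input.csv');
var outstream = fs.createWriteStream('output.json');
var rl = readline.createInterface({ input: instream });

var lineBuffer = [];   // accumulate ~100 converted lines before each write
var BATCH = 100;

rl.on('line', function (line) {
    lineBuffer.push(JSON.stringify({ row: line }));   // placeholder csv -> json conversion
    if (lineBuffer.length >= BATCH) {
        var flushed = outstream.write(lineBuffer.join('\n') + '\n');
        lineBuffer = [];
        if (!flushed) {
            // the outstream is buffering: pause the instream until it drains
            rl.pause();
            outstream.once('drain', function () { rl.resume(); });
        }
    }
});

rl.on('close', function () {
    if (lineBuffer.length) outstream.write(lineBuffer.join('\n') + '\n');
    outstream.end();
});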
You should be using pipes, such as:
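(The original snippet was lost in extraction; a minimal sketch of what it likely looked like, reusing the asker's data_output stream:)

var fs = require('fs');

var input = fs.createReadStream('tmpfile.tmp');
// pipe() handles backpressure for you and ends data_output when the file is done
input.pipe(data_output);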
For more information, check out: http://nodejs.org/docs/v0.5.10/api/streams.html#stream.pipe
EDIT: the problem in your implementation, btw, is that by doing it in chunks like that, the write buffer isn't going to get flushed, and you're going to read in the entire file before writing much of it back out.
According to the documentation, data_output.write(...) will return true if the string has been flushed, and false if it has not (due to the kernel buffer being full). What kind of stream is this?

Also, I'm (fairly) sure this isn't the problem, but: how come you allocate a new Buffer on each loop iteration? Wouldn't it make more sense to initialize buf before the loop?
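(Not from the original answer; a small illustration of honoring that return value: keep writing while write() returns true, and wait for the 'drain' event when it returns false.)

function writeChunk(stream, chunk, next) {
    if (stream.write(chunk)) {
        process.nextTick(next);       // buffered fine, continue on the next tick
    } else {
        stream.once('drain', next);   // buffer full, wait until it empties
    }
}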
I don't know how the synchronous file functions are implemented, but have you considered using the async ones? That would be more likely to allow garbage collection and I/O flushing to happen. So instead of a for loop, you would trigger the next read in the callback function of the previous read.
Something along these lines (note also that, per other comments, I'm reusing the Buffer):
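(The original snippet was not preserved; a minimal sketch of that approach, reusing the tmpfile.tmp, data_output, and 50000-byte chunk size from the question. The readNext helper name is illustrative.)

var fs = require('fs');

var CHUNK = 50000;
var buf = Buffer.alloc(CHUNK);        // allocated once, reused for every read

fs.stat('tmpfile.tmp', function (err, stat) {
    if (err) throw err;
    fs.open('tmpfile.tmp', 'r', function (err, fd) {
        if (err) throw err;

        function readNext(pos) {
            if (pos >= stat.size) {
                data_output.end();
                fs.close(fd, function () {});
                return;
            }
            fs.read(fd, buf, 0, CHUNK, pos, function (err, bytesRead) {
                if (err) throw err;
                data_output.write(buf.toString('utf8', 0, bytesRead));
                // schedule the next read from this read's callback
                readNext(pos + bytesRead);
            });
        }

        readNext(0);
    });
});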