Reading a file in blocks of lines with PHP
Considering I have a 100GB txt file containing millions of lines of text, how could I read this text file in blocks of lines using PHP?
I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take a long time to finish reading the whole file.
If I use fread($fp, 5030), where '5030' is the length it has to read, could there be a case where it doesn't read a whole line (e.g. stops in the middle of a line) because it has reached the maximum length?
Comments (5)
I don't see why you shouldn't be able to use fgets(). Reading 100GB will take time anyway.
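For reference, a minimal sketch of that fgets() loop might look like the following; the file name is a placeholder, and memory use stays constant no matter how large the file is:

<?php
$fp = fopen('huge.txt', 'rb');   // placeholder path
if ($fp === false) {
    die('Unable to open file');
}
while (($line = fgets($fp)) !== false) {
    // process one line here
}
fclose($fp);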
The fread approach sounds like a reasonable solution. You can detect whether you've reached the end of a line by checking whether the final character in the string is a newline character ('\n'). If it isn't, then you can either read some more characters and append them to your existing string, or you can trim your string back to the last newline and then use fseek to adjust your position in the file.
Side point: are you aware that reading a 100GB file will take a very long time?
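A rough sketch of that idea, assuming the block size is larger than the longest line (the file name and size are placeholders, not the asker's exact values): read a block, cut it back to the last complete line, and fseek() backwards so the partial line is re-read with the next block.

<?php
$fp = fopen('huge.txt', 'rb');
$blockSize = 5030;               // assumed block size

while (!feof($fp)) {
    $block = fread($fp, $blockSize);
    if ($block === false || $block === '') {
        break;
    }
    $lastNewline = strrpos($block, "\n");
    if ($lastNewline !== false && !feof($fp)) {
        // Rewind over the trailing partial line so the next fread()
        // starts again at the beginning of that line.
        fseek($fp, ($lastNewline + 1) - strlen($block), SEEK_CUR);
        $block = substr($block, 0, $lastNewline + 1);
    }
    foreach (explode("\n", rtrim($block, "\n")) as $line) {
        // process one complete line
    }
}
fclose($fp);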
I think you have to use fread($fp, somesize) and check manually whether you have found the end of the line; otherwise, read another chunk.
Hope this helps.
I would recommend implementing the reading of a single line within a function, hiding the implementation details of that specific step from the rest of your code - the processing function must not care how the line was retrieved. You can then implement your first version using fgets() and try other methods if you notice that it is too slow. It could very well be that the initial implementation is too slow, but the point is: you won't know until you've benchmarked.
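As a sketch of that abstraction (the function name and file path are made up for illustration), a generator can hide how lines are produced, so the fgets() call can later be swapped for a buffered fread() reader without touching the processing loop:

<?php
function readLines(string $path): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($fp)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        fclose($fp);
    }
}

foreach (readLines('huge.txt') as $line) {
    // the processing code does not care how the line was retrieved
}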
I know this is an old question, but I think a new answer is valuable for anyone who eventually finds it.
I agree that reading 100GB takes time, which is exactly why we should find the most efficient way to read it so that it takes as little time as possible, rather than thinking "who cares how long it takes if it's already a lot". So let's aim for the lowest time possible.
Another solution:
Cache a chunk of raw data: use fread to read a chunk of that data into a buffer.
Read line by line: read lines from the buffer until you hit the end of the buffer or the end of the data.
Read the next chunk and repeat: take the unprocessed tail of the chunk (the part where you were still looking for a line delimiter) and move it to the front, then read a chunk of your defined size minus the size of that unprocessed data and put it right after the leftover; now you have a new complete chunk.
Repeat the line-by-line reading and this process until the file has been read completely.
You should use a cache chunk larger than any line you expect.
The bigger the cache size, the faster the reading, but the more memory you use.
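One possible sketch of this buffering scheme (the function name, chunk size, and file path are illustrative, and the chunk must be larger than any expected line; for simplicity it reads a full chunk each time and prepends the leftover, rather than reading chunk size minus the leftover):

<?php
function readLinesBuffered(string $path, int $chunkSize = 1048576): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $carry = '';                                  // unprocessed tail of the previous chunk
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $buffer = $carry . $chunk;                // leftover moved to the front, new data after it
        $lastNewline = strrpos($buffer, "\n");
        if ($lastNewline === false) {
            $carry = $buffer;                     // no complete line yet, keep buffering
            continue;
        }
        $carry = substr($buffer, $lastNewline + 1);   // save the partial last line
        foreach (explode("\n", substr($buffer, 0, $lastNewline)) as $line) {
            yield rtrim($line, "\r");             // hand out complete lines only
        }
    }
    if ($carry !== '') {
        yield $carry;                             // final line with no trailing newline
    }
    fclose($fp);
}

foreach (readLinesBuffered('huge.txt') as $line) {
    // process $line
}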