Breaking a large file into many smaller files with PHP
I have a 209MB .txt file with about 95,000 lines that is automatically pushed to my server once a week to update some content on my website. The problem is I cannot allocate enough memory to process such a large file, so I want to break the large file into smaller files with 5,000 lines each.
I cannot use file() at all until the file is broken into smaller pieces, so I have been working with SplFileObject. But I have gotten nowhere with it. Here's some pseudocode of what I want to accomplish:
read the file contents
while there are still lines left to be read in the file
    create a new file
    write the next 5000 lines to this file
    close this file
for each file created
    run mysql update queries with the new content
delete all of the files that were created
The file is in csv format.
EDIT: Here is the solution for reading the file by line, based on the answers below:
function getLine($number) {
    global $handle, $index;
    $offset = $index[$number];           // byte offset where line $number starts
    fseek($handle, $offset);             // jump straight to that line
    return explode("|", fgets($handle)); // split the pipe-delimited fields
}

$handle = @fopen("content.txt", "r");

// Seed the index with 0 so $index[0] is the start of the first line;
// each ftell() after fgets() records where the *next* line begins, so
// without the seed every lookup would be off by one line.
$index = array(0);
while (false !== ($line = fgets($handle))) {
    $index[] = ftell($handle);
}

print_r(getLine(18437));
fclose($handle);
5 Answers
This should work: the big file goes through 5,000 lines per output file, producing files like newfile1.txt, newfile2.txt, etc., and the chunk size can be adjusted via the $i <= 5000 bit in the for loop (see the sketch below). Oh, I see, you want to insert the data from the big file, not store info about the files. Then just use fopen/fgets and insert until feof().
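The answer's code block did not survive in this copy; what follows is a plausible reconstruction rather than the original, with the input name content.txt borrowed from the question:

$handle = fopen('content.txt', 'r');
$count = 1;
while (!feof($handle)) {
    $out = fopen('newfile' . $count . '.txt', 'w');
    // copy up to 5000 lines into the current chunk
    for ($i = 1; $i <= 5000 && false !== ($line = fgets($handle)); $i++) {
        fwrite($out, $line);
    }
    fclose($out);
    $count++;
}
fclose($handle);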
If your big file is in CSV format, I guess that you need to process it line by line and don't actually need to break it into smaller files. There should be no need to hold 5,000 or more lines in memory at once! To do that, simply use PHP's "low-level" file functions:
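The snippet that belonged here was stripped from this copy; below is a minimal sketch of the loop being described, assuming pipe-delimited fields as in the question's EDIT:

$handle = fopen("content.txt", "r");
while (false !== ($line = fgets($handle))) {
    $fields = explode("|", rtrim($line, "\r\n"));
    // ... run the MySQL update for $fields here ...
}
fclose($handle);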
If you need random access, e.g. reading a line by line number, you could create a "line index" for your file:
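Again the original snippet is missing; here is a sketch of the index-building pass, seeded with 0 so the mapping stays aligned with 0-based line numbers:

$handle = fopen("content.txt", "r");

// Each ftell() after fgets() records where the *next* line begins,
// so the seed entry 0 marks the start of the first line.
$index = array(0);
while (false !== ($line = fgets($handle))) {
    $index[] = ftell($handle);
}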
Now $index maps line numbers to byte offsets, and you can navigate to a line using fseek():
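A sketch of the lookup step; the pipe delimiter and the line number 18437 mirror the question's EDIT:

function getLine($number) {
    global $handle, $index;
    fseek($handle, $index[$number]);     // jump to where that line starts
    return explode("|", fgets($handle)); // read and split it
}

print_r(getLine(18437));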
Note that, unlike text editors, I start line counting at 0.
This should do the trick for you. I don't have a very large text file, but I tested with a file 1,300 lines long and it split the file into 3 files:
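The tested code did not survive in this copy; below is a minimal sketch of a splitter along these lines, using the 5,000-line chunk size from the question (the chunkN.txt naming is an assumption):

set_time_limit(0);                   // a 200+ MB file can exceed the default time limit

$chunkSize = 5000;                   // lines per output file
$in = fopen('content.txt', 'r');
$out = null;
$fileNo = 0;
$lineNo = 0;

while (false !== ($line = fgets($in))) {
    if ($lineNo % $chunkSize === 0) {    // start a new chunk every $chunkSize lines
        if ($out) {
            fclose($out);
        }
        $out = fopen('chunk' . (++$fileNo) . '.txt', 'w');
    }
    fwrite($out, $line);
    $lineNo++;
}

if ($out) {
    fclose($out);
}
fclose($in);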
The problem you may now find is that the script's execution time gets too long when you are talking about a 200+ MB file.
You can use fgets to read line by line. You'll need to create a function to put the read contents into a new file. Example:
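The example was lost in this copy; here is a hypothetical sketch of such a function. The name writeChunk, the 5,000-line chunk size, and the newfileN.txt naming are all assumptions:

// Write 5,000 lines starting at $startline to a new file, then recurse
// with the next start line until the source runs out.
function writeChunk($startline, $chunk = 0) {
    $in = fopen('content.txt', 'r');     // input name taken from the question

    // skip ahead to $startline (0-based)
    for ($i = 0; $i < $startline && false !== fgets($in); $i++);

    $out = fopen('newfile' . $chunk . '.txt', 'w');
    $written = 0;
    while ($written < 5000 && false !== ($line = fgets($in))) {
        fwrite($out, $line);
        $written++;
    }
    fclose($out);
    fclose($in);

    if ($written === 5000) {             // there may be more lines left
        writeChunk($startline + 5000, $chunk + 1);
    }
}

writeChunk(0);

Note that re-opening the file and skipping from the top on every call makes this quadratic in the number of lines; keeping one handle open and looping would be cheaper.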
After this, you can call this function recursively, passing the next startline to it on each cycle of reading.

If this is running on a Linux server, simply have PHP execute the following at the command line:
split -l 5000 -a 4 test.txt out
Then glob the results for file names which you can fopen, as in the sketch below.
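A hypothetical sketch of that flow; the pipe-delimited explode and the MySQL placeholder are assumptions carried over from the question:

// split(1) names the pieces outaaaa, outaaab, ... because of -a 4
exec('split -l 5000 -a 4 test.txt out');

foreach (glob('out*') as $chunk) {
    $handle = fopen($chunk, 'r');
    while (false !== ($line = fgets($handle))) {
        $fields = explode('|', rtrim($line, "\r\n"));
        // ... run the MySQL update for $fields here ...
    }
    fclose($handle);
    unlink($chunk);                  // remove the piece once processed
}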
I think your algorithm is awkward; it looks like you're breaking up files for no reason. If you simply fopen the initial data file and read it line by line, you can still perform the MySQL insertion, then just remove the file.