Splitting a large file into many smaller files with PHP

Published 2024-10-12 02:08:21 · 912 characters · 6 views · 0 comments


I have a 209MB .txt file with about 95,000 lines that is automatically pushed to my server once a week to update some content on my website. The problem is I cannot allocate enough memory to process such a large file, so I want to break the large file into smaller files with 5,000 lines each.

I cannot use file() at all until the file is broken into smaller pieces, so I have been working with SplFileObject. But I have gotten nowhere with it. Here's some pseudocode of what I want to accomplish:

read the file contents

while there are still lines left to be read in the file
    create a new file
    write the next 5000 lines to this file
    close this file

for each file created
    run mysql update queries with the new content

delete all of the files that were created

The file is in csv format.
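For reference, the splitting step in the pseudocode above can be sketched with SplFileObject, which the question already mentions. The function name, chunk naming scheme, and 5,000-line default below are placeholders, not part of the original question:

```php
<?php
// Sketch of the split step using SplFileObject. splitFile() and its
// defaults are illustrative names, not from the original question.
function splitFile(string $source, string $prefix, int $linesPerChunk = 5000): array
{
    $input  = new SplFileObject($source, 'r');
    $chunks = [];
    $out    = null;
    $lineNo = 0;

    foreach ($input as $line) {
        if ($line === '' && $input->eof()) {
            break; // SplFileObject yields one empty string after the last line
        }
        if ($lineNo % $linesPerChunk === 0) {
            // Start a new numbered chunk file every $linesPerChunk lines.
            $chunkPath = $prefix . count($chunks) . '.txt';
            $chunks[]  = $chunkPath;
            $out       = new SplFileObject($chunkPath, 'w');
        }
        $out->fwrite($line);
        $lineNo++;
    }

    return $chunks; // run the MySQL updates per chunk, then unlink() each one
}
```

After updating the database from each chunk, the files can be deleted with `unlink()`, matching the last step of the pseudocode.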

EDIT: Here is the solution for reading the file by line given the answers below:

function getLine($number) {
    global $handle, $index;
    $offset = $index[$number];    // byte offset where line $number starts
    fseek($handle, $offset);
    return explode("|", fgets($handle));
}

$handle = fopen("content.txt", "r");

// Seed the index so $index[$n] is the byte offset of (0-based) line $n.
$index = array(0);

while (false !== ($line = fgets($handle))) {
    $index[] = ftell($handle);
}

print_r(getLine(18437));

fclose($handle);


Comments (5)

十年九夏 2024-10-19 02:08:21

// MySQL connection stuff goes here

$handle = fopen('/path/to/bigfile.txt', 'r'); // open the big file
$f = 1; // new file number

while (!feof($handle))
{
    $newfile = fopen('/path/to/newfile' . $f . '.txt', 'w'); // numbered output file
    for ($i = 1; $i <= 5000; $i++) // copy up to 5000 lines
    {
        $import = fgets($handle);
        if ($import === false)
        {break;} // if the file ends, break the loop
        fwrite($newfile, $import);
    }
    fclose($newfile);
    // MySQL insertion for this new file goes here
    $f++; // increment the new-file number
}
fclose($handle);

This should work: the big file is copied 5,000 lines at a time into output files named newfile1.txt, newfile2.txt, and so on. The chunk size can be adjusted via the $i <= 5000 bound in the for loop.

Oh, I see, you want to do insertion on the data from the big file, not store the info about the files. Then just use fopen/fgets and insert until feof.

独夜无伴 2024-10-19 02:08:21


If your big file is in CSV format, I guess you need to process it line by line and don't actually need to break it into smaller files. There should be no need to hold 5,000 or more lines in memory at once! To do that, simply use PHP's "low-level" file functions:

$fp = fopen("path/to/file", "r");

while (false !== ($line = fgets($fp))) {
    // Process $line, e.g. split it into values, since it is CSV.
    $values = explode(",", $line);

    // Do stuff: Run MySQL updates, ...
}

fclose($fp);

If you need random-access, e.g. read a line by line number, you could create a "line index" for your file:

$fp = fopen("path/to/file", "r");

$index = array(0);

while (false !== ($line = fgets($fp))) {
    $index[] = ftell($fp);  // get the current byte offset
}

Now $index maps line numbers to byte offsets and you can navigate to a line by using fseek():

function get_line($number)
{
    global $fp, $index;
    $offset = $index[$number];
    fseek($fp, $offset);
    return fgets($fp);
}

$line10 = get_line(10);

// ... Once you are done:
fclose($fp);

Note that I started line counting at 0, unlike text editors.
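One hedged refinement to the loop above: if any field can contain a quoted comma, fgetcsv() parses the line correctly where explode(",", ...) would split it apart. A minimal sketch; the function name and the $handler callback are illustrative, not from the answer:

```php
<?php
// Streaming CSV loop as above, but with fgetcsv() instead of explode(),
// so quoted fields containing commas are parsed correctly. processCsv()
// and $handler are made-up names; $handler stands in for the MySQL step.
function processCsv(string $path, callable $handler): int
{
    $fp   = fopen($path, 'r');
    $rows = 0;

    while (($values = fgetcsv($fp)) !== false) {
        $handler($values); // e.g. run a MySQL update with this row's fields
        $rows++;
    }

    fclose($fp);
    return $rows;
}
```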

后来的我们 2024-10-19 02:08:21


This should do the trick for you. I don't have a very large text file to hand, but I tested with a 1,300-line file and it split the file into 3 files:

    // Store the line no:
    $i = 0;
    // Store the output file no:
    $file_count = 1;
    // Create a handle for the input file:
    $input_handle = fopen('test.txt', "r") or die("Can't open output file.");
    // Create an output file:
    $output_handle = fopen('test-'.$file_count.'.txt', "w") or die("Can't open output file.");

    // Loop through the file until you get to the end:
    while (!feof($input_handle)) 
    {
        // Read from the file:
        $buffer = fgets($input_handle);
        // Stop if there was nothing left to read:
        if ($buffer === false) { break; }
        // Write the read data from the input file to the output file:
        fwrite($output_handle, $buffer);
        // Increment the line no:
        $i++;
        // If on the 5000th line:
        if ($i==5000)
        {
            // Reset the line no:
            $i=0;
            // Close the output file:
            fclose($output_handle);
            // Increment the output file count:
            $file_count++;
            // Create the next output file:
            $output_handle = fopen('test-'.$file_count.'.txt', "w") or die("Can't open output file.");
        }
    }
    // Close the input file:
    fclose($input_handle);
    // Close the output file:
    fclose($output_handle);

The problem you may now find is that the script's execution time is too long when you are talking about a 200+ MB file.
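One mitigation for that, assuming the script runs under a web SAPI where max_execution_time typically defaults to around 30 seconds (on the CLI the limit already defaults to 0, i.e. unlimited):

```php
<?php
// Lift the execution-time limit for this run; with line-by-line reading the
// memory footprint stays small, so a modest memory_limit is enough.
// (These particular values are illustrative, not from the answer.)
set_time_limit(0);               // 0 = no time limit for this script
ini_set('memory_limit', '64M');  // plenty for streaming a file line by line
```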

起风了 2024-10-19 02:08:21


You can use fgets to read line by line.

You'll need to create a function to put the read contents to a new file. Example:

function load(startLine) {
    read the original file from a point startline
    puts the content into new file
}

After this, you can call this function repeatedly, passing the next startLine to it on each reading cycle.
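To make that sketch concrete, one possible shape is below. The name, signature, and chunk size are made up here, and an iterative caller is assumed rather than recursion, since PHP does not optimize tail calls and a 95,000-line file would mean many nested calls:

```php
<?php
// A hypothetical concrete load(): copy up to $count lines, starting at
// 0-based line $startLine of $source, into $dest. Returns the next start
// line, or null once the source is exhausted, so the caller can loop.
function load(string $source, string $dest, int $startLine, int $count = 5000): ?int
{
    $in = fopen($source, 'r');

    // Skip the lines before $startLine. (A real version would remember the
    // byte offset with ftell()/fseek() instead of re-skipping every call.)
    for ($i = 0; $i < $startLine && fgets($in) !== false; $i++);

    $out = fopen($dest, 'w');
    $written = 0;

    while ($written < $count && ($line = fgets($in)) !== false) {
        fwrite($out, $line);
        $written++;
    }
    fclose($out);

    $more = fgets($in) !== false; // peek: anything left after this chunk?
    fclose($in);

    return $more ? $startLine + $written : null;
}
```

A caller would loop `while (($start = load(...)) !== null)`, processing each piece as it is produced.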

小梨窩很甜 2024-10-19 02:08:21


If this is running on a Linux server, simply have PHP execute the following on the command line:

split -l 5000 -a 4 test.txt out

Then glob the results for file names you can fopen.


I think your algorithm is awkward; it looks like you're breaking up files for no reason.
If you simply fopen the initial data file and read it line by line, you can still perform the MySQL insertion, then just remove the file.
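For completeness, driving that from PHP might look like the sketch below. It assumes a Linux box with coreutils' split(1) on the PATH, and the wrapper function's name is made up:

```php
<?php
// Hypothetical wrapper around the split(1) approach: shell out to split,
// then glob the resulting piece files. Assumes Linux with GNU coreutils.
function splitWithCoreutils(string $source, string $prefix, int $lines = 5000): array
{
    $cmd = sprintf(
        'split -l %d -a 4 %s %s',
        $lines,
        escapeshellarg($source),
        escapeshellarg($prefix)
    );
    exec($cmd, $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("split(1) exited with status $status");
    }
    return glob($prefix . '*'); // e.g. prefixaaaa, prefixaaab, ... to fopen()
}
```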
