Crunching a lot of files to generate statistics files

Posted on 2024-11-04 17:09:18

I have a bunch of files I need to crunch, and I'm worried about scalability and speed.

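The file name and the file data (first line only) are stored in an array in RAM so that some static files can be created later in the script. The files must remain files and can't be put into a database.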

The file names are formatted as follows: Y-M-D-title.ext (where Y is the year, M the month and D the day)
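For illustration, here is a minimal sketch of how a name in this format splits into its parts. The helper name `parse_post_filename`, the array keys and the sample path are made up for this example and are not part of the script.

```php
<?php
// Hypothetical helper: split "Y-M-D-title.ext" into its date parts and title.
function parse_post_filename(string $path): ?array
{
    $name = basename($path);
    // A limit of 4 keeps any hyphens inside the title intact.
    $parts = explode('-', $name, 4);
    if (count($parts) < 4) {
        return null; // the name does not follow the Y-M-D-title.ext pattern
    }
    list($year, $month, $day, $rest) = $parts;
    return array(
        'year'  => $year,
        'month' => $month,
        'day'   => $day,
        'title' => pathinfo($rest, PATHINFO_FILENAME), // strip the extension
    );
}

// Made-up sample path, only to show the call.
$parsed = parse_post_filename('/posts/2024-11-04-my-first-post.ext');
```

Because of the limit passed to explode(), a title that itself contains hyphens stays in one piece.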

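I currently use glob to list all the files and build my array. Here is a sample of the code that builds the array for "year" or "month" (it is used in a function with a single parameter, $period):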

[...]
function create_data_info($period = NULL){
    $data = array();
    $files = glob(ROOT_DIR.'/*.ext');
    if ($files === false) {
        $files = array(); // glob() returns false on error
    }
    $size = count($files);
    $existing_title = array(); // Titles seen so far, so the same title on two different dates can be handled.

    if (isset($period)){
        if ( "year" === $period ){
            for ($i = 0; $i < $size; $i++) {
                // $existing_title is passed by reference and updated by extract_info()
                $info = extract_info($files[$i], $existing_title);
                // Group the entries by year
                $data[(int)$info[5]][] = $info;
            }
        }elseif ( "month" === $period ){
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                // Group the entries by year followed by month
                $key = $info[5].$info[6];
                $data[(int)$key][] = $info;
            }
        }
    }
    [...]
}

function extract_info($file, &$existing){
    $full_path_file = $file;
    $file = basename($file);

    // Split "Y-M-D-title.ext" into year, month, day and "title.ext"
    $info_file = explode("-", $file, 4);

    $info = array();
    $filetitle = explode(".", $info_file[3]);
    $info[0] = $filetitle[0]; //title without the extension

    //Count how many times this title has been seen so far
    if (!isset($existing[$info[0]])) {
        $existing[$info[0]] = -1;
    }
    $existing[$info[0]] += 1;
    if ($existing[$info[0]] > 0) {
        //We have already found a post with this title
        //the creation of the cache is based on info[4] data for the filename
        //so we need to tune it
        $info[0] = $info[0]."-".$existing[$info[0]];
    }

    $info[1] = $info_file[3]; //title with the extension
    $info[2] = $full_path_file;
    $post_content = file($full_path_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    //first non-empty line of the file (empty string if the file is empty or unreadable)
    $info[3] = !empty($post_content) ? $post_content[0] : '';
    unset($post_content);

    $info[4] = filemtime($full_path_file); //last modification time

    $info[5] = $info_file[0]; //year
    $info[6] = $info_file[1]; //month
    $info[7] = $info_file[2]; //day
    return $info;
}
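Since only the first line of each file is needed, file() does more work than necessary: it loads every line into an array. As a rough sketch of an alternative (the helper name `read_first_line` is hypothetical, and its behaviour only approximately matches the FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES flags used above), the file can be read line by line and closed as soon as a non-empty line is found:

```php
<?php
// Hypothetical helper: return the first non-empty line of a file, or null.
function read_first_line(string $path): ?string
{
    $handle = @fopen($path, 'r');
    if ($handle === false) {
        return null; // unreadable or missing file
    }
    $first = null;
    // fgets() reads one line at a time, so we stop as soon as we have what we need.
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line !== '') {
            $first = $line;
            break;
        }
    }
    fclose($handle);
    return $first;
}

// Demo on a throwaway temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'post');
file_put_contents($tmp, "\nFirst line\nSecond line\n");
$first_line = read_first_line($tmp);
unlink($tmp);
```

Whether this is measurably faster depends on how large the files are; for very small files the difference may be negligible.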

So in my script I only call create_data_info(PERIOD) (PERIOD being "year", "month", etc.).

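It returns an array filled with the information I need, which I can then loop over to create my statistics files. This process runs every time the PHP script is launched.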
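To make the shape of that array concrete, here is a small sketch with made-up sample data. The entries and the "posts per period" statistic are assumptions for illustration only; the numeric indexes follow extract_info() (0 = title, 5 = year, 6 = month, 7 = day).

```php
<?php
// Made-up sample in the shape returned for the "year" period:
// period key => list of $info arrays.
$data = array(
    2023 => array(
        array(0 => 'hello', 5 => '2023', 6 => '01', 7 => '15'),
    ),
    2024 => array(
        array(0 => 'foo', 5 => '2024', 6 => '03', 7 => '02'),
        array(0 => 'bar', 5 => '2024', 6 => '11', 7 => '04'),
    ),
);

// Hypothetical statistic: number of posts per period.
$counts = array();
foreach ($data as $period_key => $posts) {
    $counts[$period_key] = count($posts);
}
```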

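My question is: is this code optimal (certainly not), and what can I do to squeeze more performance out of it? I don't know how to cache this (or even whether it's possible), since a lot of I/O is involved.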
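One possible direction, sketched under assumptions rather than tested at scale: keep a small index file that stores each file's modification time and first line, and only reopen files whose mtime has changed. The function name `load_first_lines`, the JSON index file and the `$read_first_line` callable are all hypothetical. The sketch also assumes the first lines are valid UTF-8, since json_encode() requires that.

```php
<?php
// Hypothetical sketch: reuse cached first lines for files whose mtime is unchanged.
// $read_first_line is any callable that returns the first line of a file.
function load_first_lines(array $files, string $cache_file, callable $read_first_line): array
{
    $cache = array();
    if (is_file($cache_file)) {
        $decoded = json_decode((string) file_get_contents($cache_file), true);
        if (is_array($decoded)) {
            $cache = $decoded;
        }
    }

    $fresh = array();
    foreach ($files as $path) {
        $mtime = filemtime($path);
        if (isset($cache[$path]) && $cache[$path]['mtime'] === $mtime) {
            $fresh[$path] = $cache[$path]; // unchanged: no need to open the file
        } else {
            $fresh[$path] = array('mtime' => $mtime, 'first_line' => $read_first_line($path));
        }
    }

    // $fresh is rebuilt from $files, so entries for deleted files drop out.
    if ($fresh !== $cache) {
        file_put_contents($cache_file, json_encode($fresh), LOCK_EX);
    }
    return $fresh;
}

// Demo in a throwaway directory, counting how often a file is actually opened.
$dir = sys_get_temp_dir().'/cache_demo_'.uniqid();
mkdir($dir);
$post = $dir.'/2024-11-04-demo.ext';
file_put_contents($post, "First line\nrest of the post\n");

$reads = 0;
$reader = function (string $path) use (&$reads) {
    $reads++;
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return $lines ? $lines[0] : null;
};

$first_run  = load_first_lines(array($post), $dir.'/index.json', $reader);
$second_run = load_first_lines(array($post), $dir.'/index.json', $reader);

// Clean up the demo files.
unlink($dir.'/index.json');
unlink($post);
rmdir($dir);
```

This still costs one glob() and one filemtime() call per file on every run, so it only removes the cost of opening and reading the files, not the directory scan itself.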

I can switch the directory layout to a tree if that would help compared to a flat structure, but from my tests a flat structure seems to be best.

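I've already thought about writing a small "booster" in C that only does the crunching, but since the work is I/O bound I don't think it would make a huge difference, and the application would be much less compatible for shared-hosting users.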

Thank you very much for your input; I hope I was clear enough. Let me know if you need clarification (and please forgive my English mistakes).

