Crunching a large number of files to generate statistics files

Posted on 2024-11-04 17:09:18

I have a bunch of files I need to crunch, and I'm worried about scalability and speed.

The filename and the file data (only the first line) of each file are stored in an array in RAM, to create some static files later in the script.
The files must remain files and can't be put into a database.

The filenames are formatted as follows:
Y-M-D-title.ext (where Y is the year, M the month and D the day)

I'm currently using glob to list all the files and build my array.
Here is a sample of the code that builds the array for "year" or "month" (it's used in a function with a single parameter, $period):

[...]
function create_data_info($period = NULL) {
    $data = array();
    $files = glob(ROOT_DIR.'/'.'*.ext');
    $size = sizeof($files);
    // Tracks titles already seen, so the same title can appear twice at different dates.
    $existing_title = array();

    if (isset($period)) {
        if ("year" === $period) {
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                // Group the entries by year
                $data[(int)$info[5]][] = $info;
                unset($info);
            }
        } elseif ("month" === $period) {
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                $key = $info[5].$info[6];
                // Group the entries by year and month (e.g. 202411)
                $data[(int)$key][] = $info;
                unset($info);
            }
        }
    }
    [...]
}

function extract_info($file, &$existing) {
    $info = array();
    $full_path_file = $file;
    $file = basename($file);

    // "Y-M-D-title.ext" -> [year, month, day, "title.ext"]
    $info_file = explode("-", $file, 4);

    $filetitle = explode(".", $info_file[3]);
    $info[0] = $filetitle[0]; // title without the extension

    // Count how many times this title has been seen so far (0 for the first one).
    if (!isset($existing[$info[0]])) {
        $existing[$info[0]] = -1;
    }
    $existing[$info[0]] += 1;
    if ($existing[$info[0]] > 0) {
        //We have already found a post with this title
        //the creation of the cache is based on info[4] data for the filename
        //so we need to tune it
        $info[0] = $info[0]."-".$existing[$info[0]];
    }

    $info[1] = $info_file[3];
    $info[2] = $full_path_file;
    $post_content = file(ROOT_DIR.'/'.$file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    // First non-empty line of the file (empty string if the file is empty or unreadable)
    $info[3] = !empty($post_content) ? $post_content[0] : '';
    unset($post_content);

    $info[4] = filemtime(ROOT_DIR.'/'.$file); // last modification time

    $info[5] = $info_file[0]; //year
    $info[6] = $info_file[1]; //month
    $info[7] = $info_file[2]; //day
    return $info;
}

So in my script I only call create_data_info(PERIOD) (PERIOD being "year", "month", etc.).

It returns an array filled with the info I need, and then I can loop through it to create my statistics files.
This process runs every time the PHP script is launched.


My question is: is this code optimal (certainly not), and what can I do to squeeze some more juice out of it?
I don't know how I could cache this (or whether that's even possible), as there is a lot of I/O involved.

I could switch from a flat directory to a tree structure if that would help, but in my tests the flat structure seemed to perform best.

I already thought about writing a little "booster" in C that does only the crunching, but since the work is I/O bound, I don't think it would make a huge difference, and the application would be a lot less compatible for shared hosting users.

Thank you very much for your input. I hope I was clear enough here. Let me know if you need clarification (and forgive my English mistakes).

Comments (1)

情独悲 2024-11-11 17:09:18

To begin with, you should use DirectoryIterator instead of the glob function. When it comes to scandir vs. opendir vs. glob, glob is as slow as it gets.
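As a minimal sketch of that swap, the listing step could look like the following. The helper name `list_files_by_ext` is hypothetical, and whether this is actually faster than glob on a given host is something to benchmark rather than assume.

```php
<?php
// Hypothetical helper (not from the question): list files with a given
// extension using DirectoryIterator instead of glob().
function list_files_by_ext(string $dir, string $ext): array
{
    $files = array();
    foreach (new DirectoryIterator($dir) as $entry) {
        // Skip "." and "..", sub-directories, and anything that is not a regular file.
        if ($entry->isDot() || !$entry->isFile()) {
            continue;
        }
        if ($entry->getExtension() === $ext) {
            $files[] = $entry->getPathname();
        }
    }
    // glob() sorts its result by default; DirectoryIterator gives no ordering
    // guarantee, so sort explicitly if the later code relies on date order.
    sort($files);
    return $files;
}
```

With this in place, `$files = list_files_by_ext(ROOT_DIR, 'ext');` would stand in for the glob call, on the assumption that the rest of the function only needs a list of paths.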

Also, when you are dealing with a large number of files, you should try to do all your processing inside one loop; PHP function calls are rather slow.
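A rough sketch of what "one loop" could mean here: the filename parsing is inlined and the grouping key is chosen inside a single pass, instead of one loop per period plus a helper call per file. The function name `group_by_period` and the associative keys are assumptions for illustration; reading the first line and the duplicate-title handling are left out for brevity.

```php
<?php
// Hypothetical single-pass variant: parse each name and group it in the same loop.
// Assumes file names follow the Y-M-D-title.ext pattern from the question.
function group_by_period(array $files, string $period): array
{
    $data = array();
    foreach ($files as $path) {
        // "2024-11-04-some-title.ext" -> ["2024", "11", "04", "some-title.ext"]
        $parts = explode('-', basename($path), 4);
        if (count($parts) < 4) {
            continue; // Name does not follow Y-M-D-title.ext; skip it.
        }
        list($year, $month, $day, $rest) = $parts;
        // Build the grouping key here instead of duplicating the loop per period.
        $key = ($period === 'month') ? (int) ($year . $month) : (int) $year;
        $data[$key][] = array(
            'title' => pathinfo($rest, PATHINFO_FILENAME),
            'path'  => $path,
            'year'  => $year,
            'month' => $month,
            'day'   => $day,
        );
    }
    return $data;
}
```

Whether removing the function call is measurable next to the file I/O is worth timing on real data before restructuring the code around it.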

I see you are using unset($info); yet on every iteration $info gets a new value anyway. PHP does its own garbage collection, if that's your concern. unset is a language construct, not a function, so it should be pretty fast, but when it isn't needed it still makes the whole thing a bit slower.

You are passing $existing by reference. Is there a practical benefit to this? In my experience, references make things slower.

Lastly, your script seems to do a lot of string processing. You might want to consider some kind of "serialize data and base64 encode/decode" solution, but you should benchmark that specifically; it might be faster or slower depending on your whole code. (My idea is that serialize/unserialize MIGHT run faster, as these are native PHP functions, while custom functions doing string processing are slower.)
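One possible reading of the serialize idea is to store the finished array on disk and rebuild it only when the directory changes. The sketch below assumes that reading; `load_cached_index`, the cache file location, and the `$build` callback are all hypothetical. The base64 step is left out on the assumption that the cache is a plain local file, which can hold the serialized string as-is.

```php
<?php
// Hypothetical cache wrapper: rebuild the index only when the directory changed.
// $build is any callable that takes the directory and returns the data array.
function load_cached_index(string $dir, string $cacheFile, callable $build): array
{
    // A directory's mtime changes when entries are added, removed, or renamed,
    // but NOT when an existing file's content is edited in place.
    clearstatcache();
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($dir)) {
        // The cache holds plain arrays only, so refuse to instantiate objects.
        $cached = unserialize(file_get_contents($cacheFile), array('allowed_classes' => false));
        if (is_array($cached)) {
            return $cached;
        }
    }
    // Cache missing, stale, or unreadable: rebuild and store it.
    $data = $build($dir);
    file_put_contents($cacheFile, serialize($data), LOCK_EX);
    return $data;
}
```

Two caveats with this sketch: the cache file should live outside the scanned directory, since writing it there would itself touch the directory's mtime, and because mtimes usually have one-second resolution, a file added in the same second the cache was written could be missed. Since the first line of each file is part of the cached data, edits to existing files would also need their own check, for example comparing each file's filemtime.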

My answer was not very I/O related, but I hope it was helpful.
