Efficiently parsing Apache logs in PHP
Ok, this is the scenario: I need to parse my logs to find out how many times image thumbnails have been downloaded without the "large image" page actually being viewed.
This is basically a hotlink-protection system based on the ratio of "thumb" to "full" image views.
Considering the server is constantly bombarded with requests for thumbnails, the most efficient solution seems to be buffered Apache logs that flush to disk once every, say, 1 MB, which are then parsed periodically.
My question is this: how do I parse an Apache log in PHP to save the data, with the following being true:
- The log is written to and updated in real time, and I need my PHP script to be able to read it while this is happening
- The PHP script has to "remember" which parts of the log it has already read, so it never reads the same part twice and skews the data
- Memory consumption should be kept to a minimum, since the logs can easily reach 10 GB of data in a few hours
The PHP logger script would be called once every 60 seconds and process as many log lines as it can in that time.
I've tried hacking some code together, but I'm having trouble keeping memory usage to a minimum and finding a way to track the file pointer against a "moving" file size.
Here's a part of the log:
212.180.168.244 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441268.jpg HTTP/1.1" 200 3072 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
122.53.168.123 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441276.jpg HTTP/1.1" 200 3007 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
143.22.203.211 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441282.jpg HTTP/1.1" 200 4670 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
Attaching the code for your review here:
<?php
// Meant to be run from cron once a minute; the time limit is disabled so a
// long batch can finish.
error_reporting(E_ALL);
ini_set('display_errors', 1);
set_time_limit(0);
include(dirname(__FILE__).'/../kframework/kcore.class.php');
$aj = new kajaxpage;
$aj->use_db = 1;
$aj->init();
$db = kdbhandler::getInstance();
$d = kdebug::getInstance();
$d->debug = TRUE;
$d->verbose = TRUE;
$log_file = "/var/log/nginx/access.log"; //full path to log file when run by cron
$pid_file = dirname(__FILE__)."/../kframework/cron/cron_log.pid";
//$images_id = array("8308086", "7485151", "6666231", "8343336");
// The "pid" file stores "<timestamp> <byte offset>" from the previous run, so
// the script can resume where it left off and never count a line twice.
if (file_exists($pid_file)) {
    $pid = file_get_contents($pid_file);
    $temp = explode(" ", $pid);
    $pid_timestamp = $temp[0];
    $now_timestamp = strtotime("now");
    //if (($now_timestamp - $pid_timestamp) < 90) return;
    $pointer = isset($temp[1]) ? (int) $temp[1] : 0;
    // If the log was rotated or truncated since the last run, start over
    if ($pointer > filesize($log_file)) $pointer = 0;
} else {
    $pointer = 0;
}
// Captures: 1 = client IP, 2 = timestamp, 3 = request line, 4 = status, 5 = size
$pattern = "/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[^\[]*\[([^\]]*)\][^\"]*\"([^\"]*)\"\s([0-9]*)\s([0-9]*)(.*)/";
$last_time = 0;
$lines_processed = 0;
if ($fp = fopen($log_file, "r")) { // read-only is enough; "r+" risked accidental writes
    fseek($fp, $pointer); // resume at the saved offset
    while (!feof($fp)) {
        //if ($lines_processed > 100) exit;
        $lines_processed++;
        $log_line = trim(fgets($fp));
        if (!empty($log_line)) {
            // Reset per-line state so a non-matching line can't inherit the
            // previous line's $type/$imgid and skew the counters
            unset($type, $imgid);
            // Skip lines that don't match the expected log format
            if (!preg_match($pattern, $log_line, $m)) continue;
            //print_r($m);
            $size = $m[5];
            $m[3] = str_replace("GET ", "", $m[3]);
            $m[3] = str_replace("HTTP/1.1", "", $m[3]);
            $m[3] = str_replace(".jpg/", ".jpg", $m[3]);
            if (substr($m[3], 0, 3) == "/t/") {
                // end() needs a variable, not a function result, so split this up
                $parts = explode("/", $m[3]);
                $get = explode("-", end($parts));
                $imgid = $get[0];
                $type = 'thumb';
            }
            elseif (substr($m[3], 0, 5) == "/img/") {
                $get1 = explode("/", $m[3]);
                $get2 = explode("-", $get1[2]);
                $imgid = $get2[0];
                $type = 'raw';
            }
            echo $m[3];
            // put here your sql insert or update
            if (isset($type)) {
                $imgid = (int) $imgid; // cast only when a pattern matched
                if ($imgid != 1) {
                    switch ($type) {
                        case 'thumb':
                            //use the second slave in the registry
                            $sql = $db->slave_query("INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1", 2);
                            echo "INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1";
                            break;
                        case 'raw':
                            //use the second slave in the registry
                            $sql = $db->slave_query("INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1", 2);
                            echo "INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1";
                            break;
                    }
                }
            }
            // $imgid - image ID
            // $size - image size
            // Checkpoint the offset every 30 seconds so a crash loses at most
            // half a minute of progress
            $timestamp = strtotime("now");
            if (($timestamp - $last_time) > 30) {
                file_put_contents($pid_file, $timestamp . " " . ftell($fp));
                $last_time = $timestamp;
            }
        }
    }
    // Final checkpoint; the timestamp is backdated so the (commented-out)
    // 90-second overlap guard above won't block the next run
    file_put_contents($pid_file, (strtotime("now") - 95) . " " . ftell($fp));
    fclose($fp);
}
?>
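For completeness, the script could be scheduled with a crontab entry along these lines (the script path here is hypothetical; point it at wherever the cron script actually lives):

* * * * * /usr/bin/php /path/to/kframework/cron/cron_log.php >/dev/null 2>&1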
Comments (4)
Maybe you could tweak my PHP version of tail to search for your last timestamp rather than counting lines, and then read lines from that point, dealing with them one by one?
I'd give it a try myself, as I'm a bit curious, but unfortunately I'm unable to do so right now :(
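The "PHP version of tail" mentioned above isn't included here, but the core idea might look like this minimal sketch (the function name and chunk size are illustrative): seek backwards through the file in fixed-size chunks until the last occurrence of a marker string, such as the last processed timestamp, is found, and return its byte offset so the caller can read forward from there.

<?php
// Minimal sketch: return the byte offset of the LAST occurrence of $needle
// (e.g. the last timestamp already processed) by reading the file backwards
// in chunks, so a multi-GB log never has to be loaded whole.
function find_last_offset($path, $needle, $chunk = 8192) {
    $fp = fopen($path, "r");
    $pos = filesize($path);
    $buffer = "";
    while ($pos > 0) {
        $read = min($chunk, $pos);
        $pos -= $read;
        fseek($fp, $pos);
        // Prepend the new chunk so matches spanning chunk boundaries are caught
        $buffer = fread($fp, $read) . $buffer;
        $hit = strrpos($buffer, $needle);
        if ($hit !== false) {
            fclose($fp);
            return $pos + $hit;
        }
    }
    fclose($fp);
    return 0; // marker not found: start from the beginning
}
?>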
I know this answer is late, but it could still help (code can always be improved).
The 10 GB file size and the memory required sound like your main problems. Apache does support multiple log files, and the real power of multiple log files comes from the ability to create them in different formats: http://httpd.apache.org/docs/1.3/multilogs.html
Create a second log file containing only the minimal data you need for your real-time log monitoring. In this case, you can keep the user-agent string etc. out of the log in the first place.
Based on your example log lines, this could halve the amount of data PHP has to load.
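As a sketch of that suggestion (the format string, nickname, and log path are illustrative, not from the linked page), a second trimmed-down log could be declared alongside the main one:

# %h = client IP, %t = time, %r = request line, %>s = status, %b = bytes sent
LogFormat "%h %t \"%r\" %>s %b" hotlink
CustomLog /var/log/apache2/hotlink_access.log hotlink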
A solution would be to store the log in a MySQL database. You could write a C program to parse the log file and insert it into MySQL; that would be an order of magnitude faster, and it's not very difficult. Another option would be Python, though I think using a database is necessary either way. You can use a full-text index to match your strings, and Python can also be compiled to a binary, which makes it more efficient. As for the size concern: the log file grows incrementally, so you don't have to swallow 10 GB at once.
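A rough sketch of that suggestion on the MySQL side (the table and column names are illustrative); note that FULLTEXT indexes required the MyISAM engine before MySQL 5.6:

-- Store raw log lines and let a FULLTEXT index handle the matching
CREATE TABLE access_log (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    line TEXT NOT NULL,
    FULLTEXT KEY ft_line (line)
) ENGINE=MyISAM;

-- e.g. count requests for a given image id (the numeric token is indexable)
SELECT COUNT(*) FROM access_log
WHERE MATCH(line) AGAINST('+11441268' IN BOOLEAN MODE);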
I'd personally send the log entries to a running script instead. Apache allows this by starting the log filename with a pipe (|). If that doesn't work, you can create a FIFO as well (see mkfifo).
The running script (whatever it is) can buffer x lines and do what it needs to do based on them. Reading the data isn't all that hard, and it shouldn't be where your bottleneck is.
I do suspect you'll run into issues with your INSERT statements on the database, though.
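A sketch of the piped-log idea (the script path, batch size, and flush_batch() helper are all hypothetical): Apache starts the program once and writes every log line to its stdin, so the consumer can batch lines and hit the database far less often.

CustomLog "|/usr/local/bin/log_consumer.php" combined

#!/usr/bin/php
<?php
// Hypothetical piped-log consumer: read lines from stdin as Apache writes
// them, buffer a batch, then flush it in one go to reduce query count.
function flush_batch(array $lines) {
    // Stand-in for the real work: parse each line and run the
    // INSERT ... ON DUPLICATE KEY UPDATE statements, ideally in one batch.
}

$buffer = array();
while (($line = fgets(STDIN)) !== false) {
    $buffer[] = $line;
    if (count($buffer) >= 500) {
        flush_batch($buffer);
        $buffer = array();
    }
}
if ($buffer) flush_batch($buffer); // flush whatever is left on shutdown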