Efficiently parsing Apache logs in PHP
Ok, this is the scenario: I need to parse my logs to find out how many times image thumbnails have been downloaded without the "large image" page actually being viewed.
This is basically a hotlink-protection system based on the ratio of "thumb" to "full" image views.
Considering the server is constantly bombarded with requests for thumbnails, the most efficient solution seems to be buffered Apache logs that flush to disk once every, say, 1 MB, which are then parsed periodically.
My question is this: how do I parse an Apache log in PHP to save the data, with the following being true:
- The log is written to and updated in real time, and I need my PHP script to be able to read it while this is happening
- The PHP script has to "remember" which parts of the log it has already read, so it never reads the same part twice and skews the data
- Memory consumption should be kept to a minimum, since the logs can easily reach 10 GB of data in a few hours
The PHP logger script would be called once every 60 seconds and process as many log lines as it can in that time.
I've tried hacking some code together, but I'm having trouble keeping memory usage to a minimum and finding a way to track the file pointer against a "moving" file size.
Here's a part of the log:
212.180.168.244 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441268.jpg HTTP/1.1" 200 3072 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
122.53.168.123 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441276.jpg HTTP/1.1" 200 3007 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
143.22.203.211 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441282.jpg HTTP/1.1" 200 4670 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
Attaching the code for your review here:
<?php
// Meant to be run from cron once a minute; the time limit is disabled so a
// long batch can finish.
error_reporting(E_ALL);
ini_set('display_errors', 1);
set_time_limit(0);
include(dirname(__FILE__).'/../kframework/kcore.class.php');
$aj = new kajaxpage;
$aj->use_db = 1;
$aj->init();
$db = kdbhandler::getInstance();
$d = kdebug::getInstance();
$d->debug = TRUE;
$d->verbose = TRUE;
$log_file = "/var/log/nginx/access.log"; //full path to log file when run by cron
$pid_file = dirname(__FILE__)."/../kframework/cron/cron_log.pid";
//$images_id = array("8308086", "7485151", "6666231", "8343336");
// The "pid" file stores "<timestamp> <byte offset>" from the previous run, so
// the script can resume where it left off and never count a line twice.
if (file_exists($pid_file)) {
    $pid = file_get_contents($pid_file);
    $temp = explode(" ", $pid);
    $pid_timestamp = $temp[0];
    $now_timestamp = strtotime("now");
    //if (($now_timestamp - $pid_timestamp) < 90) return;
    $pointer = isset($temp[1]) ? (int) $temp[1] : 0;
    // If the log was rotated or truncated since the last run, start over
    if ($pointer > filesize($log_file)) $pointer = 0;
} else {
    $pointer = 0;
}
// Captures: 1 = client IP, 2 = timestamp, 3 = request line, 4 = status, 5 = size
$pattern = "/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[^\[]*\[([^\]]*)\][^\"]*\"([^\"]*)\"\s([0-9]*)\s([0-9]*)(.*)/";
$last_time = 0;
$lines_processed = 0;
if ($fp = fopen($log_file, "r")) { // read-only is enough; "r+" risked accidental writes
    fseek($fp, $pointer); // resume at the saved offset
    while (!feof($fp)) {
        //if ($lines_processed > 100) exit;
        $lines_processed++;
        $log_line = trim(fgets($fp));
        if (!empty($log_line)) {
            // Reset per-line state so a non-matching line can't inherit the
            // previous line's $type/$imgid and skew the counters
            unset($type, $imgid);
            // Skip lines that don't match the expected log format
            if (!preg_match($pattern, $log_line, $m)) continue;
            //print_r($m);
            $size = $m[5];
            $m[3] = str_replace("GET ", "", $m[3]);
            $m[3] = str_replace("HTTP/1.1", "", $m[3]);
            $m[3] = str_replace(".jpg/", ".jpg", $m[3]);
            if (substr($m[3], 0, 3) == "/t/") {
                // end() needs a variable, not a function result, so split this up
                $parts = explode("/", $m[3]);
                $get = explode("-", end($parts));
                $imgid = $get[0];
                $type = 'thumb';
            }
            elseif (substr($m[3], 0, 5) == "/img/") {
                $get1 = explode("/", $m[3]);
                $get2 = explode("-", $get1[2]);
                $imgid = $get2[0];
                $type = 'raw';
            }
            echo $m[3];
            // put here your sql insert or update
            if (isset($type)) {
                $imgid = (int) $imgid; // cast only when a pattern matched
                if ($imgid != 1) {
                    switch ($type) {
                        case 'thumb':
                            //use the second slave in the registry
                            $sql = $db->slave_query("INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1", 2);
                            echo "INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1";
                            break;
                        case 'raw':
                            //use the second slave in the registry
                            $sql = $db->slave_query("INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1", 2);
                            echo "INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1";
                            break;
                    }
                }
            }
            // $imgid - image ID
            // $size - image size
            // Checkpoint the offset every 30 seconds so a crash loses at most
            // half a minute of progress
            $timestamp = strtotime("now");
            if (($timestamp - $last_time) > 30) {
                file_put_contents($pid_file, $timestamp . " " . ftell($fp));
                $last_time = $timestamp;
            }
        }
    }
    // Final checkpoint; the timestamp is backdated so the (commented-out)
    // 90-second overlap guard above won't block the next run
    file_put_contents($pid_file, (strtotime("now") - 95) . " " . ftell($fp));
    fclose($fp);
}
?>
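For completeness, the script could be scheduled with a crontab entry along these lines (the script path here is hypothetical; point it at wherever the cron script actually lives):

* * * * * /usr/bin/php /path/to/kframework/cron/cron_log.php >/dev/null 2>&1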
Comments (4)
Maybe you could tweak my PHP version of tail to search for your last timestamp rather than counting lines, and then read lines from that point, dealing with them one by one?
I'd give it a try myself, as I'm a bit curious, but unfortunately I'm unable to do so right now :(
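The "PHP version of tail" mentioned above isn't included here, but the core idea might look like this minimal sketch (the function name and chunk size are illustrative): seek backwards through the file in fixed-size chunks until the last occurrence of a marker string, such as the last processed timestamp, is found, and return its byte offset so the caller can read forward from there.

<?php
// Minimal sketch: return the byte offset of the LAST occurrence of $needle
// (e.g. the last timestamp already processed) by reading the file backwards
// in chunks, so a multi-GB log never has to be loaded whole.
function find_last_offset($path, $needle, $chunk = 8192) {
    $fp = fopen($path, "r");
    $pos = filesize($path);
    $buffer = "";
    while ($pos > 0) {
        $read = min($chunk, $pos);
        $pos -= $read;
        fseek($fp, $pos);
        // Prepend the new chunk so matches spanning chunk boundaries are caught
        $buffer = fread($fp, $read) . $buffer;
        $hit = strrpos($buffer, $needle);
        if ($hit !== false) {
            fclose($fp);
            return $pos + $hit;
        }
    }
    fclose($fp);
    return 0; // marker not found: start from the beginning
}
?>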
I know this answer is late, but it could still help (code can always be improved).
The 10 GB file size and the memory required sound like your main problems. Apache does support multiple log files, and the real power of multiple log files comes from the ability to create them in different formats: http://httpd.apache.org/docs/1.3/multilogs.html
Create a second log file containing only the minimal data you need for your real-time log monitoring. In this case, you can keep the user-agent string etc. out of the log in the first place.
Based on your example log lines, this could halve the amount of data PHP has to load.
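As a sketch of that suggestion (the format string, nickname, and log path are illustrative, not from the linked page), a second trimmed-down log could be declared alongside the main one:

# %h = client IP, %t = time, %r = request line, %>s = status, %b = bytes sent
LogFormat "%h %t \"%r\" %>s %b" hotlink
CustomLog /var/log/apache2/hotlink_access.log hotlink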
A solution would be to store the log in a MySQL database. You could write a C program to parse the log file and insert it into MySQL; that would be an order of magnitude faster, and it's not very difficult. Another option would be Python, though I think using a database is necessary either way. You can use a full-text index to match your strings, and Python can also be compiled to a binary, which makes it more efficient. As for the size concern: the log file grows incrementally, so you don't have to swallow 10 GB at once.
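A rough sketch of that suggestion on the MySQL side (the table and column names are illustrative); note that FULLTEXT indexes required the MyISAM engine before MySQL 5.6:

-- Store raw log lines and let a FULLTEXT index handle the matching
CREATE TABLE access_log (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    line TEXT NOT NULL,
    FULLTEXT KEY ft_line (line)
) ENGINE=MyISAM;

-- e.g. count requests for a given image id (the numeric token is indexable)
SELECT COUNT(*) FROM access_log
WHERE MATCH(line) AGAINST('+11441268' IN BOOLEAN MODE);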
I'd personally send the log entries to a running script instead. Apache allows this by starting the log filename with a pipe (|). If that doesn't work, you can create a FIFO as well (see mkfifo).
The running script (whatever it is) can buffer x lines and do what it needs to do based on them. Reading the data isn't all that hard, and it shouldn't be where your bottleneck is.
I do suspect you'll run into issues with your INSERT statements on the database, though.
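A sketch of the piped-log idea (the script path, batch size, and flush_batch() helper are all hypothetical): Apache starts the program once and writes every log line to its stdin, so the consumer can batch lines and hit the database far less often.

CustomLog "|/usr/local/bin/log_consumer.php" combined

#!/usr/bin/php
<?php
// Hypothetical piped-log consumer: read lines from stdin as Apache writes
// them, buffer a batch, then flush it in one go to reduce query count.
function flush_batch(array $lines) {
    // Stand-in for the real work: parse each line and run the
    // INSERT ... ON DUPLICATE KEY UPDATE statements, ideally in one batch.
}

$buffer = array();
while (($line = fgets(STDIN)) !== false) {
    $buffer[] = $line;
    if (count($buffer) >= 500) {
        flush_batch($buffer);
        $buffer = array();
    }
}
if ($buffer) flush_batch($buffer); // flush whatever is left on shutdown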