Displaying access log analysis

Posted on 2024-08-14 07:24:49

I'm doing some work to analyse the access logs from a Catalyst web application. The data is from the load balancers in front of the web farm and totals about 35Gb per day. It's stored in a Hadoop HDFS filesystem and I use MapReduce (via Dumbo, which is great) to crunch the numbers.

The purpose of the analysis is to try to establish a usage profile -- which actions are used most, what the average response time for each action is, whether the response was served from a backend or cache -- for capacity planning, optimisation and setting thresholds for monitoring systems. Traditional tools like Analog will give me the most-requested URL or most-used browser but none of that's useful for me. I don't need to know that /controller/foo?id=1984 is the most popular URL; I need to know what the hit rate and response time are for all hits to /controller/foo, so I can see if there's room for optimisation or caching and try to estimate what might happen if hits for this action suddenly double.
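To make the "hits per action rather than per URL" idea concrete, here is a minimal sketch of the kind of normalisation I mean. The function name `action_for` and the rule that an action is the first two path segments are assumptions for illustration only; a real Catalyst application's dispatch rules may differ.

```python
from urllib.parse import urlsplit


def action_for(url, depth=2):
    """Collapse a request URL to an action name.

    Assumption (hypothetical rule): the action is the first `depth`
    path segments, with the query string and anything deeper dropped.
    """
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    return "/" + "/".join(segments[:depth])


if __name__ == "__main__":
    print(action_for("/controller/foo?id=1984"))
    print(action_for("/controller/foo/1984"))
```

With this rule, both /controller/foo?id=1984 and /controller/foo/1984 are counted under /controller/foo.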

I can easily break the data down into requests per action per period via MapReduce. The problem is displaying it in a digestible form and picking out important trends or anomalies. My output is of the form:

('2009-12-08T08:30', '/ctrl_a/action_a') (2440, 895)
('2009-12-08T08:30', '/ctrl_a/action_b') (2369, 1549)
('2009-12-08T08:30', '/ctrl_b/action_a') (2167, 0)
('2009-12-08T08:30', '/ctrl_b/action_b') (1713, 1184)
('2009-12-08T08:31', '/ctrl_a/action_a') (2317, 790)
('2009-12-08T08:31', '/ctrl_a/action_b') (2254, 1497)
('2009-12-08T08:31', '/ctrl_b/action_a') (2112, 0)
('2009-12-08T08:31', '/ctrl_b/action_b') (1644, 1089)

i.e., the keys are (time period, action) pairs and the values are (hits, cache hits) tuples for that action in that period. (I don't have to stick with this; it's just what I have so far.)
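For anyone who wants to experiment with the sample above, here is a rough sketch of how the lines could be read back and pivoted into one time series per action. It assumes each line is exactly two Python tuple literals separated by whitespace; the names `parse_line`, `series_by_action` and `cache_hit_ratio` are made up for this example.

```python
import ast
import re
from collections import defaultdict

# Two parenthesised tuple literals separated by whitespace (assumed layout).
LINE_RE = re.compile(r"^(\(.*?\))\s+(\(.*\))$")


def parse_line(line):
    """Return (period, action, hits, cache_hits) for one output line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected line format: %r" % line)
    period, action = ast.literal_eval(match.group(1))
    hits, cache_hits = ast.literal_eval(match.group(2))
    return period, action, hits, cache_hits


def series_by_action(lines):
    """Group records by action: {action: [(period, hits, cache_hits), ...]}."""
    series = defaultdict(list)
    for line in lines:
        period, action, hits, cache_hits = parse_line(line)
        series[action].append((period, hits, cache_hits))
    for points in series.values():
        points.sort()  # ISO-style timestamps sort chronologically as strings
    return dict(series)


def cache_hit_ratio(hits, cache_hits):
    """Fraction of requests served from cache; 0.0 when there were no hits."""
    return cache_hits / hits if hits else 0.0
```

Keying by action first makes it easier to compare each action against its own history rather than against the other actions.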

There are about 250 actions. They could be combined into a smaller number of groups but plotting the number of requests (or response time, etc.) for each action over time on the same graph probably won't work. Firstly it'll be way too noisy and secondly the absolute numbers don't matter too much -- a 100 req/min rise in requests for an often-used, lightweight, cacheable response is much less important than a 100 req/min rise in a seldom-used but expensive (maybe hits the DB) uncacheable response. On the same graph we wouldn't see the changes in requests for the little-used action.
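One possible way around the scale problem, sketched here only as an illustration, is to look at each action's change relative to its own previous level instead of the raw count. The function name `relative_change` and the figure of 50 req/min for a rarely used action are hypothetical.

```python
def relative_change(previous, current):
    """Fractional change from `previous` to `current`.

    Returns None when `previous` is 0, since the ratio is undefined.
    """
    if previous == 0:
        return None
    return (current - previous) / previous


if __name__ == "__main__":
    # The same +100 req/min rise on a busy action and on a rarely used one.
    print(relative_change(2440, 2540))  # busy action: a few percent
    print(relative_change(50, 150))     # rarely used action: triples
```

This only addresses scale; it says nothing about how expensive each request is, so it would still need weighting by something like backend (uncached) hits or response time.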

A static report isn't much good -- a huge table of numbers is hard to digest. If I aggregate by the hour we might miss important minute-by-minute changes.

Any suggestions? How're you handling this problem? I guess one way would be to somehow highlight significant changes in the rate of requests or response time per action. A rolling average and standard deviation might show this, but could I do something better?
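For reference, this is roughly what I have in mind for the rolling average and standard deviation approach. It is a sketch, not something I have run on the real data; the name `flag_anomalies`, the window of 30 periods and the threshold of 3 standard deviations are arbitrary assumptions.

```python
from collections import deque
from statistics import mean, stdev


def flag_anomalies(values, window=30, threshold=3.0):
    """Yield (index, value, z_score) for unusual points in one action's series.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the `window` points that precede it.
    """
    recent = deque(maxlen=window)
    for index, value in enumerate(values):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            if sigma > 0:  # a perfectly flat window gives no usable scale
                z_score = (value - mu) / sigma
                if abs(z_score) > threshold:
                    yield index, value, z_score
        recent.append(value)


if __name__ == "__main__":
    hits_per_minute = [100, 102, 98, 101, 99, 100, 250, 101]
    for index, value, z_score in flag_anomalies(hits_per_minute, window=5):
        print(index, value, round(z_score, 1))
```

One weakness is that a flagged spike then enters the window and inflates the standard deviation, which can mask further anomalies for the next `window` periods.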

What other metrics or figures could I generate?
