Free implementation for counting user sessions from a web server log?

Posted on 2024-09-24 17:06:52

Web server log analyzers (e.g. Urchin) often display a number of "sessions". A session is defined as a series of page visits / clicks made by an individual within a limited, continuous time segment. Analyzers try to identify these segments using IP addresses, often with supplementary information such as user agent and OS, along with a session timeout threshold such as 15 or 30 minutes.

For certain web sites and applications, a user can be logged in and/or tracked with a cookie, which means the server can precisely know when a session begins. I'm not talking about that, but about inferring sessions heuristically ("session reconstruction") when the web server does not track them.

I could write some code e.g. in Python to try to reconstruct sessions based on the criteria mentioned above, but I'd rather not reinvent the wheel. I'm looking at log files of a size around 400K lines, so I'd have to be careful to use a scalable algorithm.

My goal here is to extract a list of unique IP addresses from a log file, and for each IP address, to have the number of sessions inferred from that log. Absolute precision and accuracy are not necessary... pretty-good estimates are ok.

Based on this description:

"a new request is put in an existing session if two conditions are valid:
 * the IP address and the user-agent are the same of the requests already inserted in the session,
 * the request is done less than fifteen minutes after the last request inserted."

it would be simple in theory to write a Python program to build up a dictionary (keyed by IP) of dictionaries (keyed by user-agent) whose value is a pair: (number of sessions, latest request of latest session).
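As a rough illustration of that structure, here is a minimal sketch. The function name count_sessions and the (ip, user_agent, timestamp) tuple layout are assumptions made for this example, and it assumes the requests arrive in chronological order, as they normally do in an access log.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_sessions(requests, timeout=timedelta(minutes=15)):
    """Count inferred sessions per IP.

    `requests` is an iterable of (ip, user_agent, timestamp) tuples,
    assumed to be in chronological order.
    """
    # state[ip][user_agent] = [session_count, time_of_latest_request]
    state = defaultdict(dict)
    for ip, user_agent, when in requests:
        entry = state[ip].get(user_agent)
        if entry is None:
            # First request seen for this (IP, user agent) pair.
            state[ip][user_agent] = [1, when]
            continue
        if when - entry[1] >= timeout:
            entry[0] += 1  # gap reached the timeout: a new session starts
        entry[1] = when
    # Collapse the per-user-agent counts into one total per IP.
    return {ip: sum(count for count, _ in agents.values())
            for ip, agents in state.items()}
```

Each request costs a couple of dictionary lookups, so the work grows linearly with the number of log lines, and memory grows with the number of distinct (IP, user agent) pairs rather than with the number of requests.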

But I would rather try to use an existing implementation if one's available, since I might otherwise risk spending a lot of time tuning performance.

FYI lest someone ask for sample input, here is a line of our log file (sanitized):

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status 
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
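For a log in this layout, the column positions could be read from the #Fields header instead of being hard-coded. The following is only a sketch under that assumption; the helper name field_indexes is hypothetical, and the header and record are the sample lines shown above.

```python
def field_indexes(header_line):
    """Map each field name in a '#Fields:' header line to its column index."""
    prefix = "#Fields:"
    if not header_line.startswith(prefix):
        raise ValueError("not a #Fields header: %r" % header_line)
    names = header_line[len(prefix):].split()
    return {name: i for i, name in enumerate(names)}


header = ("#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
          "cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus "
          "sc-win32-status")
idx = field_indexes(header)

# The sample record from the log, split on whitespace.
record = ("2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - "
          "128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)"
          "+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) "
          "http://www.mysite.org/blarg.htm 200 0 0").split()

client_ip = record[idx["c-ip"]]
user_agent = record[idx["cs(User-Agent)"]]
```

Splitting on whitespace works here because spaces inside the user agent are written as "+" in this log.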

旧伤还要旧人安 2024-10-01 17:06:52

OK, in the absence of any other answer, here's my Python implementation. I'm not a Python expert. Suggestions for improvement are welcome.

#!/usr/bin/env python

"""Reconstruct sessions: Take a space-delimited web server access log
including IP addresses, timestamps, and User Agent,
and output a list of the IPs, and the number of inferred sessions for each."""

## Input looks like:
# Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
# 2010-09-21 23:59:59 172.21.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.site.org//baz.htm 200 0 0

import datetime
import operator

infileName = "ex100922.log"
outfileName = "visitor-ips.csv"

ipDict = {}

def inputRecords():
    infile = open(infileName, "r")

    recordsRead = 0
    progressThreshold = 100
    # 30-minute threshold; the description quoted below uses fifteen minutes.
    sessionTimeout = datetime.timedelta(minutes=30)

    for line in infile:
        # Skip "#Fields:" header/comment lines and blank lines.
        if line.startswith('#') or not line.strip():
            continue
        else:
            recordsRead += 1

            fields = line.split()
            # print("line of %d fields: %s\n" % (len(fields), line))
            if (recordsRead >= progressThreshold):
                print("Read %d records" % recordsRead)
                progressThreshold *= 2

            # http://www.dblab.ntua.gr/persdl2007/papers/72.pdf
            #   "a new request is put in an existing session if two conditions are valid:
            #    * the IP address and the user-agent are the same of the requests already
            #      inserted in the session,
            #    * the request is done less than fifteen minutes after the last request inserted."

            theDate, theTime = fields[0], fields[1]
            newRequestTime = datetime.datetime.strptime(theDate + " " + theTime, "%Y-%m-%d %H:%M:%S")

            ipAddr, userAgent = fields[8], fields[9]

            if ipAddr not in ipDict:
                ipDict[ipAddr] = {userAgent: [1, newRequestTime]}
            else:
                if userAgent not in ipDict[ipAddr]:
                    ipDict[ipAddr][userAgent] = [1, newRequestTime]
                else:
                    ipdipaua = ipDict[ipAddr][userAgent]
                    if newRequestTime - ipdipaua[1] >= sessionTimeout:
                        ipdipaua[0] += 1
                    ipdipaua[1] = newRequestTime
    infile.close()
    return recordsRead

def outputSessions():
    outfile = open(outfileName, "w")
    outfile.write("#Fields: IPAddr Sessions\n")
    recordsWritten = len(ipDict)

    # ipDict[ip] is { userAgent1: [numSessions, lastTimeStamp], ... }
    for ip, val in ipDict.items():
        # Sum the session counts over all user agents seen for this IP.
        totalSessions = sum(v2[0] for v2 in val.values())
        outfile.write("%s\t%d\n" % (ip, totalSessions))

    outfile.close()
    return recordsWritten

recordsRead = inputRecords()

recordsWritten = outputSessions()

print("Finished session reconstruction: read %d records, wrote %d\n" % (recordsRead, recordsWritten))

Update: This took 39 seconds to input and process 342K records and write 21K records. That's good enough speed for my purposes. Apparently 3/4 of that time was spent in strptime()!
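Since most of the time goes to strptime(), one possible speed-up is to slice the fixed-width date and time fields directly, which is usually cheaper than format-string parsing. This is a sketch I have not timed on this log; the helper name parse_timestamp is hypothetical, and it assumes every record uses the YYYY-MM-DD HH:MM:SS layout shown in the sample line.

```python
from datetime import datetime


def parse_timestamp(date_str, time_str):
    """Parse 'YYYY-MM-DD' and 'HH:MM:SS' by slicing fixed positions.

    No validation is done beyond what int() and datetime() perform,
    so a malformed field raises ValueError.
    """
    return datetime(int(date_str[0:4]), int(date_str[5:7]), int(date_str[8:10]),
                    int(time_str[0:2]), int(time_str[3:5]), int(time_str[6:8]))
```

In the script above, this would stand in for the strptime() call, e.g. newRequestTime = parse_timestamp(theDate, theTime).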
