C++ AccessLog parser using MMap/OpenMP: how to correctly count web access events for media streams
Hello, and first of all, sorry if the problem description sounds strange and imprecise. It's not easy for me to describe my complex problem in English, but I hope you will understand what I mean.
I made a CLI tool for parsing web server access logs. I focused on performance and flexibility of use.
To that end, I use mmap to read the log files into memory and then pass the memory-mapped char* to a parallel OpenMP processing loop.
In the omp parallel for loop I parse several informative substrings from every single log string using boost::regex_search and store the event data in a thread-local object of a custom LogEvent type.
After creating this LogEvent object from the current string, I append it to a vector and proceed with parsing the next string, and so on.
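The loop described above might look roughly like the following sketch. It is only an illustration under several assumptions: the real LogEvent class is not shown in the post, so the struct here is hypothetical, std::regex stands in for boost::regex to keep the example self-contained, and the buffer passed to parse_all is assumed to be the memory-mapped file.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type; the original LogEvent class is not shown.
struct LogEvent {
    std::size_t line_no;  // index of the line among the non-empty lines
    std::string ip;
};

using Line = std::pair<const char*, const char*>;  // [begin, end) of one line

// Record where each non-empty line starts and ends, without copying the data.
std::vector<Line> split_lines(const char* data, std::size_t size) {
    std::vector<Line> lines;
    const char* const stop = data + size;
    for (const char* p = data; p < stop;) {
        const char* nl = std::find(p, stop, '\n');
        if (nl != p) lines.emplace_back(p, nl);
        p = (nl == stop) ? stop : nl + 1;
    }
    return lines;
}

// Parse every line in parallel; each thread fills its own vector and the
// vectors are merged inside a critical section.
std::vector<LogEvent> parse_all(const char* data, std::size_t size) {
    const std::regex ip_re(R"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})");
    const std::vector<Line> lines = split_lines(data, size);
    std::vector<LogEvent> all;

    #pragma omp parallel
    {
        std::vector<LogEvent> local;  // thread-local results
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(lines.size()); ++i) {
            std::cmatch m;
            if (std::regex_search(lines[i].first, lines[i].second, m, ip_re))
                local.push_back({static_cast<std::size_t>(i), m.str()});
        }
        #pragma omp critical
        all.insert(all.end(), local.begin(), local.end());
    }
    return all;
}
```

Threads reach the critical section in no particular order, so the merged vector is generally not chronological. That is why this sketch stores the line index in each event. Compiled without OpenMP support, the pragmas are ignored and the same code runs serially.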
The tricky part is that I parse a user configuration file at program start. The user can define multiple "data fields" by specifying a field name and a regex that matches the data.
E.g.:
Time = \d{2}\/\w{3}\/\d{4}
IP = \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Object = \d{2,8}\_w\d{1,3}\.mp4|\d{2,10}\.flv
Furthermore, the user can specify the order in which the output report data is generated.
E.g.:
field_0 = %IP%
field_1 = %Object%
field_2 = %Time%
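To make the configuration idea concrete, here is a small sketch of how such field definitions and the output order could be applied to one log line. The names FieldDefs, parse_field_line and format_line are made up for this illustration, std::regex is used instead of boost::regex, and the trailing count column of the report is left out. It is not the tool's actual code.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Field name -> pattern, as read from lines like "IP = <regex>".
using FieldDefs = std::map<std::string, std::regex>;

// Strip leading and trailing spaces and tabs.
static std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return std::string();
    const std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Parse one "Name = pattern" line; returns false if there is no '='.
bool parse_field_line(const std::string& line, FieldDefs& defs) {
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos) return false;
    defs[trim(line.substr(0, eq))] = std::regex(trim(line.substr(eq + 1)));
    return true;
}

// Build one report line; `order` holds placeholders such as "%IP%".
// Unknown or unmatched fields produce an empty column.
std::string format_line(const std::string& log_line, const FieldDefs& defs,
                        const std::vector<std::string>& order) {
    std::string out;
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (i > 0) out += ';';
        const std::string& ph = order[i];
        if (ph.size() < 2) continue;
        // Strip the surrounding '%' signs to get the field name.
        const auto it = defs.find(ph.substr(1, ph.size() - 2));
        std::smatch m;
        if (it != defs.end() && std::regex_search(log_line, m, it->second))
            out += m.str();
    }
    return out;
}
```

Compiling each regex once at startup and reusing it for every line keeps the per-line cost down to the search itself.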
The output strings could look like this:
10.20.30.1;video_xyz.flv;Jul/23/2011:11:12;3
10.20.30.1;video_xyz.flv;Jul/23/2011:11:17;1
10.20.30.1;video_xyz.flv;Jul/23/2011:11:18;12
10.11.30.1;video_xyz.blabla.mp4;Jul/23/2011:11:12;3
The problem I have is that streaming a video file causes several access events in the log. I cannot reliably recognize someone who is just reloading or buffering the stream, because different client platforms behave differently when it comes to the server response codes they generate.
Right now I count such events multiple times, which is often wrong.
How can I handle this problem? It's pretty general, I know, but if you think about my program and how I described it, you will soon see that the problem is hard to solve with my program design.
I found one or two workarounds, but each of them has a really bad performance impact and is not a legitimate solution.
Somehow I must avoid appending those LogEvents to the vector of LogEvent objects at parsing time, because up to that point the strings are still in the correct chronological order, so I can compare the current string with the previous one, and so on.
After that point the omp critical phase begins and the thread-local results are combined. If I wanted to check for wrong multiple hit counts then, I would have to search through the whole data array, which is a no-go.
I hope my problem is clear enough. Any ideas? (I don't know if sample code would help, because I think it's more a design problem.)
Comments (1)
OK, I finally found a workaround that I can live with for a while.
While parsing the strings, I now always grab the IP address and the target item from every log string.
I have a thread-local map that stores the IP address as key and the target item (e.g. a video stream) as value.
Whenever I want to count a log event, I first check whether the IP address of the currently processed log string is already a key of my thread-local map.
If it's not, it's safe to count the event. I then add the current IP as key and the object as value, which means I record the last accessed object for this specific IP.
If it is already a key of my map, I check whether this key's value (the target item) is identical to the current log string's target.
If so, that can only mean that the last time this user accessed anything on my server, it was the same video stream, so I don't count the event.
I only continue to count events from this IP address once the object has changed.
It is very unlikely that a user switches from one stream to another and then back (and even if they did, it would be correct to count it), so a changed object looks like a genuinely new event that we really want to count.
This works somewhat like reversed greylisting: any IP is counted just once and then blocked from counting until a new signature is generated by a new object.
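The check described above could be sketched as follows. The class name LastObjectFilter and the choice of std::unordered_map are assumptions made for this example, since the post only says "map"; one instance per thread would play the role of the thread-local map.

```cpp
#include <string>
#include <unordered_map>

// Remembers, per IP, the object that IP requested most recently.
// Hypothetical helper; intended to be used as one instance per thread.
class LastObjectFilter {
public:
    // Returns true if the event should be counted: either the IP is new,
    // or it requested a different object than the last time it was seen.
    bool should_count(const std::string& ip, const std::string& object) {
        auto it = last_.find(ip);
        if (it == last_.end()) {
            last_.emplace(ip, object);  // first event from this IP
            return true;
        }
        if (it->second == object) return false;  // same stream again: skip
        it->second = object;                      // object changed: count it
        return true;
    }

private:
    std::unordered_map<std::string, std::string> last_;
};
```

Each event costs one hash lookup. Because every thread keeps its own map, a repeated request that lands in another thread's share of the log is not compared against this map and would still be counted.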
Of course this has a performance impact too, so if you have any better ideas, feel free to answer :P