C++ AccessLog parser using MMap/OpenMP - problem with correctly counting media stream web access events

Posted on 2024-12-01 23:18:49

Hello, and first of all, sorry if the problem description sounds strange and imprecise. It's not easy for me to describe my complex problem in English, but I hope you will understand what I mean.

I made a CLI tool for parsing web server access logs. I focused on performance and flexibility of use.

To that end, I use mmap to map the log files into memory, and then I pass the memory-mapped char* to a parallel OpenMP processing loop.

In the omp parallel for loop, I parse several informative substrings from every single log string using boost::regex_search and store the event data in a thread-local object of a custom LogEvent type.

After creating this LogEvent object from the current string, I append it to a vector and proceed with parsing the next string, and so on.
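The pipeline described above could look roughly like the following minimal sketch. It rests on several assumptions: std::regex stands in for boost::regex so the example needs no external library, the buffer is taken as a pointer plus length the way mmap would provide it (the mapping itself is omitted), and LogEvent, index_lines and parse_all are hypothetical names with a single hard-coded IP field instead of the user-configured ones.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical event type; the real LogEvent holds user-configured fields.
struct LogEvent {
    std::size_t line_no;  // position of the line in the file
    std::string ip;
};

// Record where each line starts so the parallel loop can index lines directly.
std::vector<std::size_t> index_lines(const char* data, std::size_t len) {
    std::vector<std::size_t> starts;
    if (len > 0) starts.push_back(0);
    for (std::size_t i = 0; i + 1 < len; ++i)
        if (data[i] == '\n') starts.push_back(i + 1);
    return starts;
}

std::vector<LogEvent> parse_all(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const std::regex ip_re(R"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})");
    const std::vector<std::size_t> starts = index_lines(data, len);
    const long n = static_cast<long>(starts.size());
    std::vector<LogEvent> all;

    #pragma omp parallel
    {
        std::vector<LogEvent> local;  // thread-local results, no locking needed
        #pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            const char* first = data + starts[i];
            const char* last = (i + 1 < n) ? data + starts[i + 1] : data + len;
            std::cmatch m;
            if (std::regex_search(first, last, m, ip_re))
                local.push_back({static_cast<std::size_t>(i), m.str()});
        }
        // Merge step: the order in which threads arrive here is not fixed,
        // so `all` is generally NOT in file order after this point.
        #pragma omp critical
        all.insert(all.end(), local.begin(), local.end());
    }
    return all;
}
```

Without -fopenmp the pragmas are ignored and the loop simply runs serially. The line_no member is kept so that file order can still be recovered after the merge.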

The tricky thing is that I parse a user configuration file at program start. The user can define multiple "data fields" by specifying a field name and a regex that will match the data.

E.g.:

Time = \d{2}\/\w{3}\/\d{4}
IP = \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Object = \d{2,8}\_w\d{1,3}\.mp4|\d{2,10}\.flv

Further, the user can specify the order in which the output report data will be generated.

E.g.:

field_0 = %IP%
field_1 = %Object%
field_2 = %Time%

The output strings could look like:

10.20.30.1;video_xyz.flv;Jul/23/2011:11:12;3 
10.20.30.1;video_xyz.flv;Jul/23/2011:11:17;1 
10.20.30.1;video_xyz.flv;Jul/23/2011:11:18;12
10.11.30.1;video_xyz.blabla.mp4;Jul/23/2011:11:12;3
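To illustrate how the field definitions and the output order fit together, here is a minimal sketch. The Config struct and format_line are hypothetical names, std::regex is used in place of boost::regex, and the hit count in the last column of the sample output is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical in-memory form of the user configuration.
struct Config {
    std::map<std::string, std::regex> fields;  // field name -> pattern
    std::vector<std::string> order;            // e.g. {"%IP%", "%Object%"}
};

// Extract every configured field from one log line and join the values
// with ';' in the configured order. A field with no match is left empty.
std::string format_line(const std::string& line, const Config& cfg) {
    std::string out;
    for (std::size_t i = 0; i < cfg.order.size(); ++i) {
        const std::string& token = cfg.order[i];
        assert(token.size() >= 2 && token.front() == '%' && token.back() == '%');
        const std::string name = token.substr(1, token.size() - 2);
        if (i > 0) out += ';';
        const auto it = cfg.fields.find(name);
        std::smatch m;
        if (it != cfg.fields.end() && std::regex_search(line, m, it->second))
            out += m.str();
    }
    return out;
}
```

Compiling each pattern once at startup and reusing the std::regex objects for every line keeps the per-line cost down to the searches themselves.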

The problem I have is that streaming a video file causes several access events in the log. I cannot reliably recognize someone who is just reloading or buffering the stream, because different client platforms behave differently with respect to the server response codes they trigger.

Right now I count such events multiple times, which is often wrong.
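One common heuristic for this kind of overcounting, shown here only as a sketch and not as a confirmed fix, is to treat repeated hits by the same client on the same object as a single view while they stay within a time window. The Hit struct, count_views and the window length are assumptions, and the input is assumed to be sorted by time.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical parsed hit; time is in seconds.
struct Hit {
    long long time;
    std::string ip;
    std::string object;
};

// Count views per "ip;object" key. A hit that follows the previous hit of
// the same key by at most `window` seconds belongs to the same view.
// Precondition: `hits` is sorted by time.
std::unordered_map<std::string, int> count_views(const std::vector<Hit>& hits,
                                                 long long window) {
    assert(window >= 0);
    std::unordered_map<std::string, long long> last_seen;
    std::unordered_map<std::string, int> views;
    for (const Hit& h : hits) {
        const std::string key = h.ip + ';' + h.object;
        const auto it = last_seen.find(key);
        if (it == last_seen.end() || h.time - it->second > window)
            ++views[key];          // first hit, or the gap is too long: new view
        last_seen[key] = h.time;   // every hit extends the current view
    }
    return views;
}
```

This is a single pass with hash lookups, so it does not require searching the whole data array for every event. The window length would have to be tuned to the actual player behaviour.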

How can I handle this problem? It's pretty general, I know, but if you think about my program and how I described it, you will soon see that the problem is hard to solve with my program design.

I found one or two workarounds, but each of them has a really bad performance impact and is not a legitimate solution.

Somehow I must avoid appending those LogEvents to the vector of LogEvent objects at parsing time, because up to that point the strings are still in correct chronological order, so I can compare the current string with the previous one, and so on.

After that point the omp critical phase begins and the thread-local results are combined. If I want to check for wrong multiple hit counts then, I have to search through the whole data array, which is a no-go.
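The loss of ordering described here comes from the merge, not from the parallel parsing itself. As a sketch under assumptions (Event and parse_ordered are hypothetical names, and a non-empty check stands in for the regex parsing), each thread can be given one contiguous block of lines and the thread-local vectors can be concatenated in thread order, which keeps the merged result in file order without a critical section or a sort.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical parsed event.
struct Event {
    std::size_t line_no;
    std::string text;
};

// Parse lines in parallel, but keep the merged result in file order.
std::vector<Event> parse_ordered(const std::vector<std::string>& lines) {
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<std::vector<Event>> per_thread(max_threads);

    #pragma omp parallel
    {
        int tid = 0, nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        assert(tid < max_threads);
        // Thread `tid` handles the contiguous block [begin, end) of lines.
        const std::size_t n = lines.size();
        const std::size_t begin = n * tid / nthreads;
        const std::size_t end = n * (tid + 1) / nthreads;
        std::vector<Event> local;
        for (std::size_t i = begin; i < end; ++i)
            if (!lines[i].empty())  // stand-in for the regex parsing
                local.push_back({i, lines[i]});
        per_thread[tid] = std::move(local);  // each thread writes only its own slot
    }

    // Blocks are contiguous and ascending, so thread order equals file order.
    std::vector<Event> all;
    for (const auto& part : per_thread)
        all.insert(all.end(), part.begin(), part.end());
    return all;
}
```

With the merged vector still in chronological order, a single sequential pass can compare each event with its predecessors. Whether that pass is cheap enough compared to the parsing would need to be measured.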

I hope my problem is clear enough. Any ideas? (I don't know if sample code would help, because I think it's more of a design problem.)