Perl: the "noisy log problem" - creating an array of regex queries from multiple arrays/hashes
Problem:
I need to pull data from auth logs for approximately 30 locations. The logs are in CSV format.
In order for the analysis to be useful, the log entries must be matched up with the hours of operation of the locations. The data is stored in directories named for the time period the data covers, e.g., data/june1-june30/. The CSV files are simply named with the location code,
e.g., LOC1.csv, LOC2.csv. Here is a sample of a typical log:
2010-06-01, 08:30:00 , 0
2010-06-01, 09:30:00 , 1
2010-06-01, 10:30:00 , 10
2010-06-01, 11:30:00 , 7
2010-06-01, 12:30:00 , 8
2010-06-01, 13:30:00 , 6
2010-06-01, 14:30:00 , 3
2010-06-01, 15:30:00 , 8
2010-06-01, 16:30:00 , 11
The entries show the number of successfully authenticated sessions during the time period indicated in the third field. The logs cover 24 hours a day, which is useless for analysis, since the hours of operation differ from location to location. The problem now becomes how to pull only the data that matches the hours of operation. The analysis must show activity during the hours of operation to be useful.
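As a side note, each line splits cleanly on commas once the stray spaces around the fields are trimmed; a minimal sketch (the variable names are mine):

```perl
use strict;
use warnings;

# Split one log line on commas and trim stray whitespace around fields,
# matching the sample format above.
my $line = '2010-06-01, 08:30:00 , 0';
my ($date, $time, $count) = map { s/^\s+|\s+$//gr } split /,/, $line;
# $date is '2010-06-01', $time is '08:30:00', $count is '0'
```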
Setup - so far
I decided to create a config file using YAML with arrays/hashes for each location.
e.g.,
- branch: headquarters
abbrev: HQ
months: [04, 06]
DOW: [M, T, W, Th]
hours:
M: [12, 13, 14, 15, 16, 17, 18]
T: [12, 13, 14, 15, 16, 17, 18]
W: [09, 10, 11, 12, 13, 14, 15,
16, 17, 18]
Th: [12, 13, 14, 15, 16, 17, 18,
19, 20]
The months designation shows the busiest months, as that's all we care about.
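Once this YAML is loaded (e.g. with the YAML module's LoadFile), each branch entry becomes a hash reference. Written out as plain Perl, the headquarters entry above is equivalent to:

```perl
use strict;
use warnings;

# The headquarters entry from the YAML above, as the Perl structure
# a YAML loader would hand back.
my $branch = {
    branch => 'headquarters',
    abbrev => 'HQ',
    months => [ '04', '06' ],
    DOW    => [ qw(M T W Th) ],
    hours  => {
        M  => [ 12 .. 18 ],
        T  => [ 12 .. 18 ],
        W  => [ 9 .. 18 ],
        Th => [ 12 .. 20 ],
    },
};

# Hours for a given day are then a simple lookup:
my @wednesday = @{ $branch->{hours}{W} };   # 09 through 18
```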
Where I'm at
The code will find the appropriate directories using the months array, then pull the correct CSV files using the abbrev value. So I have the files I need stored in an array, @files. My question comes down to design. The results must be matched to the appropriate dates for each month: Mondays, Tuesdays, etc. Do I create month arrays storing the dates for each day of the week?
I'm stuck and unsure where to go from here.
To clarify: The code already pulls the correct files and loads them into an array (using globbing and File::Find) for each branch. The question is now about iterating through the @files array for each branch and pulling the info.
EDIT:
as per request, I will put up some code. This is the goods for getting hold of those files by the months indicated in the hash. That's the easy part.
use File::Find;   # for find()

foreach my $branch (@$config) {
    my $name   = $branch->{'branch'};
    my $months = $branch->{'months'};
    my $abbrev = $branch->{'abbrev'};

    # find directories for busy months, load in @dirs
    my @dirs;
    foreach my $month (@$months) {
        my $regex2 = qr/stats_2010-$month.*/;
        push @dirs, grep { $_ =~ $regex2 } @stats_dir;
    }

    # find csv files within directories, load in @files
    # (an anonymous sub keeps $abbrev properly scoped; a named sub
    # inside the loop triggers "variable will not stay shared")
    my @files;
    find(sub { push @files, $_ if /$abbrev\.csv$/ }, @dirs);
}
Output:
The output I'm hoping to get is: the lines from each file that fall within the hours of operation for that branch. I think they could be output to a separate file for the sake of simplicity, and in the same format. What makes it hard is that you have to match Mondays, Tuesdays, etc. with dates somehow. This is because the hours of operation differ from day to day.
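One possible shape for that matching step (a sketch under my own naming, not the asker's final code): turn each day's hours list into a hash for cheap membership tests, derive the weekday from the date with the core Time::Piece module, and keep a line only if its hour is open on that weekday. The day codes and hours come from the example YAML; only two days are shown:

```perl
use strict;
use warnings;
use Time::Piece;

# Time::Piece numbers days 0=Sunday .. 6=Saturday; map those to the
# day codes used in the YAML config.
my @code_for_dow = qw(Su M T W Th F Sa);

# Hours of operation for one branch (subset of the example config),
# as hash sets for O(1) membership tests.
my %hours = (
    M => { map { $_ => 1 } 12 .. 18 },
    W => { map { $_ => 1 } 9  .. 18 },
);

# True if a log line's date/time falls inside opening hours.
sub in_hours {
    my ($date, $time) = @_;                       # '2010-06-07', '12:30:00'
    my $t    = Time::Piece->strptime($date, '%Y-%m-%d');
    my $code = $code_for_dow[ $t->day_of_week ];  # e.g. 'M' for a Monday
    my ($hour) = $time =~ /^\s*(\d+)/;
    return exists $hours{$code} && $hours{$code}{ $hour + 0 };
}
```

Filtering a file is then just `print $out $line if in_hours($date, $time);` inside the read loop.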
Am I making the problem harder than it needs to be? I've sat with this too long and am hoping for a fresh set of eyes to set me straight. My Perl is OK, but I need some help in the design/algorithm dept. I can figure out how to Perlify it, I think. But feel free to post Perl. I love reading good Perl!
Eventually I will average the activity for the Mondays, Tuesdays, etc. of each month.
Thanks ~
Bubnoff
The solution I'm using is from dlamblin ( Thanks again for your help!! ).
Here's the adjusted YAML config:
Here's the Perl:
Convert the day of week to a number, with Monday as 1 and Sunday as 7. Then create a hash that looks like
1=>{12=>1,13=>1,14=>1,15=>1,16=>1,17=>1,18=>1},2=>{12=>1,13=>1,14=>1,15=>1,16=>1,17=>1,18=>1},...
(notice how DOW in your YAML is redundant). So far:
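A sketch of that lookup hash, built from the example config (the Monday=1 numbering is from the answer; the variable names are mine):

```perl
use strict;
use warnings;

# Day code to number, Monday=1 .. Sunday=7, as the answer suggests.
my %num_for = ( M => 1, T => 2, W => 3, Th => 4, F => 5, Sa => 6, Su => 7 );

# Hours for one branch, as loaded from the YAML config.
my %hours = (
    M  => [ 12 .. 18 ],
    T  => [ 12 .. 18 ],
    W  => [ 9 .. 18 ],
    Th => [ 12 .. 20 ],
);

# Build 1=>{12=>1,...}, 2=>{12=>1,...}, ... for constant-time checks.
my %open;
while ( my ($day, $hrs) = each %hours ) {
    $open{ $num_for{$day} } = { map { $_ => 1 } @$hrs };
}

# $open{3}{9} is true: Wednesday (3) opens at hour 09.
```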
There are modules that will give you the DOW from the date and time you have, but if they are too heavy you can
use Time::Local.
Parse the date (you'll need to do this anyway, I think) on each line and feed it through timelocal, then through localtime, which will give you the DOW. You'll have to massage $mon and $year appropriately. Once you've got the DOW for the entry, you'll know what to do with it.
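That recipe could look like the following sketch (the sub name is mine; using noon sidesteps DST boundary cases):

```perl
use strict;
use warnings;
use Time::Local;

# Day of week (Monday=1 .. Sunday=7) for a log date like '2010-06-01'.
sub dow_of {
    my ($date) = @_;
    my ($y, $m, $d) = split /-/, $date;
    # timelocal takes month 0-11; 4-digit years are used as-is.
    my $epoch = timelocal(0, 0, 12, $d, $m - 1, $y);
    my $wday  = (localtime $epoch)[6];   # 0=Sunday .. 6=Saturday
    return $wday == 0 ? 7 : $wday;       # remap so Monday is 1
}
# dow_of('2010-06-01') is 2: June 1, 2010 was a Tuesday.
```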