Logfile analysis in R?
I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. I may not be the first to think of doing it in R, but R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?
In connection with a project to build an analytics toolbox for our Network Ops guys,
I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my github repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away, though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but I have no idea).
I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open source and free log file parsers/viewers (including the excellent Webalyzer and AWStats), but none of them parses server error logs (parsing server access logs is the primary use case for both).
If you are not familiar with error logs or with the difference between them and access
logs: in sum, Apache servers (likewise nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X,
that directory is in /var, just below root:
For network diagnostics, error logs are often far more useful than the access logs.
They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and more significantly, because the data file
you are left with after parsing is an irregular time series--you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
I wanted an app that I could toss raw error logs into (of any size, but usually several hundred MB at a time) and have something useful come out the other end--which in this case had to be some pre-packaged analytics, plus a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in Python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization I coded in R.
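To illustrate that gridding step, here is a minimal sketch in base R (the column names and timestamps are invented for illustration; this is not the project's actual code) that snaps irregular error-log timestamps onto a regular one-minute grid:

```r
# Invented sample of parsed error-log entries: several events can share
# one timestamp, then nothing for a while -- an irregular time series.
log <- data.frame(
  ts  = as.POSIXct(c("2011-06-01 10:00:01", "2011-06-01 10:00:01",
                     "2011-06-01 10:00:04", "2011-06-01 10:02:17"),
                   tz = "UTC"),
  msg = c("a", "b", "c", "d")
)

# Snap each timestamp to the start of its minute.
log$minute <- as.POSIXct(trunc(log$ts, units = "mins"))

# Count events per minute, then reindex onto a complete minute grid so
# that empty intervals show up as zeros rather than missing rows.
counts  <- aggregate(msg ~ minute, data = log, FUN = length)
grid    <- data.frame(minute = seq(min(log$minute), max(log$minute), by = "min"))
regular <- merge(grid, counts, all.x = TRUE)
regular$msg[is.na(regular$msg)] <- 0
```

At the sizes mentioned above, data.table's grouped counts and rolling joins would do the same job considerably faster, but the shape of the computation is the same.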
I have been building analytics tools for a long time, but only in the past
four years have I been using R. So my first impression--immediately upon parsing a raw log file and loading the data frame in R--was what a pleasure R is to work with and how well suited it is for tasks of this sort. A few welcome surprises:
Serialization. Persisting working data in R is a single command
(save). I knew this, but I didn't know how efficient this binary
format is. The actual data: for every 50 MB of raw logfiles parsed, the
.RData representation was about 500 KB--100 : 1 compression. (Note: I
pushed this down further, to about 300 : 1, by using the data.table
library and manually setting the compression-level argument to the save
function);
IO. My data warehouse relies heavily on a lightweight data structure
server that resides entirely in RAM and writes to disk
asynchronously, called redis. The project itself is only about two
years old, yet there's already a redis client for R in CRAN (by B.W.
Lewis, version 1.6.1 as of this post);
Primary Data Analysis. The purpose of this project was to build a
library for our Network Ops guys to use. My goal was a "one command =
one data view" type of interface. So for instance, I used the excellent
googleVis package to create professional-looking
scrollable/paginated HTML tables with sortable columns, into which I
loaded a data frame of aggregated data (>5,000 lines). Just those few
interactive elements--e.g., sorting a column--delivered useful
descriptive analytics. As another example, I wrote a lot of thin
wrappers over some basic data-juggling and table-like functions; each
of these functions I would, for instance, bind to a clickable button
on a tabbed web page. Again, this was a pleasure to do in R, in part
because quite often the function required no wrapper; a single
command with the arguments supplied was enough to generate a useful
view of the data.
A couple of examples of the last bullet:
The Primary Data Cube Displayed for Interactive Analysis Using googleVis:
A contingency table (from an xtab function call) displayed using googleVis:
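A minimal sketch of the googleVis call behind a view like this (the data frame here is invented, the options shown are just one reasonable choice, and the googleVis package must be installed):

```r
library(googleVis)

# Invented aggregate: error hits per HTTP status code.
agg <- data.frame(
  status = c(404, 500, 502),
  hits   = c(1200, 87, 15)
)

# gvisTable renders a sortable, paginated HTML table from a data frame
# in a single call -- the "one command = one data view" idea.
tbl <- gvisTable(agg, options = list(page = "enable", pageSize = 25))
# plot(tbl)  # opens the table in a browser
```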
It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning algorithms, has three different regexp engines for parsing, etc.
And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs, but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.
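A toy sketch of that proactive idea in base R (the numbers are invented, and the rule is deliberately crude): flag minutes where the error count jumps well above its trailing average:

```r
# Invented errors-per-minute series with one obvious spike.
errs <- c(2, 3, 2, 4, 3, 25, 3, 2)

# Trailing 5-minute moving average (NA for the first few minutes).
base <- stats::filter(errs, rep(1/5, 5), sides = 1)

# Crude "hot spot" rule: flag any minute more than 3x its recent baseline.
hot <- which(errs > 3 * base)
```

A real model would replace the fixed multiplier with a fitted time-series forecast, but the loop is the same: read logs, model the rate, flag the departures.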
I have used R to load and parse IIS log files with some success; here is my code.
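The code block itself did not survive on this page, so as a stand-in, here is a minimal sketch of reading an IIS W3C-style log with base R (the field names and sample lines are assumptions based on a typical `#Fields:` header, not the original author's code):

```r
# Invented sample lines in the IIS W3C extended format; the directives
# starting with "#" are skipped via comment.char.
sample <- c(
  "#Fields: date time c-ip cs-method cs-uri-stem sc-status time-taken",
  "2011-06-01 10:00:01 10.0.0.5 GET /index.html 200 15",
  "2011-06-01 10:00:03 10.0.0.6 GET /missing.gif 404 2"
)

logs <- read.table(text = sample, sep = " ", comment.char = "#",
                   col.names = c("date", "time", "c_ip", "method",
                                 "uri", "status", "time_taken"),
                   stringsAsFactors = FALSE)

# Combine the separate date and time fields into one POSIXct timestamp.
logs$ts <- as.POSIXct(paste(logs$date, logs$time), tz = "UTC")
```

For a real log, replace `text = sample` with `file = "path/to/u_ex....log"` and take the column names from that file's own `#Fields:` line.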
I did a logfile analysis recently using R. It was no really complex thing, mostly descriptive tables. R's built-in functions were sufficient for this job.
The problem was data storage, as my logfiles were about 10 GB. Revolution R does offer new methods to handle such big data, but in the end I decided to use a MySQL database as a backend (which in fact reduced the size to 2 GB through normalization).
That could also solve your problem of reading logfiles into R.
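A sketch of that database-backend pattern via the DBI interface (shown with an in-memory SQLite database so it runs standalone; for the MySQL setup described above you would swap in an RMySQL connection with real host/user/password details -- the table and columns here are invented):

```r
library(DBI)

# Stand-in for the MySQL backend: an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Pretend this table was populated by the log loader.
dbWriteTable(con, "hits",
             data.frame(status = c(200, 404, 200),
                        bytes  = c(512, 0, 1024)))

# Let the database do the aggregation and pull only the small result
# into R -- this is what keeps a 10 GB log workable.
by_status <- dbGetQuery(con,
  "SELECT status, COUNT(*) AS n, SUM(bytes) AS bytes
   FROM hits GROUP BY status")

dbDisconnect(con)
```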
Demo output:
This format can easily be read into R using read.csv. And it doesn't require any third-party libraries.
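For example, assuming the demo output above is plain comma-separated text (the column names and values here are invented), loading it is one call:

```r
# Invented stand-in for the demo output shown above.
csv_lines <- c("ts,client,status",
               "2011-06-01 10:00:01,10.0.0.5,200",
               "2011-06-01 10:00:03,10.0.0.6,404")

# read.csv is base R -- no third-party libraries needed.
logs <- read.csv(text = csv_lines, stringsAsFactors = FALSE)
```

With a file on disk, `read.csv("parsed_log.csv")` does the same thing.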