One word of caution: at 100+ apps in a big shop, with hundreds, perhaps thousands, of hosts running those apps, steer clear of anything that induces tight coupling. This pretty much rules out connecting directly to SQL Server or any other database solution, because your application logging would then depend on the availability of the log repository.
Availability of the central repository is a little more complicated than just 'if you can't connect, don't log it', because usually the most interesting events occur when there are problems, not when things go smoothly. If your logging drops entries exactly when things turn interesting, it will never be trusted to solve incidents, and as such it will fail to gain traction and support from the other stakeholders (i.e. the application owners). If you decide that you can implement retention and retry of failed log delivery on your own, you are facing an uphill battle: it is not a trivial task and is much more complex than it sounds, starting from efficient and reliable storage of the retained information and ending with good retry and intelligent fallback logic.
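To make the retain-and-retry point concrete, here is a minimal sketch of the idea in Python (chosen as a neutral illustration language; `deliver` is a hypothetical callable standing in for whatever transport you use). Entries that fail to send are spooled to local disk and retried before each new send:

```python
import json
import os

class RetainingLogSender:
    """Sketch only: spool entries to a local file when delivery fails,
    and retry the backlog before each new send. 'deliver' is a
    hypothetical callable(entry_dict) that raises on failure."""

    def __init__(self, deliver, spool_path):
        self.deliver = deliver
        self.spool_path = spool_path  # local file used as the retention store

    def _spooled(self):
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def log(self, entry):
        backlog = self._spooled() + [entry]
        failed = []
        for item in backlog:
            try:
                self.deliver(item)
            except Exception:
                failed.append(item)  # keep it for the next attempt
        with open(self.spool_path, "w") as f:
            for item in failed:
                f.write(json.dumps(item) + "\n")
```

Even this toy version hints at the real difficulties: a production implementation also needs durable writes, corruption handling, disk quotas, and backoff, which is exactly why rolling your own is an uphill battle.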
You also must have an answer to the problems of authentication and security. Large orgs have multiple domains with various trust relations, employees come in via VPN or DirectAccess from home, some applications run unattended, some services are configured to run as local users, some machines are not joined to the domain, and so on. You had better have an answer to the question of how the logging module of each application, wherever it is deployed, is going to authenticate with the central repository (and which situations are going to be unsupported).
Ideally you would use an out-of-the-box delivery mechanism for your logging module. MSMQ is probably the most appropriate fit: robust, asynchronous, reliable delivery (at least to the extent of most use cases), and available on every Windows host once it is installed. That last point is the major pain point: MSMQ is an optional component, so your applications will take a dependency on a non-default OS component.
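MSMQ itself is Windows-specific, but the decoupling it buys you can be sketched in a language-neutral way: the application thread only enqueues, and a background worker forwards entries to the repository, so a slow or unavailable repository never blocks the app. The following is an in-process illustration only (a real queue like MSMQ adds durability across restarts); `forward` is a hypothetical delivery callable:

```python
import queue
import threading

class QueuedLogTransport:
    """Sketch of queue-decoupled delivery. The caller never blocks;
    a daemon thread drains the queue toward the repository."""

    def __init__(self, forward, maxsize=10000):
        self.forward = forward                 # hypothetical delivery callable
        self.q = queue.Queue(maxsize=maxsize)  # bounded: drop rather than block
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def log(self, entry):
        try:
            self.q.put_nowait(entry)           # never block the caller
        except queue.Full:
            pass                               # policy decision: drop on overload

    def _drain(self):
        while True:
            entry = self.q.get()
            if entry is None:                  # sentinel for shutdown
                break
            try:
                self.forward(entry)
            except Exception:
                pass                           # a durable queue retries here

    def close(self):
        self.q.put(None)
        self.worker.join()
```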
The central repository storage has to be able to deliver the information requested by, for example:
the application developers investigating incidents
customer support team investigating a lost transaction reported by a customer complaint
the security org doing forensics
the business managers demanding statistics, trends and aggregated info (BI).
The only storage capable of delivering this for any serious org (in size and lifetime) is a relational engine, so probably SQL Server. Doing analysis over text files is really not going to go the distance.
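As a small illustration of why a relational engine pays off for these consumers, here is a sketch using SQLite as a stand-in for SQL Server (the column set and index are illustrative, not a recommended schema). The final query is the kind of aggregate a BI consumer asks for, which is painful over flat text files:

```python
import sqlite3

# Illustrative central log table; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE log_events (
        id        INTEGER PRIMARY KEY,
        logged_at TEXT NOT NULL,
        host      TEXT NOT NULL,
        app       TEXT NOT NULL,
        level     TEXT NOT NULL,
        message   TEXT
    )""")
conn.execute("CREATE INDEX ix_log_app_time ON log_events (app, logged_at)")

rows = [
    ("2013-04-01T10:00:00", "web01", "billing", "ERROR", "timeout"),
    ("2013-04-01T10:05:00", "web02", "billing", "INFO",  "ok"),
    ("2013-04-01T10:06:00", "web01", "crm",     "ERROR", "null ref"),
]
conn.executemany(
    "INSERT INTO log_events (logged_at, host, app, level, message)"
    " VALUES (?,?,?,?,?)", rows)

# Errors per application: trivial in SQL, tedious over text files.
errors_per_app = dict(conn.execute(
    "SELECT app, COUNT(*) FROM log_events"
    " WHERE level='ERROR' GROUP BY app"))
```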
So I would recommend a messaging-based log transport/delivery (MSMQ) and a relational central repository (SQL Server), perhaps with an analytical component on top of it (Analysis Services Data Mining). As you can see, this is clearly no small feat, and it covers rather more than just configuring log4net.
As for what to log, you say you have already given it some thought, but I'd like to chime in with my extra 2c: oftentimes, especially during incident investigation, you will want the ability to request extra information. This means you would like to know the content of certain files on the incident machine, or some registry keys, or some performance counter values, or a full process dump. It is very useful to be able to request this information from the central repository interface, but it is impractical to always collect it just in case it is needed. This implies there has to be some sort of bidirectional communication between the application and the central repository: when the application reports an incident, it can be asked to add extra information (e.g. a dump of the process at fault). A lot of infrastructure has to be in place for something like this to work, from the protocol between the application logging and the central repository, to the ability of the central repository to recognize a repeat of an incident, to the capacity of the logging library to collect the extra information required, and not least the ability of an operator to mark incidents as needing extra information on their next occurrence.
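The handshake just described can be sketched in a few lines. All names here are hypothetical; the point is only the shape of the protocol: an operator flags an incident signature, and the repository's reply to the next report of that signature asks the client to collect more data:

```python
class CentralRepository:
    """Sketch of the repository side of the request-for-extra-info protocol."""

    def __init__(self):
        self.wants_extra = set()  # incident signatures flagged by an operator
        self.reports = []

    def mark_for_extra_info(self, signature):
        self.wants_extra.add(signature)

    def report(self, signature, payload):
        self.reports.append((signature, payload))
        # The reply tells the client what else to collect, if anything.
        if signature in self.wants_extra:
            return {"collect": ["process_dump"]}
        return {"collect": []}


class AppLogger:
    """Sketch of the client side: honor the repository's collect requests."""

    def __init__(self, repo):
        self.repo = repo

    def report_incident(self, signature):
        reply = self.repo.report(signature, {"basic": "stack trace"})
        for item in reply["collect"]:
            # Here the logging library would gather the dump, registry
            # keys, file contents, or performance counters requested.
            self.repo.report(signature, {"extra": item})
```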
I understand this answer probably seems overkill at the moment, but I have been involved with this problem space for quite a while. I looked at many online crash reports from Dr. Watson back in the day when I was with MS, and I can tell you that these requirements exist, they are valid concerns, and when implemented the solution helps tremendously. Ultimately, you can't fix what you cannot measure. A large organisation depends on good management and monitoring of its application stock, including logging and auditing.
There are some third-party vendors that offer solutions, some even integrated with log4net, like bugcollect.com (full disclosure: that's my own company), Error Traffic Controller, or Exceptioneer, among others.
SQL would work, but I've used Splunk to aggregate logs. I was able to find some surprising information thanks to the way Splunk lets you set up indexes on your data, and then used their query tools to make some nice graphs. You can download a basic version of it for free, too.
As the other responses have pointed out, the closest thing to an industry standard is syslog. But don't despair because you're living in a Windows world: Kiwi have a syslog daemon which runs on Windows, and it is free. Find out more.
Update: As @MichaelFreidgeim points out, Kiwi now charge for their syslog daemon. However, there are other free alternatives available. This other SO answer links to a couple of them.
As others have already pointed out, directing logs from a multitude of apps and hosts straight into a database isn't a good idea. I just wanted to add one more advantage of using a dedicated centralized log server: it decouples your apps from the log infrastructure. Since you're in .NET, there are a couple of good choices, log4net and NLog. Both are very good products, but I particularly like NLog: it proved to be a much better performer under heavier loads, has much better configuration options, and is actively maintained. Log4net, as far as I know, hasn't changed for a few years and has some issues, but it is still a very robust solution as well. Once you use such a framework, you control at the app level how, what, and when it transmits its logs to the centralized server. If at all.
Have a look at logFaces, which was built specifically for the situation you describe: to aggregate logs from a multitude of apps and hosts, providing centralized storage and a source for analysis and monitoring. It does all this non-intrusively, with zero changes to your existing code base. It will handle a massive load of apps and hosts and lets you specify what you want to do with the data. You also get a very nice GUI for monitoring in real time or digging into the data. You don't have to deal with databases directly at all, and there are many databases to choose from, both SQL and NoSQL. By the way, RDBMSs are not the best performers with very large data stores; logFaces can work with MongoDB, and this setup normally outperforms the best traditional RDBMS brands tenfold or so, particularly when used with capped collections.
Logstash + Elasticsearch + Kibana + Redis or RabbitMQ + NLog or Log4net
Storage + Search & Analytics: Elasticsearch
Collecting & Parsing: Logstash
Visualization: Kibana
Queue & Buffer: Redis
In Application: NLog
The 1024 byte Syslog message length limit mentioned so far is misleading and incorrectly biases against Syslog-based solutions to the problem.
The limit for the obsolete "BSD Syslog Protocol" is indeed 1024 bytes.
The BSD syslog Protocol - 4.1 syslog Message Parts
The limit for the modern "Syslog Protocol" is implementation-dependent but MUST be at least 480 bytes, SHOULD be at least 2048 bytes, and MAY be even higher.
The Syslog Protocol - 6.1. Message Length
As an example, Rsyslog's configuration setting is called MaxMessageSize, which the documentation suggests can be set at least as high as 64KB. rsyslog - Configuration Directives
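To make the limits concrete, here is a sketch that builds an RFC 5424-style message line and enforces a receiver's size limit by truncating, rather than assuming the old 1024-byte BSD cap. The header values are illustrative, and real implementations truncate more carefully (preserving the header and respecting UTF-8 boundaries):

```python
# <165>1 = PRI (facility*8 + severity) followed by protocol version 1.
PRI = "<165>1"

def syslog_line(timestamp, host, app, msg, max_len=2048):
    """Build a simplified RFC 5424-style line and cap it at max_len bytes.

    2048 is the SHOULD-support minimum from the modern protocol; a
    receiver like rsyslog may accept far more (e.g. 64KB).
    """
    line = f"{PRI} {timestamp} {host} {app} - - - {msg}"
    return line[:max_len] if len(line) > max_len else line

line = syslog_line("2013-04-01T10:00:00Z", "web01", "billing", "x" * 5000)
```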
That the asker's organisation is "a Microsoft house" where "UNIX solutions are no good" should not prevent less discriminatory readers from getting accurate information.
SQL would work, but I've used Splunk to aggregate logs. I was able to find some surprising information based on the way Splunk allows you to set up indexes on your data, and then use their query tools to make some nice graphs. You can download a basic version of it for free too.
On Unix, there's syslog.
Also, you might want to check out this case study.
If you have log4net log to the local Event Viewer, you can mine these logs on a Windows 2008 box; see this centralized auditing article. On that box, you can then easily import these events and provide some management and mining tools on top of them.
(For disclosure: I am the author of logFaces.)
If you're running on *nix machines, the traditional solution is syslog.