Managing a large number of log files distributed over multiple machines
We have started using a third-party platform (GigaSpaces) to help us with distributed computing. One of the major problems we are now trying to solve is how to manage our log files in this distributed environment. We currently have the following setup.
Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications, which use log4j and log to separate files. We also redirect stdout to a separate file to catch thread dumps and similar output.
This results in about 200 different log files.
As of now we have no tooling to assist in managing these files. In the following cases this causes us serious headaches.
- Troubleshooting when we do not know beforehand in which process the problem occurred. In this case we currently log into each machine using ssh and start using grep.
- Trying to be proactive by regularly checking the logs for anything out of the ordinary. In this case we also currently log in to all machines and look at the different logs using less and tail.
- Setting up alerts. We would like to set up alerts on events over a threshold, which looks to be a pain with 200 log files to check.
Today we have only about 5 log events per second, but that will increase as we migrate more and more code to the new platform.
I would like to ask the community the following questions.
- How have you handled similar cases, with many log files distributed over several machines and written through different logging frameworks?
- Why did you choose that particular solution?
- How did your solutions work out? What did you find good and what did you find bad?
Many thanks.
Update
We ended up evaluating a trial version of Splunk. We are very happy with how it works and have decided to purchase it. It is easy to set up, searches are fast, and there are a ton of features for the technically inclined. I can recommend that anyone in a similar situation check it out.
5 Answers
I would recommend piping all your Java logging to the Simple Logging Facade for Java (SLF4J) and then redirecting all logs from SLF4J to LogBack. SLF4J has special support for handling all the popular legacy APIs (log4j, commons-logging, java.util.logging, etc.); see here.
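As a rough sketch of the bridging step, assuming the jul-to-slf4j, log4j-over-slf4j and jcl-over-slf4j jars have replaced the corresponding legacy jars on the classpath (the class name below is made up for illustration):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.bridge.SLF4JBridgeHandler;

public class LoggingBootstrap {
    public static void main(String[] args) {
        // java.util.logging is the only bridge that needs explicit installation;
        // log4j-over-slf4j and jcl-over-slf4j work simply by replacing the legacy jars.
        SLF4JBridgeHandler.removeHandlersForRootLogger();
        SLF4JBridgeHandler.install();

        // From here on, JUL, log4j and commons-logging calls all flow into SLF4J,
        // which delegates to whatever backend is on the classpath (LogBack in this case).
        Logger log = LoggerFactory.getLogger(LoggingBootstrap.class);
        log.info("All legacy logging now flows through SLF4J");
    }
}
```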
Once you have your logs in LogBack you can use one of its many appenders to aggregate logs over several machines; for details, see the manual section about appenders. Socket, JMS and SMTP seem to be the most obvious candidates.
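For illustration, here is a programmatic sketch of the socket route, shipping every event from a JVM to one central collector. The host name "logcentral" and the port are placeholders, and in practice this would normally be declared in logback.xml rather than in code:

```java
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.net.SocketAppender;
import org.slf4j.LoggerFactory;

public class SocketForwarding {
    public static void configure() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        // Ship every event to a central machine running LogBack's SimpleSocketServer
        // (or a similar receiver) that writes the combined stream to disk.
        SocketAppender socket = new SocketAppender();
        socket.setContext(context);
        socket.setRemoteHost("logcentral"); // placeholder central log host
        socket.setPort(4560);               // arbitrary port; must match the receiver
        socket.start();

        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.addAppender(socket);
    }
}
```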
LogBack also has built-in support for monitoring for special conditions and for filtering the events sent to a particular appender. So you could set up an SMTP appender to send you an e-mail every time there is an ERROR-level event in the logs.
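A sketch of what that could look like programmatically, assuming the JavaMail dependency is on the classpath; the host name and addresses are placeholders, and an XML configuration would again be the more usual route:

```java
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.PatternLayout;
import ch.qos.logback.classic.net.SMTPAppender;
import org.slf4j.LoggerFactory;

public class ErrorMailConfig {
    public static void configure() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        PatternLayout layout = new PatternLayout();
        layout.setContext(context);
        layout.setPattern("%d %-5level [%thread] %logger - %msg%n");
        layout.start();

        // By default the classic SMTPAppender fires when an ERROR event arrives,
        // mailing a buffer of the preceding events along with it.
        SMTPAppender mail = new SMTPAppender();
        mail.setContext(context);
        mail.setSmtpHost("smtp.example.com");   // placeholder mail relay
        mail.setFrom("platform@example.com");   // placeholder addresses
        mail.addTo("oncall@example.com");
        mail.setSubject("Platform ERROR: %logger{20} - %m");
        mail.setLayout(layout);
        mail.start();

        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.addAppender(mail);
    }
}
```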
Finally, to ease troubleshooting, be sure to add some sort of requestID to all your incoming "requests"; see my answer to this question for details.
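One common way to carry such an ID is SLF4J's MDC. This is only a sketch; the requestId key and the handle() method are illustrative, not part of any existing code:

```java
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class RequestHandler {
    private static final Logger log = LoggerFactory.getLogger(RequestHandler.class);

    public void handle(Object request) {
        // Put the ID into the Mapped Diagnostic Context at the entry point of the request;
        // every log line on this thread can then include it via %X{requestId} in the pattern.
        MDC.put("requestId", UUID.randomUUID().toString());
        try {
            log.info("handling request");
            // ... actual processing ...
        } finally {
            MDC.remove("requestId");
        }
    }
}
```

With %X{requestId} added to the appenders' patterns, all lines belonging to one request can be grepped out of the aggregated logs regardless of which process wrote them.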
EDIT: you could also implement your own custom LogBack appender and redirect all logs to Scribe.
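The skeleton of such an appender is small. ScribeAppender and its nested ScribeClient interface below are hypothetical stand-ins for whatever Scribe/Thrift client you would actually wire in:

```java
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;

public class ScribeAppender extends AppenderBase<ILoggingEvent> {

    /** Stand-in for a real Scribe/Thrift client; swap in the actual client API. */
    public interface ScribeClient {
        void log(String category, String message);
    }

    private ScribeClient client;
    private String category = "platform-logs";

    public void setClient(ScribeClient client) { this.client = client; }
    public void setCategory(String category) { this.category = category; }

    @Override
    protected void append(ILoggingEvent event) {
        // AppenderBase has already checked that the appender is started and has
        // applied any attached filters before calling append().
        // A real implementation would also buffer, reconnect, and fall back to a
        // local file when the Scribe aggregator is unreachable.
        client.log(category, event.getFormattedMessage());
    }
}
```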
An interesting option to explore would be to run a Hadoop cluster on those nodes and write a custom MapReduce job for searching and aggregating results specific to your applications.
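Distributed grep is one of the textbook MapReduce examples, so a mapper along these lines would be a starting point; the "log.search.pattern" configuration key is just a name chosen for this sketch:

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Mapper for a distributed-grep style job over the collected log files. */
public class LogGrepMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Pattern pattern;

    @Override
    protected void setup(Context context) {
        // The regex to search for is passed in via the job configuration.
        pattern = Pattern.compile(
                context.getConfiguration().get("log.search.pattern", "ERROR"));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit every log line that matches, keyed by the pattern, so a simple
        // reducer can group or count the hits across all 200 files at once.
        if (pattern.matcher(line.toString()).find()) {
            context.write(new Text(pattern.pattern()), line);
        }
    }
}
```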
I'd suggest taking a look at a log aggregation tool like Splunk or Scribe.
(Also, I think this is more of a ServerFault question, as it has to do with administration of your app and its data, not so much with creating the app.)
The only piece of advice I can give you is to make sure you pass a transaction ID through your code and to make sure you log it whenever you log, so that you can later correlate the different calls.
I would transfer the files to a centralized machine and run an analysis mechanism on them. Maybe you could use a Hadoop cluster for that and run map/reduce jobs to do the analyzing, copying the files over to the Hadoop cluster every 5 minutes or so. I'm not sure if this fits your needs. In that regard, it might be a good idea to look at Scribe, as already mentioned.