How can I analyze logs in a distributed system?

Posted on 2025-02-06 15:58:42

When unexpected behavior occurs in a distributed system (such as Raft nodes), the logical flow of a request or of the data can usually only be analyzed through logs. However, because the system is distributed, this is difficult. I found tools such as ShiViz that can visualize requests or data flow from logs, but they require modifying the source code. Are there any other, similarly invasive tools?

Comments (1)

眼眸 2025-02-13 15:58:42

There are two major approaches. One is to have a tool that can go to every server and search its logs. The other is to have a central location for logs, with all nodes pushing their logs to that storage - this is how AWS CloudWatch works.
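As a rough illustration of the second approach, here is a minimal sketch (in Go) of a node-side agent that batches structured log entries and pushes them to a central collector. The endpoint URL and the JSON shape are assumptions for illustration, not any specific product's API.

```go
// Minimal sketch of a node-side log shipper. Each node buffers structured
// log entries and pushes a batch to a central collector over HTTP.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// LogEntry is a hypothetical structured log record; the fields are illustrative.
type LogEntry struct {
	Node    string    `json:"node"`
	Time    time.Time `json:"time"`
	Level   string    `json:"level"`
	Message string    `json:"message"`
}

func main() {
	batch := []LogEntry{
		{Node: "node-1", Time: time.Now(), Level: "INFO", Message: "became leader for term 7"},
	}

	body, err := json.Marshal(batch)
	if err != nil {
		log.Fatal(err)
	}

	// Push the batch to the central log store (hypothetical endpoint).
	resp, err := http.Post("http://logs.example.internal/ingest", "application/json", bytes.NewReader(body))
	if err != nil {
		// A real agent would retry and keep the unsent batch on disk.
		log.Printf("ship failed, will retry: %v", err)
		return
	}
	resp.Body.Close()
	log.Printf("shipped %d entries, status %s", len(batch), resp.Status)
}
```

With something like this in place, the operator queries the central store instead of logging into individual nodes.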

In either case, from the operator's point of view, there is a single tool where they can search all logs.

The second part of your question is how to make this analysis effective.

First of all, logs should be of good quality. This is a naive thing to say, but it is very important. I can't count how many times I have analyzed logs that were detailed but useless.

The second challenge is how to analyze processes that span several nodes. This is more complicated. There are two main problems here:

  1. How to find all logs related to the same "event" - e.g. let's say an API call results in 5 services being called - how can we trace this call across those services? The typical solution is to generate a unique request ID on the first service and then propagate this ID through all downstream services (see the request-ID sketch after this list).
  2. How to reassemble the order of calls across nodes. From a "theoretical" point of view, this problem is about total order - we need to be able to take any two log events and say which one happened first. Here we can't use wall-clock timestamps, because they are not accurate enough. Luckily for us there is a well-known and simple algorithm to handle this: the Lamport timestamp (see the Lamport clock sketch after this list). Of course, the developer has to add it to the code to make it work. It could live either in the service code or in the log agent code (the log agent is the tool that aggregates all logs). It is worth mentioning that total order may be overkill if your distributed system has a tree-like call structure, e.g. service A always receives requests from users and then calls services B and C - in that case carrying over the request ID is enough, as you already know the order. Total order is needed in cases like Raft, where it is not always clear who calls whom.
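For item 1, here is a minimal Go sketch of request-ID propagation over HTTP. The "X-Request-ID" header name, the service URLs, and the handler are hypothetical; the point is that the edge service mints an ID once, stamps it on every log line, and forwards it on all downstream calls.

```go
// Sketch of request-ID propagation between HTTP services.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

// newRequestID mints a random ID for requests that arrive without one.
func newRequestID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

func ordersHandler(w http.ResponseWriter, r *http.Request) {
	// Reuse the caller's ID if present, otherwise mint one at the edge.
	reqID := r.Header.Get("X-Request-ID")
	if reqID == "" {
		reqID = newRequestID()
	}

	// Every log line carries the ID, so logs from all services can be joined on it.
	log.Printf("request_id=%s msg=%q", reqID, "handling /orders")

	// Propagate the same ID on outgoing calls to downstream services.
	downstream, err := http.NewRequest("GET", "http://service-b.internal/check", nil)
	if err == nil {
		downstream.Header.Set("X-Request-ID", reqID)
		// http.DefaultClient.Do(downstream) would carry the ID to service B,
		// which repeats the same pattern for its own downstream calls.
	}

	w.Write([]byte(reqID))
}

func main() {
	http.HandleFunc("/orders", ordersHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```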
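For item 2, a minimal Lamport clock sketch in Go, under the assumption that every node stamps its log lines (and outgoing messages) with the clock value. Ties between nodes are usually broken by node ID to get a strict total order.

```go
// Minimal Lamport clock: increment on every local event, and on receiving a
// message take max(local, received) + 1. Stamping each log line with these
// values lets you reorder logs from different nodes consistently.
package main

import (
	"fmt"
	"sync"
)

type LamportClock struct {
	mu   sync.Mutex
	time uint64
}

// Tick advances the clock for a local event (including sending a message)
// and returns the value to put on the log line / outgoing message.
func (c *LamportClock) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.time++
	return c.time
}

// Observe merges a timestamp carried by an incoming message.
func (c *LamportClock) Observe(received uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if received > c.time {
		c.time = received
	}
	c.time++
	return c.time
}

func main() {
	var a, b LamportClock

	t1 := a.Tick()      // node A: local event, e.g. "send AppendEntries"
	t2 := b.Observe(t1) // node B: receives A's message carrying t1
	t3 := b.Tick()      // node B: a later local event

	fmt.Println(t1, t2, t3) // 1 2 3 - causally ordered across both nodes
}
```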