What are the requirements for an application health-monitoring system?

Posted 2024-07-05 13:28:15 · 241 words · 8 views · 0 comments

What, at a minimum, should an application health-monitoring system do for you (the developer) and/or your boss (the IT Manager) and/or the operations (on-call) staff?

What else should it do above the minimum requirements?

Is monitoring the 'infrastructure' applications (ms-exchange, apache, etc.) sufficient or do individual user applications, web sites, and databases also need to be monitored?

If the latter, what do you need to know about them?

ADDENDUM: Thanks for the input. I was really looking for application-level monitoring, not infrastructure monitoring, but it is good to know about both.

Comments (9)

情话墙 2024-07-12 13:28:15

Minimum: make sure it is running :)

However, some other things would be very useful: for example, the CPU load, RAM usage and (in multiuser systems) which user is running what. Also, for applications that access the network, a list of network connections for each app. And (if you have access to the client computers) it would be cool to be able to see the 'window title' of the app - maybe check every 2-3 minutes whether it has changed and save it. A list of files opened by the application could also be very useful, but it is not a must.

〆一缕阳光ご 2024-07-12 13:28:15

I think this is fairly simple - monitor so that you can be warned early enough before something goes wrong. That means monitor dependencies and the application itself.

It's really hard to provide specifics if you're not going to give details on the application you're monitoring, so I'd say use that as a general rule.

强辩 2024-07-12 13:28:15

This is such an open-ended question, but I would start with physical measurements.
1. Are all the machines I think are hosting this site pingable?
2. Are all the machines which should be serving content actually serving some content? (Ideally this would be hit from an external network.)
3. Is each expected service on each machine running?
3a. Have those services run recently?
4. Does each machine have hard drive space left? (Don't forget the db)
5. Have these machines been backed up? When was the last time?
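
Two of the checks in this list (free disk space, and whether an expected service is reachable) can be sketched with the Python standard library alone. This is a minimal illustration, not a complete monitor: the function names, the 10% free-space threshold, and the idea of treating "service is running" as "its TCP port accepts a connection" are all assumptions made for the example.

```python
import shutil
import socket


def disk_free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def physical_checks(path: str, host: str, port: int, min_free: float = 0.10) -> dict:
    """Run both checks and report each one as 'ok' or 'fail'."""
    return {
        "disk": "ok" if disk_free_fraction(path) >= min_free else "fail",
        "service": "ok" if port_is_open(host, port) else "fail",
    }
```

A scheduler could call `physical_checks("/", "127.0.0.1", 80)` every few minutes and alert on any "fail". As the list notes, a check that content is really being served is better run from an external network.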

Once the physical monitoring of the systems is laid out, one can address the questions that are specific to a particular system:

1. Can an automated script log in? How long did it take?
2. How many users are live? Have there been a million fake accounts added?
...
These sorts of questions get more nebulous and can be very system-specific. They can also usually be derived reactively when responding to physical measurements. The hard drive filled up? Maybe the web server logs filled it because a bunch of agents created too many fake users. That kind of thing.

While plan A shouldn't necessarily be reactive, that is the way many sites set up a monitoring system.
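
The "can an automated script log in, and how long did it take?" question can be sketched as a timed probe. The names `timed_check` and `CheckResult` and the five-second threshold are hypothetical; the real login step would be whatever callable exercises your own system.

```python
import time
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    ok: bool
    seconds: float
    detail: str


def timed_check(action: Callable[[], bool], slow_after: float = 5.0) -> CheckResult:
    """Run `action`, time it, and flag failures, exceptions, and slow responses."""
    start = time.perf_counter()
    try:
        succeeded = action()
    except Exception as exc:  # a crash in the probe counts as a failed check
        return CheckResult(False, time.perf_counter() - start, f"error: {exc}")
    elapsed = time.perf_counter() - start
    if not succeeded:
        return CheckResult(False, elapsed, "action reported failure")
    if elapsed > slow_after:
        return CheckResult(False, elapsed, "too slow")
    return CheckResult(True, elapsed, "ok")
```

Treating a slow login as a failure is a design choice: a login that takes a minute is usually as bad for users as one that does not work at all.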

╄→承喏 2024-07-12 13:28:15

Great question.

Some time ago we were looking for an application-level monitoring solution for our needs, without any luck. Popular monitoring solutions are mostly aimed at monitoring infrastructure and, in my opinion, they are too complicated for the requirements of most small and mid-sized companies.

We required (mainly) the following features:

  • alerts - we wanted to know about an incident as fast as possible
  • painless management - a hosted service would be best
  • visualizations - it's good to know what is going on and to learn something from the data

Because we didn't find a suitable solution, we started to write our own. We ended up with an up-and-running service called AlertGrid. (You can try it for free, of course.)

The idea behind it is to provide an easy way to handle custom monitoring scenarios. The integration API is very simple (one function with two required parameters). At the moment we and others are using it for:

  • monitoring scheduled tasks (cron jobs)
  • monitoring execution of the entire application logic
  • alerting on errors in applications
  • we are also working on examples of basic infrastructure monitoring using AlertGrid
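
Monitoring scheduled tasks is commonly done with a heartbeat: each job reports in when it finishes, and the monitor flags any job that has been silent for too long. The sketch below is a generic, in-memory illustration of that idea. It is not AlertGrid's API; the class and method names are hypothetical.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Track when each scheduled job last reported in and flag the overdue ones."""

    def __init__(self) -> None:
        self._expected: Dict[str, float] = {}   # job name -> max seconds between beats
        self._last_seen: Dict[str, float] = {}  # job name -> time of the last beat

    def register(self, job: str, max_interval: float, now: Optional[float] = None) -> None:
        """Start watching a job; registration counts as its first beat."""
        self._expected[job] = max_interval
        self._last_seen[job] = time.time() if now is None else now

    def beat(self, job: str, now: Optional[float] = None) -> None:
        """Called by the job itself each time it finishes a run."""
        self._last_seen[job] = time.time() if now is None else now

    def overdue(self, now: Optional[float] = None) -> List[str]:
        """Return the names of jobs that have been silent longer than allowed."""
        current = time.time() if now is None else now
        return sorted(
            job for job, limit in self._expected.items()
            if current - self._last_seen[job] > limit
        )
```

The optional `now` argument exists only so the logic can be exercised without waiting; in normal use it is left out and the wall clock is used.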

邮友 2024-07-12 13:28:15

  • Whether the application is running.
  • Unusual cpu/memory/network usage.
  • Report any unhandled exceptions.
  • Status of various modules (if applicable).
  • Status of external components (databases, webservices, fileservers, etc.)
  • Number of pending background tasks (if applicable).
  • Maybe track usage of the application and report statistics on most/less used functionalities so you know where optimizations are most beneficial.
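
Several items in this list (is the application running, status of external components, number of pending background tasks) fit a simple pattern: a set of named checks rolled up into one report. The following is a rough sketch under that assumption; `health_report` and `overall_status` are illustrative names, and the individual checks are placeholders for whatever your application can actually answer.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every named check; a check that raises is reported, not propagated."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "fail"
        except Exception as exc:
            report[name] = f"error: {type(exc).__name__}"
    return report


def overall_status(report: Dict[str, str]) -> str:
    """Collapse the per-check results into a single status."""
    return "healthy" if all(v == "ok" for v in report.values()) else "degraded"
```

Catching exceptions inside the report matters here: a database check that throws should show up as one bad entry instead of taking down the health endpoint itself.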

萌吟 2024-07-12 13:28:15

The answer is 'it depends'. Why do you need to monitor? How large is your operations staff? Do you need reporting? What is the application environment? Who cares if the application fails? Who cares if an exception happens? Are any of the errors recoverable? I could ask questions like these for a long time.

命比纸薄 2024-07-12 13:28:15

At a minimum you want to know that the system is healthy. What counts as healthy is subjective: are the computers up, do the needed resources exist, is data flowing through the system, is the data properly producing results, and so on.

In my project we monitor most of this and then some. It really comes down to the highest level at which you can verify that everything is working. In our case we need to know all the way down to the data output. If you only need to know down to the "are these machines up" level, it saves you from trying to show an inexperienced end user what is wrong.

There are also "off the shelf" tools that will do a lot of the hard work for you if you are just looking too hard into data results. I particularly liked Nagios when I was looking around, but we needed more than it could easily show, so I wrote our own monitoring system. Basically we also watch for "peculiarities" in the system, memory/CPU spikes, etc.

奢欲 2024-07-12 13:28:15

What you need to do is break down the business process of the application and then have the software emit events at the major business components. In addition, you'll need to create end-to-end synthetic transactions (e.g. emulating end users clicking on a website). All that data would be fed into a monitoring tool. In the past, I've done JMX for applications, which flowed into Tivoli Monitoring's JMX Adapter, and I've written scripts that implement a "fake user" and pipe the results into Tivoli Monitoring's Script Adapter. Tivoli Monitoring takes the data and creates application health and performance charts from that raw data.
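
The "emit events at major business components" idea can be illustrated without JMX or Tivoli. The sketch below is in Python and simply appends events to a list; in practice the sink would be whatever feeds your monitoring tool. The name `business_step` and the event fields are assumptions made for the example.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def business_step(name: str, sink: List[Dict]) -> Iterator[None]:
    """Append one event to `sink` describing how the wrapped step went."""
    start = time.perf_counter()
    event = {"step": name, "status": "ok"}
    try:
        yield
    except Exception as exc:
        event["status"] = "failed"
        event["error"] = str(exc)
        raise  # the application still sees its own exception
    finally:
        event["seconds"] = time.perf_counter() - start
        sink.append(event)
```

Wrapping a step as `with business_step("place_order", events): ...` records its outcome and duration whether it succeeds or raises, which is the raw data a tool would need to chart application health and performance.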

给不了的爱 2024-07-12 13:28:15

Thanks, everyone, for the input. I was really looking for application-level monitoring, not infrastructure monitoring, but it is good to know about both.

The difference is:

  • infrastructure monitoring would be servers plus MS Exchange Server, Apache, IIS, and so forth
  • application monitoring would be user machines and the specific programs that they use to do their jobs, and/or servers plus the data-moving/backend applications that they run to keep the data flowing

Sometimes it's hard to draw the line - an oversimplified definition might be "if your team wrote it, it's an application; if you bought it, it's infrastructure".

I think in practice it is best to monitor both.
