How to monitor log files with Nagios

Posted on 2024-08-24 11:15:58

We are using Nagios to monitor our network with great success. However, we have a syslog for critical application errors, and although I set up check_log, it doesn't seem to work as well as monitoring a device.

The issues are:

  • It only shows the last entry
  • There doesn't seem to be a way to acknowledge the critical error and
    return the monitor to a good state

Is Nagios the wrong tool, or are we just not setting up the service monitoring correctly?

Here are my entries

# log file
define command{
        command_name    check_log
        command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}


# Define the log monitoring service
define service{
        name                            logfile-check           ;
        use                             generic-service         ;
        check_period                    24x7                    ;
        max_check_attempts              1                       ;
        normal_check_interval           5                       ;
        retry_check_interval            1                       ;
        contact_groups                  admins                  ;
        notification_options            w,u,c,r                 ;
        notification_period             24x7                    ;
        register                        0                       ;
        }

define service{
        use                             logfile-check
        host_name                       localhost
        service_description             CritLogFile
        check_command                   check_log
}

Comments (6)

吃不饱 2024-08-31 11:15:58

For monitoring logs with Nagios, typically the log checker will return a warning only for newly discovered error messages each time it is invoked (so it must retain some state in order to know to ignore them on subsequent runs). Therefore I usually set:

max_check_attempts              1
is_volatile                     1

This causes Nagios to send out the alert immediately, but only once, and then go back to normal.
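
The "retain some state" idea can be sketched in a few lines of Python. This is only an illustration, not the code of logwarn or check_log: the function name `check_new_lines`, the offset file, and the `ERROR` pattern are all hypothetical. The only Nagios-specific facts used are the standard plugin exit codes (0 for OK, 2 for CRITICAL).

```python
import os
import re
import tempfile

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_new_lines(log_path, state_path, pattern):
    """Scan only the part of the log written since the previous run."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as fh:
            offset = int(fh.read().strip() or 0)
    if offset > os.path.getsize(log_path):
        offset = 0  # the log shrank (truncated or rotated), so start over

    with open(log_path, "rb") as fh:
        fh.seek(offset)
        new_data = fh.read()
        new_offset = fh.tell()

    # Remember how far we got, so the next run ignores these lines.
    with open(state_path, "w") as fh:
        fh.write(str(new_offset))

    lines = new_data.decode(errors="replace").splitlines()
    hits = [line for line in lines if re.search(pattern, line)]
    if hits:
        return CRITICAL, f"CRITICAL - {len(hits)} new match(es), last: {hits[-1]}"
    return OK, "OK - no new matches"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "appcrit.log")
        state = os.path.join(tmp, "appcrit.offset")
        with open(log, "w") as fh:
            fh.write("startup complete\n")
        print(check_new_lines(log, state, "ERROR"))  # nothing matches yet
        with open(log, "a") as fh:
            fh.write("ERROR disk full\n")
        print(check_new_lines(log, state, "ERROR"))  # the new line is reported
        print(check_new_lines(log, state, "ERROR"))  # same line is not reported again
```

Because a checker like this reports each error exactly once, the service returns to OK on the following check. Marking the service volatile with a single check attempt makes Nagios notify on every one of those one-off non-OK results.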

My favorite log checker is logwarn, but I'm biased because I wrote it myself after not finding any existing ones that I liked. The logwarn package includes a Nagios plugin.

爱的那么颓废 2024-08-31 11:15:58

Nothing in your config jumps out at me as being misconfigured.

By design, check_log will only show either an OK message, or the last log entry that triggered an alert. If you need to see multiple entries, you'll need to modify the plugin.

However, I find it somewhat odd that you're not getting recoveries. Given the way check_log works (by comparing the current log to the previous version), you should get a recovery on the very next service check, unless additional matching entries have been added to the log since the last check.

Does forcing another service check (or several) cause it to recover?

Also, I don't intend this in a mean way, but make sure it's really malfunctioning.
Is your log getting additional matching entries in between checks, causing it not to recover? Your check is matching "?" which will match anything new in the log. Is something else (a non-error) being added to the log and inadvertently causing a match?

If none of the above is the issue, I would suggest narrowing it down by taking Nagios out of the equation. Try running check_log manually (from the command line, but as the same user as Nagios), and with a different oldlog. It should go something like this:

  1. run check with a new "oldlog" - get initialization message
  2. run check - check OK
  3. make change to log
  4. run check - check fails
  5. run check - check OK

If this doesn't work, then you know to focus on the log, the oldlog, and how check_log is doing the check.

If it works, then it points more towards a problem with your Nagios configuration.
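
The five steps above can be scripted. This is a rough sketch under stated assumptions: the plugin path in `PLUGIN` is a guess at where `$USER1$` points on a typical install, the `ERROR` query and the scratch file names are hypothetical, and only the `-F`, `-O` and `-q` flags come from the command definition in the question. It works on a scratch log in a temporary directory so the real log is left untouched. Run it as the same user Nagios runs as.

```python
import os
import subprocess
import tempfile

# Assumption: adjust to wherever $USER1$ points on your system.
PLUGIN = "/usr/local/nagios/libexec/check_log"


def build_command(plugin, log_path, oldlog_path, query):
    """Assemble the flags used in the question's command definition."""
    return [plugin, "-F", log_path, "-O", oldlog_path, "-q", query]


def run_check(cmd):
    """Run the plugin once and return (exit code, first line of output)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    return result.returncode, lines[0] if lines else ""


def walk_through(plugin=PLUGIN, query="ERROR"):
    """Steps 1-5 against a scratch log with a fresh oldlog."""
    if not os.access(plugin, os.X_OK):
        print(f"plugin not found or not executable: {plugin}")
        return
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "scratch.log")
        oldlog = os.path.join(tmp, "scratch.oldlog")
        with open(log, "w") as fh:
            fh.write("startup complete\n")
        cmd = build_command(plugin, log, oldlog, query)
        print("1. new oldlog:", run_check(cmd))  # initialization message expected
        print("2. no change :", run_check(cmd))  # OK expected
        with open(log, "a") as fh:               # 3. make a change to the log
            fh.write("ERROR simulated failure\n")
        print("4. new entry :", run_check(cmd))  # failure expected
        print("5. no change :", run_check(cmd))  # OK expected


if __name__ == "__main__":
    walk_through()
```

If step 5 does not come back OK here, the problem lies with the plugin or the files it reads rather than with the Nagios service definition.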

怂人 2024-08-31 11:15:58

There is a Nagios plugin that you can use to check the log files: it's called check_logfiles and it's used to scan the lines of a file for regular expressions.

The following link shows how to install and configure check_logfiles for Nagios and Opsview:
https://www.opsview.com/resources/nagios-alternative/blog/syslog-monitoring-nagios-opsview

澜川若宁 2024-08-31 11:15:58

As there are many ways to achieve a goal, there is also a nice plugin from Consol available:
https://labs.consol.de/lang/en/nagios/check_logfiles/

  • supports regex
  • supports log rotation

To use it, you need a cfg file. This is an example for Oracle databases:

@searches = ({
  tag => 'oraalerts',
  options => 'sticky=28800',
  logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
  criticalpatterns => [
      'ORA\-0*204[^\d]',        # error in reading control file
      'ORA\-0*206[^\d]',        # error in writing control file
      'ORA\-0*210[^\d]',        # cannot open control file
      'ORA\-0*257[^\d]',        # archiver is stuck
      'ORA\-0*333[^\d]',        # redo log read error
      'ORA\-0*345[^\d]',        # redo log write error
      'ORA\-0*4[4-7][0-9][^\d]',# ORA-0440 - ORA-0485 background process failure
      'ORA\-0*48[0-5][^\d]',
      'ORA\-0*6[0-3][0-9][^\d]',# ORA-0600 - ORA-0639 internal errors
      'ORA\-0*1114[^\d]',        # datafile I/O write error
      'ORA\-0*1115[^\d]',        # datafile I/O read error
      'ORA\-0*1116[^\d]',        # cannot open datafile
      'ORA\-0*1118[^\d]',        # cannot add a data file
      'ORA\-0*1122[^\d]',       # database file 16 failed verification check
      'ORA\-0*1171[^\d]',       # datafile 16 going offline due to error advancing checkpoint
      'ORA\-0*1201[^\d]',       # file 16 header failed to write correctly
      'ORA\-0*1208[^\d]',       # data file is an old version - not accessing current version
      'ORA\-0*1578[^\d]',        # data block corruption
      'ORA\-0*1135[^\d]',        # file accessed for query is offline
      'ORA\-0*1547[^\d]',        # tablespace is full
      'ORA\-0*1555[^\d]',        # snapshot too old
      'ORA\-0*1562[^\d]',        # failed to extend rollback segment
      'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
      'ORA\-0*163[0-2][^\d]',
      'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
      'ORA\-16014[^\d]',      # log cannot be archived, no available destinations
      'ORA\-16038[^\d]',      # log cannot be archived
      'ORA\-19502[^\d]',      # write error on datafile
      'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
      'ORA\-0*4031[^\d]',        # out of shared memory.
      'No space left on device',
      'Archival Error',
  ],
  warningpatterns => [
      'ORA\-0*3113[^\d]',        # end of file on communication channel
      'ORA\-0*6501[^\d]',         # PL/SQL internal error
      'ORA\-0*1140[^\d]',         # follows WARNING: datafile #20 was not in online backup mode
      'Archival stopped, error occurred. Will continue retrying',
  ]
});
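
To see what these patterns do and do not match, they can be tried against sample lines. The sketch below uses Python's `re` module purely for illustration (check_logfiles itself is written in Perl); the two patterns are copied from the list above, while the sample log lines and the helper name `matches` are made up.

```python
import re

# Copied verbatim from the criticalpatterns list.
CONTROL_FILE_READ = r'ORA\-0*204[^\d]'
INTERNAL_ERRORS = r'ORA\-0*6[0-3][0-9][^\d]'


def matches(pattern, line):
    """True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None


# "0*" absorbs any zero padding, so both spellings of the code match.
assert matches(CONTROL_FILE_READ, "ORA-00204: error in reading control file")
assert matches(CONTROL_FILE_READ, "ORA-204: error in reading control file")

# The trailing [^\d] stops ORA-204 from also matching a longer code.
assert not matches(CONTROL_FILE_READ, "ORA-02040: some other error")

# [^\d] needs one more character, so a code at the very end of the
# string, with nothing after it, does not match.
assert not matches(CONTROL_FILE_READ, "ORA-00204")

# The internal-errors pattern covers 600 through 639, not 6000.
assert matches(INTERNAL_ERRORS, "ORA-00600: internal error code")
assert not matches(INTERNAL_ERRORS, "ORA-06000: some other error")
assert not matches(INTERNAL_ERRORS, "ORA-00640: some other error")
```

Checking a new pattern this way before adding it to the cfg file is a cheap guard against alerts that never fire or that fire on the wrong code.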

凉城 2024-08-31 11:15:58

I believe there's now a real Nagios plugin that monitors logs effectively.

http://support.nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088

The home page of the Nagios plugin on that page is Nagios Log Monitor

Your [ commands.cfg file ] will contain:

define command {
        command_name    NagiosLogMonitor
        command_line    $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}


OR


define command {
        command_name    NagiosLogMonitor
        command_line    $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}




Your [ services.cfg file ] will look similar to:

define service {
        check_command           NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
        max_check_attempts      1
        service_description     500_ERRORS_LOGCHECK
        host_name               sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
        use                     fifteen-minute-interval
 }
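
How the `!`-separated values in check_command end up in the `$ARGn$` macros can be sketched as follows. This is a simplified model rather than Nagios's own code: the function `expand_command` is hypothetical, the `$USER1$` value is an assumed plugin directory, unset macros are simply replaced with an empty string, and the command line is assumed to read `'$ARG5$' '$ARG6$'` with both macros quoted.

```python
import re


def expand_command(command_line, check_command, macros=None):
    """Split check_command on '!' and substitute $NAME$ macros."""
    name, *args = check_command.split("!")
    values = dict(macros or {})
    for position, arg in enumerate(args, start=1):
        values[f"ARG{position}"] = arg

    def substitute(match):
        # Assumption for this sketch: unknown macros become empty strings.
        return values.get(match.group(1), "")

    return name, re.sub(r"\$(\w+)\$", substitute, command_line)


if __name__ == "__main__":
    command_line = (
        "$USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ "
        "'$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$"
    )
    check_command = (
        "NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!"
        "500.html!500 Internal Server Error!1!2!-foundn"
    )
    macros = {"USER1": "/usr/local/nagios/libexec", "HOSTNAME": "sky.blat-01.net"}
    name, expanded = expand_command(command_line, check_command, macros)
    print(name)
    print(expanded)
```

The sixth argument contains spaces, which is why its macro is wrapped in single quotes: without them the shell would split `500 Internal Server Error` into three separate arguments.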

凉城已无爱 2024-08-31 11:15:58

Nagios now has a solution that integrates tightly with Nagios Core, XI, etc.

Nagios Log Server can alert on any query on any log file on any system in your infrastructure.
