如何使用 Nagios 监控日志文件

发布于 2024-08-24 11:15:58 字数 1372 浏览 14 评论 0原文

我们使用 Nagios 来监控我们的网络并取得了巨大成功。然而，我们有一个用于关键应用程序错误的系统日志，当我设置 check_log 时，它似乎不能很好地监控设备。

问题是：

它只显示最后一个条目
似乎没有办法确认严重错误并且将监视器恢复到良好状态

nagios 是否是错误的工具，或者我们只是没有正确设置服务监视？

这是我的参赛作品

# log file
define command{
        command_name    check_log
        command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}


# Define the log monitering service
define service{
        name                            logfile-check           ;
        use                             generic-service         ;
        check_period                    24x7                    ;
        max_check_attempts              1                       ;
        normal_check_interval           5                       ;
        retry_check_interval            1                       ;
        contact_groups                  admins                  ;
        notification_options            w,u,c,r                 ;
        notification_period             24x7                    ;
        register                        0                       ;
        }

define service{
        use                             logfile-check
        host_name                       localhost
        service_description             CritLogFile
        check_command                   check_log
}

原文

We are using Nagios to monitor our network with great success. However, we have a syslog for critical application errors and while I set up check_log, it doesn't seem to work as well as monitering a device.

The issues are:

It only shows the last entry
There doesn't seem to be a way to acknowledge the critical error and
return the monitor to a good state

Is nagios the wrong tool, or are we just not setting up the service monitering right?

Here are my entries

# log file
define command{
        command_name    check_log
        command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}


# Define the log monitering service
define service{
        name                            logfile-check           ;
        use                             generic-service         ;
        check_period                    24x7                    ;
        max_check_attempts              1                       ;
        normal_check_interval           5                       ;
        retry_check_interval            1                       ;
        contact_groups                  admins                  ;
        notification_options            w,u,c,r                 ;
        notification_period             24x7                    ;
        register                        0                       ;
        }

define service{
        use                             logfile-check
        host_name                       localhost
        service_description             CritLogFile
        check_command                   check_log
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

吃不饱 2024-08-31 11:15:58

对于使用 Nagios 监视日志，日志检查器通常只会在每次调用时针对新发现的错误消息返回警告（因此它必须保留某些状态，以便知道在后续运行中忽略它们）。因此我通常设置：

max_check_attempts              1
is_volatile                     1

这会导致Nagios立即发出警报，但仅一次，然后恢复正常。

我最喜欢的日志检查器是 logwarn，但我有偏见，因为我在没有找到任何现有日志后自己编写了它我喜欢的。 logwarn 包包含一个 Nagios 插件。

For monitoring logs with Nagios, typically the log checker will return a warning only for newly discovered error messages each time it is invoked (so it must retain some state in order to know to ignore them on subsequent runs). Therefore I usually set:

max_check_attempts              1
is_volatile                     1

This causes Nagios to send out the alert immeidately, but only once, and then go back to normal.

My favorite log checker is logwarn, but I'm biased because I wrote it myself after not finding any existing ones that I liked. The logwarn package includes a Nagios plugin.

回复收藏 0 原文

爱的那么颓废 2024-08-31 11:15:58

你的配置中没有任何内容让我觉得配置错误。

根据设计，check_log 将仅显示 OK 消息或触发警报的最后一个日志条目。如果您需要查看多个条目，则需要修改插件。

然而，我发现你没有得到康复的事实有点奇怪。 check_log 的工作方式（通过将当前日志与以前的版本进行比较），您应该在下一次服务检查时得到恢复。当然，自上次检查以来日志中添加了其他匹配条目的情况除外。

强制执行另一项（或多项）服务检查是否会导致其恢复？

另外，我并不是有意这样做，但要确保它确实发生了故障。
您的日志是否在检查之间获取了额外的匹配条目，导致其无法恢复？您的支票匹配“？”它将匹配日志中的任何新内容。是否有其他内容（非错误）被添加到日志中并无意中导致匹配？

如果以上都不是问题，我建议通过将 Nagios 排除在外来缩小范围。尝试手动运行 check_log （从命令行，但使用与 nagios 相同的用户），并使用不同的 oldlog。它应该是这样的 -

使用新的“oldlog”运行检查 - 获取初始化消息
运行检查 - 检查确定
更改日志
运行检查 - 检查失败
运行检查 - 检查确定

如果这不起作用，那么你知道要关注关于日志、oldlog 以及 check_log 如何进行检查。

如果它有效，那么它更多地表明您的 nagios 配置存在问题。

回复收藏 0 原文

怂人 2024-08-31 11:15:58

有一个 Nagios 插件可用于检查日志文件：它名为 check_logfiles 和它用于扫描文件行中的正则表达式。

以下链接显示了如何为 Nagios 和 Opsview 安装和配置 check_logfiles：
https://www.opsview.com/resources/ nagios-alternative/博客/syslog-monitoring-nagios-opsview

回复收藏 0 原文

澜川若宁 2024-08-31 11:15:58

由于实现目标的方法有很多，Consol 还提供了一个不错的插件：
https://labs.consol.de/lang/en/nagios/check_logfiles/

支持正则表达式
支持日志轮转

要使用它，你需要一个cfg文件，这是oracle数据库的示例

@searches = ({
  tag => 'oraalerts',
options => 'sticky=28800',
  logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
  criticalpatterns => [
      'ORA\-0*204[^\d]',        # error in reading control file
      'ORA\-0*206[^\d]',        # error in writing control file
      'ORA\-0*210[^\d]',        # cannot open control file
      'ORA\-0*257[^\d]',        # archiver is stuck
      'ORA\-0*333[^\d]',        # redo log read error
      'ORA\-0*345[^\d]',        # redo log write error
      'ORA\-0*4[4-7][0-9][^\d]',# ORA-0440 - ORA-0485 background process failure
      'ORA\-0*48[0-5][^\d]',
      'ORA\-0*6[0-3][0-9][^\d]',# ORA-6000 - ORA-0639 internal errors
      'ORA\-0*1114[^\d]',        # datafile I/O write error
      'ORA\-0*1115[^\d]',        # datafile I/O read error
      'ORA\-0*1116[^\d]',        # cannot open datafile
      'ORA\-0*1118[^\d]',        # cannot add a data file
      'ORA\-0*1122[^\d]',       # database file 16 failed verification check
      'ORA\-0*1171[^\d]',       # datafile 16 going offline due to error advancing checkpoint
      'ORA\-0*1201[^\d]',       # file 16 header failed to write correctly
      'ORA\-0*1208[^\d]',       # data file is an old version - not accessing current version
      'ORA\-0*1578[^\d]',        # data block corruption
      'ORA\-0*1135[^\d]',        # file accessed for query is offline
      'ORA\-0*1547[^\d]',        # tablespace is full
      'ORA\-0*1555[^\d]',        # snapshot too old
      'ORA\-0*1562[^\d]',        # failed to extend rollback segment
      'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
      'ORA\-0*163[0-2][^\d]',
      'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
      'ORA\-16014[^\d]',      # log cannot be archived, no available destinations
      'ORA\-16038[^\d]',      # log cannot be archived
      'ORA\-19502[^\d]',      # write error on datafile
      'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
      'ORA\-0*4031[^\d]',        # out of shared memory.
      'No space left on device',
      'Archival Error',
  ],
  warningpatterns => [
      'ORA\-0*3113[^\d]',        # end of file on communication channel
      'ORA\-0*6501[^\d]',         # PL/SQL internal error
      'ORA\-0*1140[^\d]',         # follows WARNING: datafile #20 was not in online backup mode
      'Archival stopped, error occurred. Will continue retrying',
  ]
});

As there are many ways to achieve a goal, there is also a nice plugin from Consol available:
https://labs.consol.de/lang/en/nagios/check_logfiles/

supports regex
supports log rotation

To use it, you need a cfg file, this is an example for oracle databases

@searches = ({
  tag => 'oraalerts',
options => 'sticky=28800',
  logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
  criticalpatterns => [
      'ORA\-0*204[^\d]',        # error in reading control file
      'ORA\-0*206[^\d]',        # error in writing control file
      'ORA\-0*210[^\d]',        # cannot open control file
      'ORA\-0*257[^\d]',        # archiver is stuck
      'ORA\-0*333[^\d]',        # redo log read error
      'ORA\-0*345[^\d]',        # redo log write error
      'ORA\-0*4[4-7][0-9][^\d]',# ORA-0440 - ORA-0485 background process failure
      'ORA\-0*48[0-5][^\d]',
      'ORA\-0*6[0-3][0-9][^\d]',# ORA-6000 - ORA-0639 internal errors
      'ORA\-0*1114[^\d]',        # datafile I/O write error
      'ORA\-0*1115[^\d]',        # datafile I/O read error
      'ORA\-0*1116[^\d]',        # cannot open datafile
      'ORA\-0*1118[^\d]',        # cannot add a data file
      'ORA\-0*1122[^\d]',       # database file 16 failed verification check
      'ORA\-0*1171[^\d]',       # datafile 16 going offline due to error advancing checkpoint
      'ORA\-0*1201[^\d]',       # file 16 header failed to write correctly
      'ORA\-0*1208[^\d]',       # data file is an old version - not accessing current version
      'ORA\-0*1578[^\d]',        # data block corruption
      'ORA\-0*1135[^\d]',        # file accessed for query is offline
      'ORA\-0*1547[^\d]',        # tablespace is full
      'ORA\-0*1555[^\d]',        # snapshot too old
      'ORA\-0*1562[^\d]',        # failed to extend rollback segment
      'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
      'ORA\-0*163[0-2][^\d]',
      'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
      'ORA\-16014[^\d]',      # log cannot be archived, no available destinations
      'ORA\-16038[^\d]',      # log cannot be archived
      'ORA\-19502[^\d]',      # write error on datafile
      'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
      'ORA\-0*4031[^\d]',        # out of shared memory.
      'No space left on device',
      'Archival Error',
  ],
  warningpatterns => [
      'ORA\-0*3113[^\d]',        # end of file on communication channel
      'ORA\-0*6501[^\d]',         # PL/SQL internal error
      'ORA\-0*1140[^\d]',         # follows WARNING: datafile #20 was not in online backup mode
      'Archival stopped, error occurred. Will continue retrying',
  ]
});

回复收藏 0 原文

凉城 2024-08-31 11:15:58

我相信现在有一个真正的 Nagios 插件可以有效地监控日志。

http://support .nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088

该页面上 Nagios 插件的主页是 Nagios 日志监控

Your [ commands.cfg file ] will contain:

define command {
                            command_name         NagiosLogMonitor
                            command_line            $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5
 '$ARG6
 $ARG7$ $ARG8$ $ARG9$ $ARG10$
}


OR


define command {
                            command_name         NagiosLogMonitor
                            command_line            $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5
 '$ARG6
 $ARG7$ $ARG8$ $ARG9$ $ARG10$
}




Your [ services.cfg file ] will look similar to:

define service {
                      check_command                         NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
                      max_check_attempts                  1
                      service_description                     500_ERRORS_LOGCHECK
                      host_name                                  sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
                      use                                              fifteen-minute-interval
 }

I believe there's now a real Nagios plugin that monitors logs effectively.

http://support.nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088

The home page of the Nagios plugin on that page is Nagios Log Monitor

Your [ commands.cfg file ] will contain:

define command {
                            command_name         NagiosLogMonitor
                            command_line            $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5
 '$ARG6
 $ARG7$ $ARG8$ $ARG9$ $ARG10$
}


OR


define command {
                            command_name         NagiosLogMonitor
                            command_line            $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5
 '$ARG6
 $ARG7$ $ARG8$ $ARG9$ $ARG10$
}




Your [ services.cfg file ] will look similar to:

define service {
                      check_command                         NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
                      max_check_attempts                  1
                      service_description                     500_ERRORS_LOGCHECK
                      host_name                                  sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
                      use                                              fifteen-minute-interval
 }

回复收藏 0 原文