NodeJS CheckList

Published 2024-04-08 23:57:35

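A drunk was pacing under a streetlight in the middle of the night. A puzzled passerby asked him: "What are you looking for under the streetlight?" The drunk answered: "I'm looking for my key." The passerby was even more puzzled: "Why look for a key under the streetlight?" The drunk said: "Because this is where the light is brightest!"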

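When asked how to check the state of a server, many server-side developers only know the top command. Their situation is the same as in the joke above: for them, top is the brightest streetlight.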

For server-side programmers, the complete server checklist to start with is the USE method described in Chapter 2 of 《性能之巅》 (Systems Performance):

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.

This is an example USE-based metric list for Linux operating systems (eg, Ubuntu, CentOS, Fedora). This is primarily intended for system administrators of the physical systems, who are using command line tools. Some of these metrics can be found in remote monitoring tools.
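As a rough illustration of the method, the sketch below walks a set of resources and flags every utilization, saturation, or error metric that deserves a closer look. The `Sample` type, the `use_check` function, the 80% utilization threshold, and the sample numbers are all assumptions made for this example, not part of the USE method itself.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One resource's USE metrics (hypothetical container for this sketch)."""
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued or waiting work, 0 means none
    errors: int         # error events counted


def use_check(samples, util_threshold=0.8):
    """Return (resource, metric type, value) for every metric worth investigating.

    util_threshold is an assumed cut-off; tune it per resource.
    """
    findings = []
    for name, s in samples.items():
        if s.utilization >= util_threshold:
            findings.append((name, "utilization", s.utilization))
        if s.saturation > 0:
            findings.append((name, "saturation", s.saturation))
        if s.errors > 0:
            findings.append((name, "errors", s.errors))
    return findings


# Made-up numbers, for illustration only.
samples = {
    "CPU": Sample(utilization=0.93, saturation=3, errors=0),
    "Memory capacity": Sample(utilization=0.41, saturation=0, errors=0),
    "Network Interfaces": Sample(utilization=0.10, saturation=0, errors=12),
}

for resource_name, metric, value in use_check(samples):
    print(f"{resource_name}: {metric} = {value}")
```

Anything the function returns is only a starting point; as the description above says, each finding is then investigated with further strategies.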

Physical Resources

| component | type | metric |
| --- | --- | --- |
| CPU | utilization | system-wide: vmstat 1, "us" + "sy" + "st"; sar -u, sum fields except "%idle" and "%iowait"; dstat -c, sum fields except "idl" and "wai"; per-cpu: mpstat -P ALL 1, sum fields except "%idle" and "%iowait"; sar -P ALL, same as mpstat; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1] |
| CPU | saturation | system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp "queued(us)" [3] |
| CPU | errors | perf (LPE) if processor specific error events (CPC) are available; eg, AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4] |
| Memory capacity | utilization | system-wide: free -m, "Mem:" (main memory), "Swap:" (virtual memory); vmstat 1, "free" (main memory), "swap" (virtual memory); sar -r, "%memused"; dstat -m, "free"; slabtop -s c for kmem slab usage; per-process: top/htop, "RES" (resident main memory), "VIRT" (virtual memory), "Mem" for system-wide summary |
| Memory capacity | saturation | system-wide: vmstat 1, "si"/"so" (swapping); sar -B, "pgscank" + "pgscand" (scanning); sar -W; per-process: 10th field (min_flt) from /proc/PID/stat for minor-fault rate, or dynamic tracing [5]; OOM killer: dmesg \| grep killed |
| Memory capacity | errors | dmesg for physical failures; dynamic tracing, eg, SystemTap uprobes for failed malloc()s |
| Network Interfaces | utilization | sar -n DEV 1, "rxKB/s"/max "txKB/s"/max; ip -s link, RX/TX tput / max bandwidth; /proc/net/dev, "bytes" RX/TX tput/max; nicstat "%Util" [6] |
| Network Interfaces | saturation | ifconfig, "overruns", "dropped"; netstat -s, "segments retransmited"; sar -n EDEV, *drop and *fifo metrics; /proc/net/dev, RX/TX "drop"; nicstat "Sat" [6]; dynamic tracing for other TCP/IP stack queueing [7] |
| Network Interfaces | errors | ifconfig, "errors", "dropped"; netstat -i, "RX-ERR"/"TX-ERR"; ip -s link, "errors"; sar -n EDEV, "rxerr/s" "txerr/s"; /proc/net/dev, "errs", "drop"; extra counters may be under /sys/class/net/...; dynamic tracing of driver function returns [7] |
| Storage device I/O | utilization | system-wide: iostat -xz 1, "%util"; sar -d, "%util"; per-process: iotop; pidstat -d; /proc/PID/sched "se.statistics.iowait_sum" |
| Storage device I/O | saturation | iostat -xnz 1, "avgqu-sz" > 1, or high "await"; sar -d same; LPE block probes for queue length/latency; dynamic/static tracing of I/O subsystem (incl. LPE block probes) |
| Storage device I/O | errors | /sys/devices/.../ioerr_cnt; smartctl; dynamic/static tracing of I/O subsystem response codes [8] |
| Storage capacity | utilization | swap: swapon -s; free; /proc/meminfo "SwapFree"/"SwapTotal"; file systems: "df -h" |
| Storage capacity | saturation | not sure this one makes sense - once it's full, ENOSPC |
| Storage capacity | errors | strace for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on FS |
| Storage controller | utilization | iostat -xz 1, sum devices and compare to known IOPS/tput limits per-card |
| Storage controller | saturation | see storage device saturation, ... |
| Storage controller | errors | see storage device errors, ... |
| Network controller | utilization | infer from ip -s link (or /proc/net/dev) and known controller max tput for its interfaces |
| Network controller | saturation | see network interface saturation, ... |
| Network controller | errors | see network interface errors, ... |
| CPU interconnect | utilization | LPE (CPC) for CPU interconnect ports, tput / max |
| CPU interconnect | saturation | LPE (CPC) for stall cycles |
| CPU interconnect | errors | LPE (CPC) for whatever is available |
| Memory interconnect | utilization | LPE (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters |
| Memory interconnect | saturation | LPE (CPC) for stall cycles |
| Memory interconnect | errors | LPE (CPC) for whatever is available |
| I/O interconnect | utilization | LPE (CPC) for tput / max if available; inference via known tput from iostat/ip/... |
| I/O interconnect | saturation | LPE (CPC) for stall cycles |
| I/O interconnect | errors | LPE (CPC) for whatever is available |
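The two CPU rows that rely on vmstat can be turned into a small check. The sketch below parses one captured `vmstat 1` report, sums "us" + "sy" + "st" for utilization, and compares "r" against the CPU count for saturation. The sample output is made up for illustration, and `parse_vmstat` and `cpu_use` are hypothetical helper names.

```python
import os

# Illustrative `vmstat 1` output; the numbers are made up for this example.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0 812340  20480 401200    0    0     0    12  950 2100 71 22  5  1  1
"""


def parse_vmstat(text):
    """Map vmstat column names to the integer values of the last data line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[1].split()   # second line holds the column names
    values = lines[-1].split()  # last line holds the most recent sample
    return dict(zip(header, (int(v) for v in values)))


def cpu_use(row, cpu_count):
    """Return (utilization percent, saturated?) for one vmstat sample."""
    utilization = row["us"] + row["sy"] + row["st"]
    saturated = row["r"] > cpu_count  # more runnable tasks than CPUs
    return utilization, saturated


row = parse_vmstat(SAMPLE)
cpus = os.cpu_count() or 1
print(cpu_use(row, cpus))
```

Column names are looked up from the header line instead of by position, so the sketch does not depend on a particular column order.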

Software Resources

| component | type | metric |
| --- | --- | --- |
| Kernel mutex | utilization | With CONFIG_LOCK_STAT=y, /proc/lock_stat "holdtime-total" / "acquisitions" (also see "holdtime-min", "holdtime-max") [8]; dynamic tracing of lock functions or instructions (maybe) |
| Kernel mutex | saturation | With CONFIG_LOCK_STAT=y, /proc/lock_stat "waittime-total" / "contentions" (also see "waittime-min", "waittime-max"); dynamic tracing of lock functions or instructions (maybe); spinning shows up with profiling (perf record -a -g -F 997 ..., oprofile, dynamic tracing) |
| Kernel mutex | errors | dynamic tracing (eg, recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/crash |
| User mutex | utilization | valgrind --tool=drd --exclusive-threshold=... (held time); dynamic tracing of lock to unlock function time |
| User mutex | saturation | valgrind --tool=drd to infer contention from held time; dynamic tracing of synchronization functions for wait time; profiling (oprofile, LPE, ...) user stacks for spins |
| User mutex | errors | valgrind --tool=drd various errors; dynamic tracing of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... |
| Task capacity | utilization | top/htop, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max) |
| Task capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (sar -B "pgscan*"), else examine using dynamic tracing |
| Task capacity | errors | "can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: dynamic tracing of kernel_thread() ENOMEM |
| File descriptors | utilization | system-wide: sar -v, "file-nr" vs /proc/sys/fs/file-max; dstat --fs, "files"; or just /proc/sys/fs/file-nr; per-process: ls /proc/PID/fd \| wc -l vs ulimit -n |
| File descriptors | saturation | does this make sense? I don't think there is any queueing or blocking, other than on memory allocation. |
| File descriptors | errors | strace errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...). |
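The file descriptor utilization checks can be sketched as follows. The code assumes a Linux /proc layout, where /proc/sys/fs/file-nr holds three numbers (allocated handles, free handles, maximum); the function names are hypothetical and the sample file-nr content is made up.

```python
import os
import resource


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr content: allocated, free, maximum."""
    allocated, free, maximum = (int(x) for x in text.split())
    return allocated, free, maximum


def system_fd_utilization(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, free, maximum = parse_file_nr(text)
    return (allocated - free) / maximum


def process_fd_usage():
    """Return (open fds, soft limit) for the current process.

    Roughly `ls /proc/PID/fd | wc -l` vs `ulimit -n`. The count is None
    where /proc is not available.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        count = len(os.listdir("/proc/self/fd"))
    except OSError:
        count = None
    return count, soft


# Made-up file-nr content, for illustration only.
print(system_fd_utilization("1024\t0\t65536\n"))
print(process_fd_usage())
```

On a real system the text passed to `system_fd_utilization` would come from reading /proc/sys/fs/file-nr.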

ulimit

ulimit controls how much of the system's resources a user's processes may use.

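-a   Show all current limits
-c   Set the maximum size of core files, in blocks
-d   <data segment size> Maximum size of a program's data segment, in KB
-f   <file size> Maximum size of a file the shell can create, in blocks
-H   Set the hard limit for a resource, i.e. the limit imposed by the administrator
-m   <memory size> Upper limit on usable memory, in KB
-n   <number of file descriptors> Maximum number of fds that can be open at the same time
-p   <buffer size> Size of the pipe buffer, in units of 512 bytes
-s   <stack size> Upper limit on the stack, in KB
-S   Set the soft limit for a resource
-t   Upper limit on CPU time, in seconds
-u   <number of processes> Maximum number of processes a user can start
-v   <virtual memory size> Upper limit on usable virtual memory, in KB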

For example:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127988
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 655360
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
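The same limits can be read from inside a process. The sketch below uses Python's standard `resource` module to print the soft and hard values (the -S and -H sides) for a few of the entries above. The `LIMITS` table and the helper names are choices made for this example. Note that `getrlimit` reports sizes in bytes, while ulimit prints blocks or KB, so the numbers will not match the shell output digit for digit.

```python
import resource

# ulimit flag -> (label, resource constant name); labels follow `ulimit -a`.
LIMITS = {
    "-c": ("core file size", "RLIMIT_CORE"),
    "-n": ("open files", "RLIMIT_NOFILE"),
    "-s": ("stack size", "RLIMIT_STACK"),
    "-t": ("cpu time", "RLIMIT_CPU"),
    "-u": ("max user processes", "RLIMIT_NPROC"),
    "-v": ("virtual memory", "RLIMIT_AS"),
}


def fmt(value):
    """Render a raw limit the way ulimit does for unbounded values."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {flag: (label, soft, hard)} for limits this platform exposes."""
    result = {}
    for flag, (label, const) in LIMITS.items():
        rlimit = getattr(resource, const, None)
        if rlimit is None:  # not every constant exists on every platform
            continue
        soft, hard = resource.getrlimit(rlimit)
        result[flag] = (label, fmt(soft), fmt(hard))
    return result


for flag, (label, soft, hard) in read_limits().items():
    print(f"{label:<20} ({flag}) soft={soft} hard={hard}")
```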

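Note that opening a socket and similar resources also hands back an fd, so when ulimit -n is too small, a process may fail not only to open files but also to establish socket connections.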
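A small experiment makes this concrete. The sketch below temporarily lowers the soft fd limit of the current process, opens sockets until one fails, and reports the errno, which is expected to be EMFILE. The limit of 64 and the function name are arbitrary choices for this example; it assumes a Unix-like system where creating a socket is permitted, and it restores the original limit afterwards.

```python
import errno
import resource
import socket


def sockets_until_emfile(soft_limit=64):
    """Lower the fd soft limit, open sockets until one fails.

    Returns (number of sockets opened, errno of the failure).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    opened, failure = [], None
    try:
        # Each socket takes one fd, so soft_limit + 1 attempts must hit the limit.
        for _ in range(soft_limit + 1):
            opened.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    except OSError as exc:
        failure = exc.errno
    finally:
        count = len(opened)
        for s in opened:
            s.close()
        # Put the original soft limit back.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return count, failure


count, failure = sockets_until_emfile()
print(count, errno.errorcode.get(failure))
```

The number of sockets opened stays a little below the limit because stdin, stdout, stderr, and any other files the process already holds count against it too.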


Author: 冷情
