Prometheus execution timeout

I use Grafana to monitor my company's infrastructure. Everything worked fine until this week, when I started seeing alerts in Grafana with the following error message:

request handler error: Post "http://prometheus-ip:9090/api/v1/query_range": dial tcp prometheus-ip:9090: i/o timeout
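
To rule out Grafana itself, the same endpoint can be hit directly with curl. The sketch below is only illustrative: the query, time range and step are placeholders, not what my dashboards actually send.

curl -sS --max-time 30 -G 'http://prometheus-ip:9090/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'

If this call also hangs until the --max-time limit, the timeout is on the Prometheus side rather than in Grafana's data source settings.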

I tried to restart the Prometheus server, but it would not stop cleanly, so I had to kill -9 the process and start it again. Here is the log:

Jun 16 01:04:01 prometheus prometheus[18869]: time="2022-06-16T01:04:01+02:00" level=info msg="All requests for rebuilding the label indexes queued. (Actual processing may lag behind.)" source="crashrecovery.go:529"
Jun 16 01:04:01 prometheus prometheus[18869]: time="2022-06-16T01:04:01+02:00" level=info msg="Checkpointing fingerprint mappings..." source="persistence.go:1480"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="Done checkpointing fingerprint mappings in 286.224481ms." source="persistence.go:1503"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=warning msg="Crash recovery complete." source="crashrecovery.go:152"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="362306 series loaded." source="storage.go:378"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="Starting target manager..." source="targetmanager.go:61"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="Listening on :9090" source="web.go:235"
Jun 16 01:04:15 prometheus prometheus[18869]: time="2022-06-16T01:04:15+02:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=420483 maxChunksToPersist=524288 maxMemoryChunks=1048576 memoryChunks=655877 source="storage.go:1660" urgencyScore=0.8020076751708984
Jun 16 01:09:02 prometheus prometheus[18869]: time="2022-06-16T01:09:02+02:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612"
Jun 16 01:10:05 prometheus prometheus[18869]: time="2022-06-16T01:10:05+02:00" level=info msg="Done checkpointing in-memory metrics and chunks in 1m3.127365726s." source="persistence.go:639"
Jun 16 01:12:25 prometheus prometheus[18869]: time="2022-06-16T01:12:25+02:00" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:230"
Jun 16 01:12:25 prometheus prometheus[18869]: time="2022-06-16T01:12:25+02:00" level=info msg="See you next time!" source="main.go:237"
Jun 16 01:12:25 prometheus prometheus[18869]: time="2022-06-16T01:12:25+02:00" level=info msg="Stopping target manager..." source="targetmanager.go:75"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping rule manager..." source="manager.go:374"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Rule manager stopped." source="manager.go:380"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping notification handler..." source="notifier.go:369"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping local storage..." source="storage.go:396"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping maintenance loop..." source="storage.go:398"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Maintenance loop stopped." source="storage.go:1259"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping series quarantining..." source="storage.go:402"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Series quarantining stopped." source="storage.go:1701"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping chunk eviction..." source="storage.go:406"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Chunk eviction stopped." source="storage.go:1079"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612"
Jun 16 01:12:44 prometheus prometheus[18869]: time="2022-06-16T01:12:44+02:00" level=info msg="Done checkpointing in-memory metrics and chunks in 16.170119611s." source="persistence.go:639"
Jun 16 01:12:44 prometheus prometheus[18869]: time="2022-06-16T01:12:44+02:00" level=info msg="Checkpointing fingerprint mappings..." source="persistence.go:1480"
Jun 16 01:12:45 prometheus prometheus[18869]: time="2022-06-16T01:12:45+02:00" level=info msg="Done checkpointing fingerprint mappings in 651.409422ms." source="persistence.go:1503"
Jun 16 01:12:45 prometheus systemd[1]: prometheus.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Jun 16 01:13:06 prometheus systemd[1]: prometheus.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Jun 16 01:13:06 prometheus systemd[1]: prometheus.service: Unit entered failed state.
Jun 16 01:13:06 prometheus systemd[1]: prometheus.service: Failed with result 'timeout'.
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=info msg="Starting prometheus (version=1.5.2+ds, branch=debian/sid, revision=1.5.2+ds-2+b3)" source="main.go:75"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=info msg="Build context (go=go1.7.4, [email protected], date=20170521-14:39:14)" source="main.go:76"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:248"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=error msg="Could not lock /path/to/prometheus/metrics/DIRTY, Prometheus already running?" source="persistence.go:198"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=error msg="Error opening memory series storage: resource temporarily unavailable" source="main.go:182"
Jun 16 01:13:24 prometheus systemd[1]: prometheus.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 01:13:44 prometheus systemd[1]: prometheus.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Jun 16 01:14:02 prometheus prometheus[18869]: time="2022-06-16T01:14:02+02:00" level=info msg="Local storage stopped." source="storage.go:421"
Jun 16 01:14:02 prometheus systemd[1]: prometheus.service: Unit entered failed state.
Jun 16 01:14:02 prometheus systemd[1]: prometheus.service: Failed with result 'exit-code'.
Jun 16 01:14:03 prometheus systemd[1]: prometheus.service: Service hold-off time over, scheduling restart.
Jun 16 01:14:03 prometheus prometheus[20564]: time="2022-06-16T01:14:03+02:00" level=info msg="Starting prometheus (version=1.5.2+ds, branch=debian/sid, revision=1.5.2+ds-2+b3)" source="main.go:75"
Jun 16 01:14:03 prometheus prometheus[20564]: time="2022-06-16T01:14:03+02:00" level=info msg="Build context (go=go1.7.4, [email protected], date=20170521-14:39:14)" source="main.go:76"
Jun 16 01:14:03 prometheus prometheus[20564]: time="2022-06-16T01:14:03+02:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:248"
Jun 16 01:14:04 prometheus prometheus[20564]: time="2022-06-16T01:14:04+02:00" level=info msg="Loading series map and head chunks..." source="storage.go:373"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=info msg="364314 series loaded." source="storage.go:378"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=info msg="Starting target manager..." source="targetmanager.go:61"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=info msg="Listening on :9090" source="web.go:235"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=448681 maxChunksToPersist=524288 maxMemoryChunks=1048576 memoryChunks=687476 source="storage.go:1660" urgencyScore=0.8557910919189453
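
Judging from the "stop-sigterm" / "stop-final-sigterm" messages above, systemd gives up on the unit before the shutdown checkpoint has finished: the old process (PID 18869) only reports "Local storage stopped." at 01:14:02, which is why the freshly started instance cannot lock the DIRTY file. As a possible mitigation, and purely an assumption on my part about the unit configuration, a drop-in that gives the service more time to stop might help, e.g.:

sudo systemctl edit prometheus.service
# add something along these lines to the override:
#   [Service]
#   TimeoutStopSec=15min
sudo systemctl daemon-reload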

When restarted this way, Prometheus goes through crash recovery, which takes about 1 h 30 min to complete. Once it finishes, the logs show the following:

Jun 16 16:10:42 prometheus prometheus[32708]: time="2022-06-16T16:10:42+02:00" level=info msg="Storage does not need throttling anymore." chunksToPersist=524288 maxChunksToPersist=524288 maxToleratedMemChunks=1153433 memoryChunks=1049320 source="storage.go:935"
Jun 16 16:10:42 prometheus prometheus[32708]: time="2022-06-16T16:10:42+02:00" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=525451 maxChunksToPersist=524288 maxToleratedMemChunks=1153433 memoryChunks=1050483 source="storage.go:927"
Jun 16 16:15:31 prometheus prometheus[32708]: time="2022-06-16T16:15:31+02:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612"
Jun 16 16:16:28 prometheus prometheus[32708]: time="2022-06-16T16:16:28+02:00" level=info msg="Done checkpointing in-memory metrics and chunks in 57.204367083s." source="persistence.go:639"

This checkpointing repeats frequently, and each run takes about one minute.
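
The "rushed mode" and throttling lines suggest that persistence cannot keep up: chunksToPersist keeps reaching maxChunksToPersist=524288, at which point scrapes and rule evaluations are skipped. To follow this between restarts, the server's own storage metrics can be queried; the metric names below are the ones I know from the 1.x documentation, so treat them as an assumption for this particular build:

# instantaneous values of the local-storage pressure counters
curl -sS -G 'http://prometheus-ip:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_local_storage_chunks_to_persist'
curl -sS -G 'http://prometheus-ip:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_local_storage_memory_chunks'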

The monitoring for this server is shown in the attached screenshot (not reproduced here).

Here are the flags used:

/usr/bin/prometheus --storage.local.path /path/to/prometheus/metrics --storage.local.retention=1460h0m0s --storage.local.series-file-shrink-ratio=0.3
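
The maxChunksToPersist=524288 and maxMemoryChunks=1048576 values in the log match the 1.x defaults, so these limits are not set explicitly right now. If the machine has spare RAM, one option I am considering, with purely illustrative values (each chunk is roughly 1 KiB plus overhead, as I understand the 1.x storage docs), is to raise them:

/usr/bin/prometheus --storage.local.path /path/to/prometheus/metrics \
  --storage.local.retention=1460h0m0s \
  --storage.local.series-file-shrink-ratio=0.3 \
  --storage.local.memory-chunks=2097152 \
  --storage.local.max-chunks-to-persist=1048576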

Prometheus version:

prometheus --version
prometheus, version 1.5.2+ds (branch: debian/sid, revision: 1.5.2+ds-2+b3)
  build user:       [email protected]
  build date:       20170521-14:39:14
  go version:       go1.7.4

I have decided to move some metrics to another server so that this one is not as loaded as before. However, it still has to scrape metrics from more than 50 other servers. What could be the cause of this?
