AWS AutoScaling not working / CPU utilization stays below 30%

Posted 2025-01-01 01:07:10

I have set up AWS AutoScaling as follows:

1) created a Load Balancer and registered one instance with it;

2) added Health Checks to the ELB;

3) added 2 Alarms (sketched in code after the list):

- CPU usage > 60% for 60s, spin up 1 instance;

- CPU usage < 40% for 120s, spin down 1 instance;

4) wrote a JMeter script to send traffic to the website in question: 250 threads, 200-second ramp-up time, loop count 5.
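(For reference, the alarm-and-policy wiring from step 3 would look roughly like this in boto3 if it were scripted rather than set up in the console; the group and policy names below are placeholders, not my actual resource names.)

```python
# Rough boto3 sketch of the scale-up/scale-down alarms described in step 3.
# "web-asg" and the policy names are made-up placeholders.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "web-asg"  # hypothetical Auto Scaling group name

# Scale-out policy: add one instance.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out-by-1",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# Scale-in policy: remove one instance.
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-in-by-1",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
)

# Alarm: average CPU > 60% for 60 seconds -> trigger scale-out.
cloudwatch.put_metric_alarm(
    AlarmName="cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)

# Alarm: average CPU < 40% for 120 seconds -> trigger scale-in.
cloudwatch.put_metric_alarm(
    AlarmName="cpu-low",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=120,
    EvaluationPeriods=1,
    Threshold=40.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[scale_in["PolicyARN"]],
)
```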

What I am seeing is very strange.

I expect the CPU usage to shoot up with the higher number of users. But instead the CPU usage stays between 20-30% (which is why the new instance never fires up), and the running instance starts throwing timeout errors once it gets anything more than 100 users.

I am at a loss to understand why CPU usage is so low when the website is in fact timing out.

Ideas?

Comments (3)

这样的小城市 2025-01-08 01:07:10

This could be a problem with the ELB. The ELB does not scale very quickly; it takes a consistent amount of traffic for Amazon to know you need a bigger one. If you just hit it really hard all at once, that does not help it scale. So the ELB could be having problems handling all the connections.
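One way to check whether the ELB itself is the bottleneck (rather than the instance) is to look at the ELB's own CloudWatch metrics. A minimal boto3 sketch, assuming a Classic ELB named "my-elb" (placeholder):

```python
# Sketch: check whether a Classic ELB is saturating by reading its
# SurgeQueueLength / SpilloverCount CloudWatch metrics.
# "my-elb" is a placeholder load balancer name.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

for metric in ("SurgeQueueLength", "SpilloverCount"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
        StartTime=datetime.utcnow() - timedelta(minutes=30),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Maximum"],
    )
    # A growing SurgeQueueLength or any non-zero SpilloverCount means the ELB
    # is queueing or dropping requests, i.e. it has not scaled up yet.
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Maximum"])
```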

Is this SSL? Are you doing SSL on the ELB? That would add overhead to an underscaled ELB as well.

I would honestly recommend not using ELB at all. HAProxy is a much better product and much faster in most cases. I can elaborate if needed, but just look at how Amazon handles the CNAME vs. what you can do with HAProxy...

來不及說愛妳 2025-01-08 01:07:10

It sounds like you are testing AutoScaling to ensure it will work for your needs. As a first pass, simply to see if AS will launch a new instance, try reducing your scale-up CPU check to trigger at 25%. I realize this is a lot lower than you are hoping to use moving forward, but it will help validate that your initial configuration is working.

As a second step, you should take a look at your application and see if CPU is the best metric to have AS monitor for scaling. It is possible that you have a bottleneck somewhere else in your app that may not necessarily be CPU related (web server tuning, memory, databases, storage, etc). You didn't mention what type of content you're serving out; is it static or generated by an interpreter (like PHP or something else)? You could also send your own custom metric data into CloudWatch and use this metric to trigger the scaling.
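As a rough illustration of the custom-metric route: each instance can periodically push whatever number actually reflects the bottleneck (busy workers, queue depth, latency) to CloudWatch and alarm on that instead of CPU. A sketch with an invented "BusyWorkers" metric name:

```python
# Sketch: publish a custom metric (here an invented "BusyWorkers" gauge)
# that an Auto Scaling alarm could watch instead of CPUUtilization.
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_busy_workers(count, instance_id):
    cloudwatch.put_metric_data(
        Namespace="MyApp",  # custom namespace, not an AWS one
        MetricData=[
            {
                "MetricName": "BusyWorkers",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": float(count),
                "Unit": "Count",
            }
        ],
    )

# Called periodically (e.g. from cron) on each instance, for example:
# report_busy_workers(current_busy_worker_count(), "i-0123456789abcdef0")
```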

You may also want to time how long it takes for an instance to be ready to serve traffic from a cold start. If it takes longer than 60 seconds, you may want to adjust your monitoring threshold time appropriately (or set cool down periods). As chantheman pointed out, it can take some time for the ELB to register the instance as well (and a longer amount of time if the new instance is in a different AZ).
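To get a feel for that cold-start delay, one crude approach is to launch an instance, register it, and poll the Classic ELB health API until it reports "InService". A sketch with placeholder names:

```python
# Sketch: measure how long a newly launched instance takes to become
# "InService" behind a Classic ELB. Names/IDs are placeholders.
import time
import boto3

elb = boto3.client("elb")

def wait_until_in_service(lb_name, instance_id, poll_seconds=10):
    start = time.time()
    while True:
        health = elb.describe_instance_health(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": instance_id}],
        )
        state = health["InstanceStates"][0]["State"]
        if state == "InService":
            return time.time() - start
        time.sleep(poll_seconds)

# elapsed = wait_until_in_service("my-elb", "i-0123456789abcdef0")
# print("took %.0f seconds to go InService" % elapsed)
```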

I hope all of this helps.

梦与时光遇 2025-01-08 01:07:10

What we discovered is that when you are using autoscaling on t2 instances under heavy load, those instances will run out of CPU credits and are then limited to 20% CPU from the monitoring (CloudWatch) point of view, even though htop inside the instance still shows 100%. Internally they are at maximum load.

This sends a false metric to Auto Scaling, and new instances will not fire up.

You need to change the metric, develop your own, or move to m-type instances.
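A quick way to confirm whether credit exhaustion is what is happening is to look at the instance's CPUCreditBalance metric (or set an alarm on it). A boto3 sketch with a placeholder instance id:

```python
# Sketch: read a t2 instance's remaining CPU credits from CloudWatch.
# A balance near zero means the instance is being throttled to its baseline.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,  # credit metrics are reported at 5-minute granularity
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```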
