How to gracefully shut down or remove AWS instances from an ELB group

I have a cloud of server instances running at Amazon, using their load balancer to distribute the traffic. Now I am looking for a good way to gracefully scale the network down without causing connection errors on the browser's side.

As far as I know, any open connections to an instance are abruptly terminated when it is removed from the load balancer.

I would like a way to notify my instance, say, one minute before it gets shut down, or to have the load balancer stop sending traffic to the dying instance without terminating its existing connections.

My app is Node.js based and runs on Ubuntu. I also run some special software on it, so I would prefer not to use one of the many PaaS offerings that host Node.js.

Thanks for any hints.

蓝色星空 2024-12-15 12:04:26

I know this is an old question, but it should be noted that Amazon has since added support for connection draining: when an instance is removed from the load balancer, it first completes the requests that were in progress, and no new requests are routed to it. You can also supply a timeout for these requests, meaning any request that runs longer than the timeout window will be terminated after all.

To enable this behaviour, go to the Instances tab of your load balancer and change the Connection Draining setting.
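
For reference, the same setting can also be changed from the AWS CLI; a minimal sketch for a Classic ELB ("my-elb" and the 300-second timeout are placeholder values):

# Enable connection draining with a 300 s timeout on a Classic ELB.
aws elb modify-load-balancer-attributes \
    --load-balancer-name my-elb \
    --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'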

°如果伤别离去 2024-12-15 12:04:26

This idea uses the ELB's capability to detect an unhealthy node and remove it from the pool, BUT it relies on the ELB behaving as expected under the assumptions below. This is something I've been meaning to test for myself but haven't had the time yet; I'll update the answer when I do.

Process Overview

The following logic could be wrapped and run at the time the node needs to be shut down.

  1. Block new HTTP connections to nodeX but continue to allow existing connections.
  2. Wait for existing connections to drain, either by monitoring existing connections to your application or by allowing a "safe" amount of time.
  3. Initiate a shutdown of the nodeX EC2 instance using the EC2 API directly or abstracted scripts.

"Safe" is defined by your application and may not be determinable for some applications; a rough sketch of such a wrapper follows.

Assumptions that need to be tested

We know that ELB removes unhealthy instances from its pool. I would expect this to be graceful, so that:

  1. A new connection to a recently closed port is gracefully redirected to the next node in the pool.
  2. When a node is marked bad, the connections already established to that node are unaffected.

Possible test cases:

  • Fire HTTP connections at the ELB (e.g. from a curl script), logging the
    results while one of the node's HTTP ports is scripted to open and close.
    You would need to experiment to find an amount of time that reliably
    lets the ELB detect the state change. (A probe sketch follows this list.)
  • Maintain a long HTTP session (e.g. a file download) while blocking new
    HTTP connections; the long session should hopefully continue.
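
A minimal probe sketch (the URL is a placeholder; it logs a timestamped HTTP status every half-second so errors or gaps stand out while the node's port is opened and closed):

#!/bin/bash
# Hypothetical ELB probe; placeholder URL.
lburl="http://ELBURL.REGION.elb.amazonaws.com/"
while true; do
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$lburl")
    echo "$(date '+%H:%M:%S') $code"
    sleep 0.5
done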

1. How to block HTTP Connections

Use a local firewall on nodeX to block new sessions but continue to allow established sessions.

For example, with iptables (matching only SYN packets, so established sessions continue to be allowed):

iptables -A INPUT -j DROP -p tcp --syn --destination-port <web service port>
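
If the shutdown is aborted, the same rule can be deleted again by replacing -A with -D (this mirrors the rule above; the port placeholder is unchanged):

iptables -D INPUT -j DROP -p tcp --syn --destination-port <web service port>
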
携君以终年 2024-12-15 12:04:26

The recommended way for distributing traffic from your ELB is to have an equal number of instances across multiple availability zones. For example:

ELB

  • Instance 1 (us-east-a)
  • Instance 2 (us-east-a)
  • Instance 3 (us-east-b)
  • Instance 4 (us-east-b)

There are now two ELB API operations of interest that allow you to detach instances programmatically (or via the control panel):

  1. Deregister an instance
  2. Disable an availability zone (which subsequently disables the instances within that zone)
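
A sketch of both operations with the AWS CLI (load balancer name, instance ID, and zone are placeholder values):

# 1. Deregister a single instance from the load balancer.
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-elb --instances i-XXXXXXX

# 2. Disable a whole availability zone for the load balancer.
aws elb disable-availability-zones-for-load-balancer \
    --load-balancer-name my-elb --availability-zones us-east-1b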

The ELB Developer Guide has a section that describes the effects of disabling an availability zone. A note in that section is of particular interest:

Your load balancer always distributes traffic to all the enabled
Availability Zones. If all the instances in an Availability Zone are
deregistered or unhealthy before that Availability Zone is disabled
for the load balancer, all requests sent to that Availability Zone
will fail until you call DisableAvailabilityZonesForLoadBalancer for
that Availability Zone.

What's interesting about the above note is that it could imply that if you call DisableAvailabilityZonesForLoadBalancer, the ELB could instantly start sending requests only to the remaining enabled zones, possibly resulting in a zero-downtime experience while you perform maintenance on the servers in the disabled Availability Zone.

The above 'theory' needs detailed testing or acknowledgement from an Amazon cloud engineer.

煮酒 2024-12-15 12:04:26

There have already been a number of responses here, and some of them give good advice. But I think that in general your design is flawed. No matter how perfectly you design your shutdown procedure to make sure a client's connection is closed before shutting down a server, you're still vulnerable:

  1. The server could lose power.
  2. A hardware failure could cause the server to fail.
  3. A connection could be closed by a network issue.
  4. The client could lose its internet or Wi-Fi connection.

I could go on, but my point is this: instead of designing the system to always work correctly, design it to handle failures. If you design a system that can handle a server losing power at any time, you've created a very robust system. This isn't a problem with the ELB; it's a problem with your current system architecture.

卸妝后依然美 2024-12-15 12:04:26

I can't comment because of my low reputation, so here are some snippets I crafted that might be useful to someone out there. They use the aws CLI tool to check when an instance has been drained of connections.

You need an EC2 instance behind an ELB, running the Python server below.

# Minimal Flask server: "/" for health checks, "/wait/<secs>" holds a connection open.
from flask import Flask
import time

app = Flask(__name__)

@app.route("/")
def index():
    return "ok\n"

@app.route("/wait/<int:secs>")
def wait(secs):
    # Sleep to simulate a long-running request.
    time.sleep(secs)
    return str(secs) + "\n"

if __name__ == "__main__":
    app.run(
        host='0.0.0.0',
        debug=True)

Then run the following script from a local workstation against the ELB.

#!/bin/bash

which jq > /dev/null || {
    echo "Get jq from http://stedolan.github.com/jq"
    exit 1
}

# Fill in the following vars
lbname="ELBNAME"
lburl="http://ELBURL.REGION.elb.amazonaws.com/wait/30"
instanceid="i-XXXXXXX"

getState () {
    aws elb describe-instance-health \
        --load-balancer-name $lbname \
        --instances $instanceid | jq '.InstanceStates[0].State' -r
}

register () {
    aws elb register-instances-with-load-balancer \
        --load-balancer-name $lbname \
        --instances $instanceid | jq .
}

deregister () {
    aws elb deregister-instances-from-load-balancer \
        --load-balancer-name $lbname \
        --instances $instanceid | jq .
}

waitUntil () {
    echo -n "Wait until state is $1"
    while [ "$(getState)" != "$1" ]; do
        echo -n "."
        sleep 1
    done
    echo
}

# Actual dance:
# make sure the instance is registered, then watch it until it is deregistered.

if [ "$(getState)" == "OutOfService" ]; then
    register > /dev/null
fi

waitUntil "InService"

# Start a 30-second request, then deregister the instance while it is in flight.
curl $lburl &
sleep 1

deregister > /dev/null

waitUntil "OutOfService"
毁虫ゝ 2024-12-15 12:04:26

A caveat not discussed in the existing answers is that ELBs also use DNS records with 60-second TTLs to balance load between multiple ELB nodes (each having one or more of your instances attached to it).

This means that if you have instances in two different availability zones, you probably have two IP addresses for your ELB, with a 60-second TTL on their A records. When you remove the final instance from such an availability zone, your clients "might" still use the old IP address for at least a minute, and faulty DNS resolvers might behave much worse.

Another case where an ELB holds multiple IPs and has the same problem is when a single availability zone contains so many instances that one ELB server cannot handle them. In that case the ELB will also spin up another server and add its IP to the list of A records, again with a 60-second TTL.
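
A quick way to observe this (the hostname is a placeholder; re-run the query while adding or removing instances and watch the set of A records change):

dig +short ELBURL.REGION.elb.amazonaws.com A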
