ECS / EC2 auto scaling doesn't handle two tasks run one after another
I'm currently at my wits' end trying to figure this out.
We have a Step Functions pipeline that runs tasks on a mixture of Fargate and EC2 ECS instances. They are all in the same cluster.
If we run a task that requires EC2, and we want to run another task afterwards that also uses EC2, we have to put a 20-minute Wait state between them in order for the second task to run successfully.
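For reference, the workaround is just a plain Wait state between the two Map states (Map1's "Next" points at it instead of at Map2); a minimal sketch, with an illustrative state name:

"WaitForEC2Capacity": {
  "Type": "Wait",
  "Seconds": 1200,
  "Next": "Map2"
}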
It doesn't seem to use the existing EC2 instances, or scale out any further, when we run the second task; instead the task fails with the error RESOURCE:MEMORY. I would expect it to scale out more EC2 instances to match the demand, or to use the existing EC2 instances to run the tasks.
The ECS cluster has a capacity provider with managed scaling on, managed termination protection on, and a target capacity of 100%.
The ASG has a min capacity of 0 and a max capacity of 8, with managed scaling enabled.
The instance type is r5.4xlarge.
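For context, that setup corresponds to a capacity provider configured roughly as follows (the JSON shape of the ECS CreateCapacityProvider API; the provider name and the ASG ARN are placeholders):

{
  "name": "ec2-capacity-provider",
  "autoScalingGroupProvider": {
    "autoScalingGroupArn": "arn:aws:autoscaling:REGION:ACCOUNT_ID:autoScalingGroup:...",
    "managedScaling": {
      "status": "ENABLED",
      "targetCapacity": 100
    },
    "managedTerminationProtection": "ENABLED"
  }
}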
Example step function that recreates the problem:
{
"StartAt": "Set up variables",
"States": {
"Set up variables": {
"Type": "Pass",
"Next": "Map1",
"Result": [
1,
2,
3
],
"ResultPath": "$.input"
},
"Map1": {
"Type": "Map",
"Next": "Map2",
"ItemsPath": "$.input",
"ResultPath": null,
"Iterator": {
"StartAt": "Inner1",
"States": {
"Inner1": {
"ResultPath": null,
"Type": "Task",
"TimeoutSeconds": 2000,
"End": true,
"Resource": "arn:aws:states:::ecs:runTask.sync",
"Parameters": {
"Cluster": "arn:aws:ecs:CLUSTER_ID",
"TaskDefinition": "processing-task",
"NetworkConfiguration": {
"AwsvpcConfiguration": {
"Subnets": [
"subnet-111"
]
}
},
"Overrides": {
"Memory": "110000",
"Cpu": "4096",
"ContainerOverrides": [
{
"Command": [
"sh",
"-c",
"sleep 600"
],
"Name": "processing-task"
}
]
}
}
}
}
}
},
"Map2": {
"Type": "Map",
"End": true,
"ItemsPath": "$.input",
"Iterator": {
"StartAt": "Inner2",
"States": {
"Inner2": {
"ResultPath": null,
"Type": "Task",
"TimeoutSeconds": 2000,
"End": true,
"Resource": "arn:aws:states:::ecs:runTask.sync",
"Parameters": {
"Cluster": "arn:aws:ecs:CLUSTER_ID",
"TaskDefinition": "processing-task",
"NetworkConfiguration": {
"AwsvpcConfiguration": {
"Subnets": [
"subnet-111"
]
}
},
"Overrides": {
"Memory": "110000",
"Cpu": "4096",
"ContainerOverrides": [
{
"Command": [
"sh",
"-c",
"sleep 600"
],
"Name": "processing-task"
}
]
}
}
}
}
}
}
}
}
What I've tried so far:
I've tried changing the cooldown period for the EC2 instances, with a small amount of success. The only problem is that now it scales up too fast, and we still have to wait before running more tasks; we just have to wait for a shorter time.
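If the cooldown in question is the managed-scaling instance warm-up, note that it is a setting on the capacity provider itself rather than on the ASG; a sketch of lowering it from its 300-second default, using the UpdateCapacityProvider API shape (the provider name is a placeholder):

{
  "name": "ec2-capacity-provider",
  "autoScalingGroupProvider": {
    "managedScaling": {
      "status": "ENABLED",
      "targetCapacity": 100,
      "instanceWarmupPeriod": 60
    }
  }
}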
Please let me know if what we want is possible, and how to do it if it is.
Thank you.
1 Answer
I very recently ran into a similar scenario with a Capacity Provider. Bursts of concurrent task placements via ECS run-task (invoked with a Lambda) were not returning task information in the response. Despite this, a task was queued in the PROCESSING state on the cluster, where it would sit for some time and then eventually fail to start with the error RESOURCE:MEMORY.
Speculation: it seems the problem is related to the refresh interval of the capacity provider's CapacityProviderReservation metric: https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/.
CapacityProviderReservation needs to change in order for your cluster to scale out (or in) based on its Alarm, but bursts of task placements which exceed your total current capacity don't always seem to satisfy this requirement.
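Paraphrasing the linked deep dive, the metric is computed roughly as

CapacityProviderReservation = 100 * M / N

where N is the number of instances already running and M is the number of instances needed to run all running and provisioning tasks. With one fully occupied instance (N = 1) and one more task queued (M = 2), the metric should read 200 and trigger a scale-out; the failures described here suggest that a burst of placements isn't always reflected in M before the queued tasks give up.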
We were able to overcome this behavior of failing to place tasks by exponentially backing off and retrying the call to ECS run-task if the response contains an empty tasks[] collection. This has had only a minor impact on our task placement throughput, and we haven't seen the problem recur since.
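Since the question launches its tasks from Step Functions rather than from a Lambda, a rough equivalent of this backoff-and-retry (a sketch, not part of the original answer) is a Retry field with exponential backoff on each runTask.sync task state, so a failed placement is retried instead of being papered over with a fixed Wait:

"Retry": [
  {
    "ErrorEquals": ["ECS.AmazonECSException", "States.TaskFailed"],
    "IntervalSeconds": 30,
    "MaxAttempts": 6,
    "BackoffRate": 2.0
  }
]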