Error "No worker is available to serve request: model" when invoking a SageMaker endpoint during increased load
I have a custom container that takes a request, does some feature extraction and then passes on the enhanced request to a classifier endpoint. During feature extraction another endpoint is being called for generating text embeddings. I am using the HuggingFace estimator for my embedding model.
It has been working fine, but there was an increase in requests and it looks like the embedding endpoint timed out somehow.
I am looking at adding automatic scaling to the endpoint, but I want to make sure I understand what is happening and that it properly addresses the issue. Unfortunately, searching for this error message does not pull up much. The instance metrics are not showing the endpoint to be overloaded: CPU utilization peaked at around 30%. Would auto scaling address the no-worker issue, or is this something different? I was receiving a few hundred requests per minute at the time.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/opt/program/predictor.py", line 56, in transformation
result = preprocessor.transform(data)
File "/opt/program/preprocessor.py", line 189, in transform
response = embed_predictor.predict(data=json.dumps(payload))
File "/usr/local/lib/python3.7/site-packages/sagemaker/predictor.py", line 136, in predict
response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 386, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 705, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (503) from primary with message "{
"code": 503,
"type": "ServiceUnavailableException",
"message": "No worker is available to serve request: model"
}
Comments (1)
I would suggest confirming that MemoryUtilization is not being overwhelmed and that there is no specific error in CloudWatch Logs as well. If MemoryUtilization is overwhelmed, you can test configuring Auto Scaling in order to distribute the request load across multiple instances. That being said, while I am not sure of the details of your custom container, I also recommend confirming that the container itself can handle multiple concurrent requests (i.e., that it has multiple workers available to serve requests).
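As a rough illustration only, here is a minimal sketch of what configuring Auto Scaling for the embedding endpoint could look like with boto3's Application Auto Scaling client. The endpoint name, variant name, capacities, and target value below are placeholders, not values from the original setup:

import boto3

# Application Auto Scaling manages scaling for SageMaker endpoint variants.
autoscaling = boto3.client("application-autoscaling")

# Placeholder endpoint/variant names; "AllTraffic" is the default variant name.
resource_id = "endpoint/my-embedding-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out based on average invocations per instance per minute.
autoscaling.put_scaling_policy(
    PolicyName="embedding-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)

On the worker side, the "No worker is available to serve request: model" message appears to come from the model server inside the embedding container rather than from the instance running out of CPU, which would be consistent with the ~30% CPU utilization you observed. If that is the case, it may also be worth checking how many model server workers the container starts (for example, the SageMaker inference toolkit reads a SAGEMAKER_MODEL_SERVER_WORKERS environment variable, if your container version supports it), since adding instances alone may not help if each instance can only serve one request at a time.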