Extremely slow Apollo Gateway / Server responses

Posted 2025-02-06 16:16:43

We're facing some challenging performance issues with some queries taking an extremely long time to respond, often in excess of 20s and frequently timing out at 60s. These extreme timings cannot be reproduced in local dev environments, even when using load testing tools like graphql-bench and autocannon. These same queries can also complete in ~500ms.

We have a federated graph setup with one instance of Apollo Gateway (running ~0.24) and 2 instances of Apollo Server (running 3.x) providing subgraphs. All 3 use NestJS 7 on top of Apollo Server + Express. Our resolvers use TypeORM to interface with a reasonably large and complex MSSQL database. The database server is heavily over-provisioned, and its resource monitors never show more than 10% RAM or CPU usage.

The API servers are hosted on AWS ECS and are generously resourced. As with the DB server, CPU and RAM usage average out at 10-20%, although the CPU does sometimes spike to >100%.

Tracing in Apollo Studio appears to show some high-level resolvers "hanging" or "waiting" for long periods of time, e.g. 27s out of a 30s response.
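
One thing we could check is whether those "waiting" periods line up with a blocked Node.js event loop (which would also fit the occasional CPU spikes) rather than with time spent in the database. Below is a minimal sketch of a timer-drift probe, assuming it runs inside each API process; `startLagProbe` and `percentile` are hypothetical names for illustration, not part of Apollo or NestJS.

```typescript
// Measures event-loop lag as timer drift: how much later than scheduled
// a repeating timer actually fires. Large values mean the loop was blocked.
function startLagProbe(
  intervalMs: number,
  onSample: (lagMs: number) => void
): () => void {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    // Lag is the delay beyond the scheduled time, never negative.
    const lag = Math.max(0, now - expected);
    expected = now + intervalMs;
    onSample(lag);
  }, intervalMs);
  // The returned function stops the probe.
  return () => clearInterval(timer);
}

// Nearest-rank percentile over collected lag samples (p in 0..100).
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  const index = Math.min(sorted.length - 1, Math.max(0, rank));
  return sorted[index];
}

// Hypothetical usage (left commented out so the sketch has no side effects):
// const samples: number[] = [];
// const stop = startLagProbe(100, (lag) => samples.push(lag));
// ...later: console.log("p99 lag ms:", percentile(samples, 99)); stop();
```

If the p99 lag stays in the low milliseconds while a request takes 27s, the wait is more likely outside the process (connection pool, network, database locks) than in blocked JavaScript.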

We do have some N+1 issues and we've made some limited use of Dataloader in places to resolve them, although we can definitely implement it more widely. I'm not convinced that the N+1 issues are responsible for the extreme timings though, given how widely those timings range and how inconsistent they are.
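
For context, the batching idea behind Dataloader can be sketched as follows. This is a simplified, hand-rolled illustration of the technique, not the `dataloader` package itself and not our actual resolver code; `TinyBatchLoader` and `fetchUsersByIds` are hypothetical names.

```typescript
// A batch function receives all keys requested in one tick and must
// return values in the same order as the keys.
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (reason: unknown) => void;
}

class TinyBatchLoader<K, V> {
  private queue: Pending<K, V>[] = [];
  private readonly batchFn: BatchFn<K, V>;

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  // Queues the key; the first key added to an empty queue schedules one flush.
  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (this.queue.length === 1) {
        Promise.resolve().then(() => this.flush());
      }
    });
  }

  // Sends every queued key to the batch function in a single call.
  private async flush(): Promise<void> {
    const pending = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(pending.map((p) => p.key));
      pending.forEach((p, i) => p.resolve(values[i] as V));
    } catch (err) {
      pending.forEach((p) => p.reject(err));
    }
  }
}

// Hypothetical stand-in for one `WHERE id IN (...)` query.
let queryCount = 0;
async function fetchUsersByIds(ids: readonly number[]): Promise<string[]> {
  queryCount += 1;
  return ids.map((id) => `user-${id}`);
}
```

With this sketch, three `load` calls issued in the same tick (as sibling resolvers would do) result in one call to `fetchUsersByIds` instead of three. The real library also caches per key, which this sketch leaves out.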

Does this scenario sound familiar to anyone? Does anyone have any ideas or pointers on how we might go about tracking down the root issue and debugging it? We're a bit stumped tbh, especially as the issues seem to be impossible to repro locally.