Cause of a peculiar / degrading triangular latency spike pattern during load testing

Asked 2025-02-13 08:59:31

I am having a hard time identifying the underlying issue behind the following latency pattern in the max percentile of my application:
[image: Gatling chart of max-percentile response time over the 4-minute load test]

This is a Gatling chart that shows 4 minutes of load testing. The first two minutes are a warmup running the same scenario (that's why they have no latency graph).

Two triangles (sometimes more) with a nearly identical slope are clearly visible and reproducible across multiple test runs, no matter how many application instances we deploy behind our load balancer:
[image: Gatling chart showing the repeating triangular spike pattern]

I am looking for more avenues to investigate, as I am having a hard time searching for this pattern. It strikes me as particularly odd that the triangle is not "filled" but consists only of spikes. Furthermore, the triangle feels "inverted": if this were a scenario with ever-increasing load (which it isn't), I would expect this kind of triangle to manifest with an inverted slope. This slope just doesn't make any sense to me.

Technical context:

  • This is for a Spring Boot application with a PostgreSQL database in AWS
  • There are 6 pods deployed in our Kubernetes cluster; auto-scaling was disabled for this test
  • Keep-alive is used by our Gatling test (see the answer below; it turned out this was not actually the case)
  • The Kubernetes ingress configuration is left at its defaults, which implies keep-alive to each upstream if I read the defaults correctly
  • Both the database and CPU per pod are not maxed out
  • The network uplink of our load testing machine is not maxed out and the machine does nothing else besides running the load test
  • The load (requests / sec) on the application is nearly constant and not changing after the warmup / during the measurement
  • Garbage collection activity is low

Here is another image to demonstrate the "triangle" before we made some application-side optimizations to request latency:
[image: Gatling chart of the same triangular pattern before the application-side latency optimizations]


Answers (1)

栖竹 2025-02-20 08:59:31

This turned out to be a two-part issue:

  • We thought our load test was using keep-alive connections, but it wasn't (SSL handshakes are expensive, and ephemeral ports run out after a while); see the Gatling sketch below
  • A custom priority-based task scheduling system (an earlier request and its subtasks have higher priority than later requests) "lost" its task priority because of how Kotlin coroutines work: thread A gets suspended during a coroutine and another thread picks up the remaining work later, losing any thread-local priority. This can be fixed via asContextElement(); see the coroutine sketch below

While this does not explain the rather peculiar shape of the latency pattern, it did resolve the main issues we had, and the pattern is gone.
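
For the first point, here is a minimal sketch of sharing a keep-alive connection pool across virtual users in a Gatling simulation. It assumes the Gatling Java/Kotlin DSL; the class name, base URL, endpoint, and injection rate are placeholders, not our actual test:

```kotlin
import io.gatling.javaapi.core.*
import io.gatling.javaapi.core.CoreDsl.*
import io.gatling.javaapi.http.HttpDsl.*

// Hypothetical simulation; baseUrl and the endpoint are placeholders.
class ExampleSimulation : Simulation() {

    // shareConnections() keeps one keep-alive connection pool for all virtual
    // users instead of opening (and TLS-handshaking) a fresh connection per
    // injected user, which is what exhausted our ephemeral ports.
    private val httpProtocol = http
        .baseUrl("https://my-app.example.com")
        .shareConnections()

    private val scn = scenario("constant load")
        .exec(http("get resource").get("/api/resource"))

    init {
        setUp(
            scn.injectOpen(constantUsersPerSec(100.0).during(120))
        ).protocols(httpProtocol)
    }
}
```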
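
For the second point, a minimal sketch of how asContextElement() from kotlinx.coroutines carries a thread-local across suspension points; requestPriority is a hypothetical stand-in for our scheduler's per-request priority:

```kotlin
import kotlinx.coroutines.*

// Hypothetical stand-in for the scheduler's per-request priority.
val requestPriority: ThreadLocal<Int> = ThreadLocal.withInitial { 0 }

fun main() = runBlocking {
    // Without the context element, a value set before a suspension point is not
    // guaranteed to survive it: the coroutine may resume on a different worker
    // thread whose thread-local still holds the default, so the task "loses"
    // its priority exactly as described above.
    launch(Dispatchers.Default + requestPriority.asContextElement(value = 5)) {
        println("before suspend: ${requestPriority.get()}") // prints 5
        delay(100)                                          // may resume on another thread
        println("after suspend:  ${requestPriority.get()}") // still 5, restored by asContextElement()
    }.join()
}
```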
