A change to my library made it much slower. Profiling does not help me. What could be the reason for the slowness?


My Problem, Briefly

I made a change to my library, now it's much slower but I can't figure out where it spends all that additional time. Profiling reports are not helping. Please help me figure out what the reason might be.

Some Context

I made a Redis client-library called Hedis and have a benchmark program for it. Now, I made some internal changes to the library, to clean up the architecture. This caused performance (in Redis-requests per second, as measured by said benchmark) to drop by a factor of about 2.5.

The benchmark opens 50 network connections to a Redis server on localhost. The connections are handled differently between the two versions:

  • The fast version uses one thread per connection (thus, the benchmark has 50 threads running concurrently). It reads from the socket
    handle by using unsafeInterleaveIO (I describe my approach in broad
    strokes in a blog post here). I was a bit unhappy
    with the architecture, hence I changed things to
  • the slow version, which uses three threads per connection. They communicate via two Chans (150 threads running in the benchmark); a rough sketch of this setup follows right after this list.
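
To make this concrete, here is a rough, simplified sketch of what I mean by "three threads per connection, two Chans". This is not the actual Hedis code (the names are made up and the real split of responsibilities may differ in detail), just the shape of the design:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (forever)
import qualified Data.ByteString as B
import System.IO (Handle, hFlush)

-- One Chan carries requests from the caller to a writer thread, the other
-- carries raw replies from a reader thread back to the caller.
data Connection = Connection
    { reqChan   :: Chan B.ByteString
    , replyChan :: Chan B.ByteString
    }

connect :: Handle -> IO Connection
connect h = do
    reqs    <- newChan
    replies <- newChan
    -- writer thread: drains the request Chan and pushes the bytes to the socket
    _ <- forkIO $ forever $ do
        req <- readChan reqs
        B.hPut h req
        hFlush h
    -- reader thread: reads from the socket and hands the replies back
    _ <- forkIO $ forever $ do
        reply <- B.hGetSome h 4096
        writeChan replies reply
    return (Connection reqs replies)

-- Together with the calling thread that uses the Connection, that makes
-- three threads and two Chans per connection.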

Some more information that might be relevant:

  • Compiled with GHC 7.2.2.
  • The benchmark program is unchanged between the two versions, so network
    traffic is the same.
  • Both versions use the single-threaded runtime (compiled without
    -threaded).
  • All threads are created by calling forkIO. Not the more expensive
    forkOS.

Profiling Results

Profiling does not give me a clear reason for the drop in performance. According to the profiling report, both versions spend more than 99% of their time in System.IO.hFlush and Data.ByteString.hGetSome. The number of times hFlush and hGetSome are called is the same in both versions. Since network traffic is also the same in both cases, these functions can't be the reason for the slow-down.

The only significant difference I can measure between the two versions is what time (the Unix utility) tells me: the slow version (with three times as many threads) spends significantly more time in "sys" instead of "user", compared to the fast version. The GHC +RTS -s flag reports this as reduced productivity.

Here are the program outputs for both versions with the +RTS -s flag:

Benchmark of the Fast Version

$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping                   33305.29 Req/s
get                    25802.92 Req/s
mget                   18215.94 Req/s
ping (pipelined)      268994.36 Req/s
   5,118,163,904 bytes allocated in the heap
     185,075,608 bytes copied during GC
       4,084,384 bytes maximum residency (39 sample(s))
         916,544 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7416 colls,     0 par    0.38s    0.40s     0.0001s    0.0003s
  Gen  1        39 colls,     0 par    0.03s    0.03s     0.0007s    0.0009s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    7.93s  ( 12.34s elapsed)
  GC      time    0.41s  (  0.43s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.33s  ( 12.76s elapsed)

  %GC     time       4.9%  (3.3% elapsed)

  Alloc rate    645,587,554 bytes per MUT second

  Productivity  95.1% of total user, 62.1% of total elapsed


real    0m12.772s
user    0m8.334s
sys     0m4.424s

Benchmark of the Slow Version

$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping                   11457.83 Req/s
get                    11169.64 Req/s
mget                    8446.96 Req/s
ping (pipelined)      130114.31 Req/s
   6,053,055,680 bytes allocated in the heap
   1,184,574,408 bytes copied during GC
       9,750,264 bytes maximum residency (198 sample(s))
       2,872,280 bytes maximum slop
              26 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9105 colls,     0 par    2.11s    2.14s     0.0002s    0.0006s
  Gen  1       198 colls,     0 par    0.23s    0.24s     0.0012s    0.0093s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   10.99s  ( 27.92s elapsed)
  GC      time    2.34s  (  2.38s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   13.33s  ( 30.30s elapsed)

  %GC     time      17.6%  (7.8% elapsed)

  Alloc rate    550,656,490 bytes per MUT second

  Productivity  82.4% of total user, 36.3% of total elapsed


real    0m30.305s
user    0m13.333s
sys     0m16.964s

Do you have any ideas or hints where that additional time might come from?

2 Answers

心舞飞扬 2025-01-07 23:22:56

According to the profiling report, most of the time is spent in hFlush and hGetSome. According to time, the slow version takes a lot more sys time. So, my hypothesis is that a lot of time is spent blocked and waiting, whether it's waiting for more input or locking and unlocking threads.

Here's the first thing I would do: compile the code with -threaded and see what happens. The threaded runtime uses a completely different IO manager, and I strongly suspect that this single change will fix your problem.
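
If you build with cabal, something along these lines should be enough to test it (this is just one way to pass the flag; alternatively, add -threaded to the ghc-options field of the benchmark executable in your .cabal file):

$ cabal configure --ghc-options="-threaded"
$ cabal build
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p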

不及他 2025-01-07 23:22:56

My guess would be that it's something to do with the overhead of Chan.

My first thought was increased GC time, but that doesn't appear to be the case at all. So my second thought is that maybe all the locking and unlocking involved in using Chan (which is implemented on top of MVar) is the problem. But this is still only a guess.

You might try TChan (i.e., STM) instead and see if that makes the slightest shred of difference. (Perhaps you could code up a small skeleton just to compare the two and see if that's where the problem is, rather than reimplement your "real" code.)
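
For example, a tiny skeleton along these lines (hypothetical code, not taken from your library) would let you time the raw overhead of Chan against TChan in isolation, say by running each with a large n under time:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TChan
import Control.Monad (forM_, replicateM_)

-- Push n Ints from one thread to another through a Chan.
chanVersion :: Int -> IO ()
chanVersion n = do
    c    <- newChan
    done <- newChan
    _ <- forkIO (replicateM_ n (readChan c) >> writeChan done ())
    forM_ [1 .. n] (writeChan c)
    readChan done    -- wait for the consumer to finish

-- The same exchange, but over a TChan.
tchanVersion :: Int -> IO ()
tchanVersion n = do
    c    <- atomically newTChan
    done <- atomically newTChan
    _ <- forkIO (replicateM_ n (atomically (readTChan c)) >> atomically (writeTChan done ()))
    forM_ [1 .. n] (atomically . writeTChan c)
    atomically (readTChan done)    -- wait for the consumer to finish

main :: IO ()
main = chanVersion 1000000 >> tchanVersion 1000000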

Other than that, I'm out of ideas.
