A change to my library made it much slower. Profiling isn't helping me. What could be the reason for the slow-down?
My Problem, Briefly
I made a change to my library, and now it is much slower, but I can't figure out where it spends all the additional time. Profiling reports are not helping. Please help me figure out what the reason might be.
Some Context
I made a Redis client-library called Hedis and have a benchmark program for it. Now, I made some internal changes to the library, to clean up the architecture. This caused performance (in Redis-requests per second, as measured by said benchmark) to drop by a factor of about 2.5.
The benchmark opens 50 network connections to a Redis server on localhost. The connections are handled differently between the two versions:
- The fast version uses one thread per connection (thus, the benchmark has 50 threads running concurrently). It reads from the socket handle using unsafeInterleaveIO (I describe my approach in broad strokes in a blog post here). I was a bit unhappy with the architecture, hence I changed things to
- the slow version, which uses three threads per connection. They communicate via two Chans (150 threads running in the benchmark).
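In broad strokes, the slow version's per-connection plumbing might look like the following sketch (invented names and a loopback "socket" for illustration; the real Hedis internals differ):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (forever)

-- One writer thread, one reader thread, plus the caller's own thread:
-- three threads per connection, decoupled by two Chans.
data Connection = Connection
  { reqChan   :: Chan String  -- requests waiting to be written to the socket
  , replyChan :: Chan String  -- replies waiting to be consumed by the caller
  }

connect :: (String -> IO ()) -> IO String -> IO Connection
connect send recv = do
  reqs    <- newChan
  replies <- newChan
  _ <- forkIO (forever (readChan reqs >>= send))      -- socket writer thread
  _ <- forkIO (forever (recv >>= writeChan replies))  -- socket reader thread
  return (Connection reqs replies)

-- Every request and reply crosses both Chans, so each round-trip involves
-- several thread wake-ups even on an idle connection.
request :: Connection -> String -> IO String
request conn cmd = do
  writeChan (reqChan conn) cmd
  readChan (replyChan conn)

-- Demo wiring: a Chan stands in for the network socket and echoes requests.
main :: IO ()
main = do
  sock  <- newChan
  conn  <- connect (writeChan sock) (readChan sock)
  reply <- request conn "PING"
  putStrLn reply
```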
Some more information that might be relevant:
- Compiled with GHC 7.2.2.
- The benchmark program is unchanged between the two versions, so network traffic is the same.
- Both versions use the single-threaded runtime (compiled without -threaded).
- All threads are created by calling forkIO, not the more expensive forkOS.
Profiling Results
Profiling does not give me a clear reason for the drop in performance. According to the profiling report, both versions spend more than 99% of their time in System.IO.hFlush and Data.ByteString.hGetSome. The number of times hFlush and hGetSome are called is the same in both versions. Since network traffic is also the same in both cases, these functions can't be the reason for the slow-down.
The only significant difference I can measure between the two versions is what time (the Unix utility) tells me: the slow version (with three times as many threads) spends significantly more time in "sys" instead of "user", compared to the fast version. The GHC +RTS -s flag reports this as reduced productivity.
Here are the program outputs for both versions with the +RTS -s flag:
Benchmark of the Fast Version
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping 33305.29 Req/s
get 25802.92 Req/s
mget 18215.94 Req/s
ping (pipelined) 268994.36 Req/s
5,118,163,904 bytes allocated in the heap
185,075,608 bytes copied during GC
4,084,384 bytes maximum residency (39 sample(s))
916,544 bytes maximum slop
10 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7416 colls, 0 par 0.38s 0.40s 0.0001s 0.0003s
Gen 1 39 colls, 0 par 0.03s 0.03s 0.0007s 0.0009s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.93s ( 12.34s elapsed)
GC time 0.41s ( 0.43s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.33s ( 12.76s elapsed)
%GC time 4.9% (3.3% elapsed)
Alloc rate 645,587,554 bytes per MUT second
Productivity 95.1% of total user, 62.1% of total elapsed
real 0m12.772s
user 0m8.334s
sys 0m4.424s
Benchmark of the Slow Version
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping 11457.83 Req/s
get 11169.64 Req/s
mget 8446.96 Req/s
ping (pipelined) 130114.31 Req/s
6,053,055,680 bytes allocated in the heap
1,184,574,408 bytes copied during GC
9,750,264 bytes maximum residency (198 sample(s))
2,872,280 bytes maximum slop
26 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9105 colls, 0 par 2.11s 2.14s 0.0002s 0.0006s
Gen 1 198 colls, 0 par 0.23s 0.24s 0.0012s 0.0093s
INIT time 0.00s ( 0.00s elapsed)
MUT time 10.99s ( 27.92s elapsed)
GC time 2.34s ( 2.38s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.33s ( 30.30s elapsed)
%GC time 17.6% (7.8% elapsed)
Alloc rate 550,656,490 bytes per MUT second
Productivity 82.4% of total user, 36.3% of total elapsed
real 0m30.305s
user 0m13.333s
sys 0m16.964s
Do you have any ideas or hints where that additional time might come from?
2 Answers
According to the profiling report, most of the time is spent in hFlush and hGetSome. According to time, the slow version takes a lot more sys time. So, my hypothesis is that a lot of time is spent blocked and waiting, whether it's waiting for more input or locking and unlocking threads.
Here's the first thing I would do: compile the code with -threaded and see what happens. The threaded runtime uses a completely different IO manager, and I strongly suspect that this single change will fix your problem.
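Concretely, switching to the threaded runtime is a one-flag rebuild. A sketch (the source file name here is assumed; -threaded, -rtsopts and -N are real GHC/RTS flags):

```shell
# Rebuild with the threaded RTS; -rtsopts allows passing +RTS flags at run time.
ghc -O2 -threaded -rtsopts hedis-benchmark.hs -o hedis-benchmark

# Run as before; with -threaded you can also try more than one capability:
./hedis-benchmark +RTS -s -N2
```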
My guess would be that it's something to do with the overhead of Chan.
My first thought was increased GC time, but that doesn't appear to be the case at all. So my second thought is that maybe all the locking and unlocking involved in using Chan (which is implemented on top of MVar) is the problem. But this is still only a guess.
You might try TChan (i.e., STM) instead and see if that makes the slightest shred of difference. (Perhaps you could code up a small skeleton just to compare the two and see if that's where the problem is, rather than reimplement your "real" code.) Other than that, I'm out of ideas.
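A skeleton for that comparison might look like this (throwaway code, not Hedis itself; it assumes the contention pattern is one writer and one reader per channel):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Control.Concurrent.STM
import Control.Monad (replicateM_)

items :: Int
items = 100000

-- Push `items` units through a Chan: one forked reader, main thread writes.
viaChan :: IO ()
viaChan = do
  ch   <- newChan
  done <- newEmptyMVar
  _ <- forkIO (replicateM_ items (readChan ch) >> putMVar done ())
  replicateM_ items (writeChan ch ())
  takeMVar done

-- Same workload through a TChan, i.e. STM instead of MVar-based locking.
viaTChan :: IO ()
viaTChan = do
  ch   <- newTChanIO
  done <- newEmptyMVar
  _ <- forkIO (replicateM_ items (atomically (readTChan ch)) >> putMVar done ())
  replicateM_ items (atomically (writeTChan ch ()))
  takeMVar done

-- Time each half separately, e.g. with the Unix `time` utility or criterion.
main :: IO ()
main = viaChan >> viaTChan >> putStrLn "done"
```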