A change to my library made it much slower. Profiling isn't helping me. What could be the reason for the slow-down?
My Problem, Briefly
I made a change to my library, and now it is much slower, but I can't figure out where it spends all the additional time. Profiling reports are not helping. Please help me figure out what the reason might be.
Some Context
I made a Redis client-library called Hedis and have a benchmark program for it. Now, I made some internal changes to the library, to clean up the architecture. This caused performance (in Redis-requests per second, as measured by said benchmark) to drop by a factor of about 2.5.
The benchmark opens 50 network connections to a Redis server on localhost. The connections are handled differently between the two versions:
- The fast version uses one thread per connection (thus, the benchmark has 50 threads running concurrently). It reads from the socket handle using unsafeInterleaveIO (I describe my approach in broad strokes in a blog post here). I was a bit unhappy with the architecture, hence I changed things to
- the slow version, which uses three threads per connection. They communicate via two Chans (150 threads running in the benchmark).
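In broad strokes, the slow version's per-connection plumbing might look like the following sketch (invented names and a loopback "socket" for illustration; the real Hedis internals differ):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (forever)

-- One writer thread, one reader thread, plus the caller's own thread:
-- three threads per connection, decoupled by two Chans.
data Connection = Connection
  { reqChan   :: Chan String  -- requests waiting to be written to the socket
  , replyChan :: Chan String  -- replies waiting to be consumed by the caller
  }

connect :: (String -> IO ()) -> IO String -> IO Connection
connect send recv = do
  reqs    <- newChan
  replies <- newChan
  _ <- forkIO (forever (readChan reqs >>= send))      -- socket writer thread
  _ <- forkIO (forever (recv >>= writeChan replies))  -- socket reader thread
  return (Connection reqs replies)

-- Every request and reply crosses both Chans, so each round-trip involves
-- several thread wake-ups even on an idle connection.
request :: Connection -> String -> IO String
request conn cmd = do
  writeChan (reqChan conn) cmd
  readChan (replyChan conn)

-- Demo wiring: a Chan stands in for the network socket and echoes requests.
main :: IO ()
main = do
  sock  <- newChan
  conn  <- connect (writeChan sock) (readChan sock)
  reply <- request conn "PING"
  putStrLn reply
```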
Some more information that might be relevant:
- Compiled with GHC 7.2.2.
- The benchmark program is unchanged between the two versions, so network traffic is the same.
- Both versions use the single-threaded runtime (compiled without -threaded).
- All threads are created by calling forkIO, not the more expensive forkOS.
Profiling Results
Profiling does not give me a clear reason for the drop in performance. According to the profiling report, both versions spend more than 99% of their time in System.IO.hFlush and Data.ByteString.hGetSome. The number of times hFlush and hGetSome are called is the same in both versions. Since network traffic is also the same in both cases, these functions can't be the reason for the slow-down.
The only significant difference I can measure between the two versions is what time (the Unix utility) tells me: the slow version (with three times as many threads) spends significantly more time in "sys" instead of "user", compared to the fast version. The GHC +RTS -s flag reports this as reduced productivity.
Here are the program outputs for both versions with the +RTS -s flag:
Benchmark of the Fast Version
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping 33305.29 Req/s
get 25802.92 Req/s
mget 18215.94 Req/s
ping (pipelined) 268994.36 Req/s
5,118,163,904 bytes allocated in the heap
185,075,608 bytes copied during GC
4,084,384 bytes maximum residency (39 sample(s))
916,544 bytes maximum slop
10 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7416 colls, 0 par 0.38s 0.40s 0.0001s 0.0003s
Gen 1 39 colls, 0 par 0.03s 0.03s 0.0007s 0.0009s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.93s ( 12.34s elapsed)
GC time 0.41s ( 0.43s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.33s ( 12.76s elapsed)
%GC time 4.9% (3.3% elapsed)
Alloc rate 645,587,554 bytes per MUT second
Productivity 95.1% of total user, 62.1% of total elapsed
real 0m12.772s
user 0m8.334s
sys 0m4.424s
Benchmark of the Slow Version
$ time ./dist/build/hedis-benchmark/hedis-benchmark +RTS -s -p
ping 11457.83 Req/s
get 11169.64 Req/s
mget 8446.96 Req/s
ping (pipelined) 130114.31 Req/s
6,053,055,680 bytes allocated in the heap
1,184,574,408 bytes copied during GC
9,750,264 bytes maximum residency (198 sample(s))
2,872,280 bytes maximum slop
26 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9105 colls, 0 par 2.11s 2.14s 0.0002s 0.0006s
Gen 1 198 colls, 0 par 0.23s 0.24s 0.0012s 0.0093s
INIT time 0.00s ( 0.00s elapsed)
MUT time 10.99s ( 27.92s elapsed)
GC time 2.34s ( 2.38s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.33s ( 30.30s elapsed)
%GC time 17.6% (7.8% elapsed)
Alloc rate 550,656,490 bytes per MUT second
Productivity 82.4% of total user, 36.3% of total elapsed
real 0m30.305s
user 0m13.333s
sys 0m16.964s
Do you have any ideas or hints where that additional time might come from?
2 Answers
According to the profiling report, most of the time is spent in hFlush and hGetSome. According to time, the slow version takes a lot more sys time. So, my hypothesis is that a lot of time is spent blocked and waiting, whether it's waiting for more input or locking and unlocking threads.
Here's the first thing I would do: compile the code with -threaded and see what happens. The threaded runtime uses a completely different IO manager, and I strongly suspect that this single change will fix your problem.
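Concretely, switching to the threaded runtime is a one-flag rebuild. A sketch (the source file name here is assumed; -threaded, -rtsopts and -N are real GHC/RTS flags):

```shell
# Rebuild with the threaded RTS; -rtsopts allows passing +RTS flags at run time.
ghc -O2 -threaded -rtsopts hedis-benchmark.hs -o hedis-benchmark

# Run as before; with -threaded you can also try more than one capability:
./hedis-benchmark +RTS -s -N2
```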
My guess would be that it's something to do with the overhead of Chan.
My first thought was increased GC time, but that doesn't appear to be the case at all. So my second thought is that maybe all the locking and unlocking involved in using Chan (which is implemented on top of MVar) is the problem. But this is still only a guess.
You might try TChan (i.e., STM) instead and see if that makes the slightest shred of difference. (Perhaps you could code up a small skeleton just to compare the two and see if that's where the problem is, rather than reimplement your "real" code.) Other than that, I'm out of ideas.
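A skeleton for that comparison might look like this (throwaway code, not Hedis itself; it assumes the contention pattern is one writer and one reader per channel):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Control.Concurrent.STM
import Control.Monad (replicateM_)

items :: Int
items = 100000

-- Push `items` units through a Chan: one forked reader, main thread writes.
viaChan :: IO ()
viaChan = do
  ch   <- newChan
  done <- newEmptyMVar
  _ <- forkIO (replicateM_ items (readChan ch) >> putMVar done ())
  replicateM_ items (writeChan ch ())
  takeMVar done

-- Same workload through a TChan, i.e. STM instead of MVar-based locking.
viaTChan :: IO ()
viaTChan = do
  ch   <- newTChanIO
  done <- newEmptyMVar
  _ <- forkIO (replicateM_ items (atomically (readTChan ch)) >> putMVar done ())
  replicateM_ items (atomically (writeTChan ch ()))
  takeMVar done

-- Time each half separately, e.g. with the Unix `time` utility or criterion.
main :: IO ()
main = viaChan >> viaTChan >> putStrLn "done"
```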