Mutable, (possibly parallel) Haskell code and performance tuning
I have now implemented another SHA3 candidate, namely Grøstl. This is still work in progress (very much so), but at the moment a 224-bit version passes all KATs. So now I'm wondering about performance (again :->). The difference this time is that I chose to more closely mirror the (optimized) C implementation, i.e. I made a port from C to Haskell. The optimized C version uses table lookups to implement the algorithm. Furthermore, the code is heavily based on updating an array containing 64-bit words. Thus I chose to use mutable unboxed vectors in Haskell.
My Grøstl code can be found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/GroestlMutable.hs
Short description of the algorithm: it's a Merkle-Damgård construction, iterating a compression function (f512M in my code) as long as there are 512-bit blocks of message left. The compression function is very simple: it simply runs two different, independent 512-bit permutations P and Q (permP and permQ in my code) and combines their output. It's these permutations that are implemented by lookup tables.
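For reference, the overall structure is roughly the following. This is a simplified, pure sketch: permP/permQ are taken as parameters, the names f512/outerLoop/Block are illustrative, and the real f512M, padding and output transformation live in GroestlMutable.hs.

-- Simplified sketch of the Merkle-Damgård outer loop over 512-bit blocks.
-- Grøstl's compression is h' = h `xor` P(h `xor` m) `xor` Q(m).
import Data.Bits (xor)
import Data.List (foldl')
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V

type Block = V.Vector Word64  -- one 512-bit block as 8 x 64-bit words

-- compression function: combine the two independent permutations P and Q
f512 :: (Block -> Block) -> (Block -> Block) -> Block -> Block -> Block
f512 permP permQ h m = V.zipWith3 xor3 h (permP (V.zipWith xor h m)) (permQ m)
  where xor3 a b c = a `xor` b `xor` c

-- iterate the compression function while message blocks remain
outerLoop :: (Block -> Block) -> (Block -> Block) -> Block -> [Block] -> Block
outerLoop permP permQ = foldl' (f512 permP permQ)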
Q1) The first thing that bothers me is that the use of mutable vectors makes my code look really fugly. This is my first time writing any major mutable code in Haskell, so I don't really know how to improve this. Any tips on how I might better structure the monadic code would be welcome.
Q2) The second thing is performance. Actually, it's not too bad: at the moment the Haskell code is only 3 times slower. Using GHC-7.2.1 and compiling like this:
ghc -O2 -Odph -fllvm -optlo-O3 -optlo-loop-reduce -optlo-loop-deletion
the Haskell code takes 60 seconds on an input of ~1 GB, while the C version takes 21-22 seconds. But there are some things I find odd:
(1) If I try to inline rnd512QM, the code takes 4 times longer, but if I inline rnd512PM nothing happens! Why is this happening? These two functions are virtually identical!
(2) This is maybe more difficult. I've been experimenting with executing the two permutations in parallel. But currently to no avail. This is one example of what I tried:
f512 h m = V.force outP `par` (V.force outQ `pseq` (V.zipWith3 xor3 h outP outQ))
  where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
        inP  = V.zipWith xor h m
        outP = permP inP
        outQ = permQ m
When checking the run-time statistics and using ThreadScope, I noticed that the correct number of SPARKS was created, but almost none were actually converted into useful parallel work. Thus I gained nothing in speedup. My questions then become:
- Are the P and Q functions just too small for the runtime to bother to run in parallel?
- If not, is my use of par and pseq (and possibly Vector.Unboxed.force) wrong?
- Would I gain anything by switching to strategies? And how would I go about doing that? (A rough sketch follows below.)
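For reference, moving to Strategies would mean phrasing the same split with the Eval monad, roughly like this. This is only a sketch: permP and permQ are passed in to keep the fragment self-contained, and it is subject to the same spark-size questions as above.

-- The rpar/rseq pattern from Control.Parallel.Strategies: spark P's output,
-- evaluate Q's output locally, then wait for the spark before combining.
import Control.Parallel.Strategies (runEval, rpar, rseq)
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V

f512S :: (V.Vector Word64 -> V.Vector Word64)   -- permP
      -> (V.Vector Word64 -> V.Vector Word64)   -- permQ
      -> V.Vector Word64 -> V.Vector Word64 -> V.Vector Word64
f512S permP permQ h m = runEval $ do
  outP <- rpar (permP (V.zipWith xor h m))   -- may run on another capability
  outQ <- rseq (permQ m)                     -- evaluated on this thread
  _    <- rseq outP                          -- wait for the sparked result
  return (V.zipWith3 (\a b c -> a `xor` b `xor` c) h outP outQ)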
Thank you so much for your time.
EDIT:
Sorry for not providing any real benchmark tests. The testing code in the repo was intended just for my own use. For those wanting to test the code out, you will need to compile main.hs and then run it as:
./main "algorithm" "testvariant" "byte aligned"
For instance:
./main groestl short224 False
or
./main groestl e False
(e stands for "Extreme"; it's the very long message provided with the NIST KATs.)
I checked out the repo, but there's no simple benchmark to just run and play with, so my ideas are just from eyeballing the code. Numbering is unrelated to your questions.

1) I'm pretty sure force doesn't do what you want -- it actually forces a copy of the underlying vector.

2) I think the use of unsafeThaw and unsafeFreeze is sort of odd. I'd just put f512M in the ST monad and be done with it. Then run it something like so:
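A minimal sketch of what that could look like. f512M below is only a stand-in that xors the block into the state, and hashBlocks is a made-up name rather than the repo's actual API; the point is the single thaw/freeze around an in-place loop.

import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- stand-in for the real compression: just xors the block into the state
f512M :: MV.MVector s Word64 -> V.Vector Word64 -> ST s ()
f512M h m =
    forM_ [0 .. MV.length h - 1] $ \i -> do
        hi <- MV.read h i
        MV.write h i (hi `xor` (m V.! i))

-- thaw once, update in place per block, freeze once at the very end
hashBlocks :: V.Vector Word64 -> [V.Vector Word64] -> V.Vector Word64
hashBlocks iv blocks = runST $ do
    h <- V.thaw iv
    mapM_ (f512M h) blocks
    V.unsafeFreeze h          -- safe here: h isn't used after this point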
3) V.foldM' is sort of silly -- you can just use a normal (strict) foldM over a list -- folding over the vector in the second argument doesn't seem to buy anything.

4) I'm dubious about the bangs in columnM and for the unsafeReads.

Also...

a) I suspect that xoring unboxed vectors can probably be implemented at a lower level than zipWith, making use of Data.Vector internals.

b) However, it may be better not to do this as it could interfere with vector fusion.

c) On inspection, extractByte looks slightly inefficient? Rather than using fromIntegral to truncate, maybe use mod or quot and then a single fromIntegral to take you directly to an Int.
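To illustrate (c): the repo's extractByte isn't reproduced here, so the first version below is only an assumed shape, but the contrast with a single fromIntegral is the point.

import Data.Bits (shiftR)
import Data.Word (Word64, Word8)

-- assumed original shape: truncate to Word8 first, then widen to Int
extractByte :: Int -> Word64 -> Int
extractByte n w = fromIntegral (fromIntegral (w `shiftR` (8 * n)) :: Word8)

-- suggested shape: mod (or a mask) plus one fromIntegral straight to Int
extractByte' :: Int -> Word64 -> Int
extractByte' n w = fromIntegral ((w `shiftR` (8 * n)) `mod` 256)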
1) Be sure to compile with -threaded -rtsopts and execute with +RTS -N2. Without that, you won't have more than one OS thread to perform computations.

2) Try to spark computations that are referred to elsewhere, otherwise they might be collected:
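For example, something along these lines, sparking the same outP thunk that the zipWith3 later consumes. This is a sketch reusing the question's f512; permP and permQ are taken as parameters only to keep it self-contained.

import Control.Parallel (par, pseq)
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.Vector.Unboxed as V

f512 :: (V.Vector Word64 -> V.Vector Word64)   -- permP
     -> (V.Vector Word64 -> V.Vector Word64)   -- permQ
     -> V.Vector Word64 -> V.Vector Word64 -> V.Vector Word64
f512 permP permQ h m =
    outP `par` (outQ `pseq` V.zipWith3 xor3 h outP outQ)
  where
    xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
    outP = permP (V.zipWith xor h m)  -- this exact thunk is sparked and consumed
    outQ = permQ m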
3) If you switch things up so parseBlock accepts strict bytestrings (or chunks and packs lazy ones when needed), then you can use Data.Vector.Storable and potentially avoid some copying.
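A rough sketch of the zero-copy direction for point 3. parseBlock itself isn't shown; byteStringToVector and wordsOfBlock are made-up names, and a real version still has to handle alignment and Grøstl's required byte order.

import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import qualified Data.Vector.Storable as SV
import Data.Word (Word8, Word64)

-- view a strict ByteString as a Storable vector of bytes without copying
byteStringToVector :: B.ByteString -> SV.Vector Word8
byteStringToVector bs = SV.unsafeFromForeignPtr fptr off len
  where (fptr, off, len) = BI.toForeignPtr bs

-- reinterpret the bytes as 64-bit words (host byte order, and it assumes the
-- payload is suitably aligned; byte swapping is still up to the caller)
wordsOfBlock :: B.ByteString -> SV.Vector Word64
wordsOfBlock = SV.unsafeCast . byteStringToVector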