有效地将 64 位 Double 转换为 ByteString

发布于 2024-12-18 21:01:05 字数 3155 浏览 3 评论 0原文

我编写了一个将 64 位 Double 转换为 ByteString 的函数(架构/类型安全并不是真正的问题 - 让我们现在假设 Double 是 64 位 Word)。虽然下面的函数运行良好,但我想知道是否有更快的方法将 Double 转换为 ByteString。在下面的代码中,首先将 Word64 解包到 Word8 列表中,然后进行反向操作(使其成为小端格式),然后打包到 ByteString 中。代码如下:

{-# LANGUAGE MagicHash #-}
import GHC.Prim
import GHC.Types
import GHC.Word
import Data.Bits (shiftR)
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (ByteString)
import Text.Printf (printf)

encodeDouble :: Double -> ByteString
encodeDouble (D# x) = pack $ reverse $ unpack64 $ W64# (unsafeCoerce# x)

unpack64 :: Word64 -> [Word8]
unpack64 x = map (fromIntegral.(shiftR x)) [56,48..0]

-- function to convert list of bytestring into hex digits - for debugging
bprint :: ByteString -> String
bprint x = ("0x" ++ ) $ foldl (++) "" $ fmap (printf "%02x") $ unpack x

main = putStrLn $ bprint $ encodeDouble 7234.4

Mac x86 上的示例 GHCi 输出:

*Main> bprint $ encodeDouble 7234.4
"0x666666666642bc40"

虽然代码似乎运行良好,但我计划使用它将大量 Double 值编码为 ByteString,然后再通过 IPC 发送。因此,如果有的话,我将不胜感激有关使其更快的指示。

在我看来,必须将 double 解包到 Word8 中,然后打包到 ByteString 中。因此,整体算法可能无法进行太多改进。但是,使用更高效的解包/打包功能可能会有所作为(如果有的话)。

编辑1: 我刚刚在 Mac (GHC 7.0.3) 上发现了另一个并发症 - 由于这个错误,上面的代码不会在 GHC 中编译 - 到目前为止我正在 GHCi 中进行测试:

$ ghc -O --make t.hs
[1 of 1] Compiling Main             ( t.hs, t.o )

/var/folders/_q/33htc59519b3xq7y6xv100z40000gp/T/ghc6976_0/ghc6976_0.s:285:0:
    suffix or operands invalid for `movsd'

/var/folders/_q/33htc59519b3xq7y6xv100z40000gp/T/ghc6976_0/ghc6976_0.s:304:0:
    suffix or operands invalid for `movsd'

所以,看起来我必须依靠 FFI (谷物/data-binary-ieee754 包)直到这个错误被修复,或者直到我找到解决方法。看起来与 GHC Ticket 4092 相关。如果这是一个新错误或不同的错误,请纠正我。目前,我无法编译它:(

EDIT2: 更新代码以使用 unsafeCoerce 修复了编译问题。下面的代码带有 Criterion 基准测试:

{-# LANGUAGE MagicHash #-}
import GHC.Prim
import GHC.Types
import GHC.Word
import Data.Bits (shiftR)
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (ByteString)
import Text.Printf (printf)
import Unsafe.Coerce
import Criterion.Main

--encodeDouble :: Double -> ByteString
encodeDouble  x = pack $ reverse $ unpack64 $ unsafeCoerce x

unpack64 :: Word64 -> [Word8]
unpack64 x = map (fromIntegral.(shiftR x)) [56,48..0]

main = defaultMain [
        bgroup "encodeDouble" [
          bench "78901.234"  $ whnf encodeDouble 78901.234
          , bench "789.01" $ whnf encodeDouble 789.01
          ]
       ]

Criterion 输出(截断):

estimating cost of a clock call...
mean is 46.09080 ns (36 iterations)

benchmarking encodeDouble/78901.234
mean: 218.8732 ns, lb 218.4946 ns, ub 219.3389 ns, ci 0.950
std dev: 2.134809 ns, lb 1.757455 ns, ub 2.568828 ns, ci 0.950

benchmarking encodeDouble/789.01
mean: 219.5382 ns, lb 219.0744 ns, ub 220.1296 ns, ci 0.950
std dev: 2.675674 ns, lb 2.197591 ns, ub 3.451464 ns, ci 0.950

经过进一步分析,大部分瓶颈似乎都在 unpack64 中。强制转换大约需要 6 纳秒。 unpack64 大约需要 195 纳秒。将 word64 解包为 word8 的列表在这里是相当昂贵的。

I wrote a function to convert 64-bit Double to ByteString (architecture/type safety is not really an issue - let us assume for now that the Double is 64-bit Word). While the function below works well, I am wondering if there is a faster way to convert the Double to ByteString. In the code below, there is one unpack of Word64 into Word8 list, followed by reverse (to make it little endian format), and then packing into ByteString. The code is below:

{-# LANGUAGE MagicHash #-}
import GHC.Prim
import GHC.Types
import GHC.Word
import Data.Bits (shiftR)
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (ByteString)
import Text.Printf (printf)

encodeDouble :: Double -> ByteString
encodeDouble (D# x) = pack $ reverse $ unpack64 $ W64# (unsafeCoerce# x)

unpack64 :: Word64 -> [Word8]
unpack64 x = map (fromIntegral.(shiftR x)) [56,48..0]

-- function to convert list of bytestring into hex digits - for debugging
bprint :: ByteString -> String
bprint x = ("0x" ++ ) $ foldl (++) "" $ fmap (printf "%02x") $ unpack x

main = putStrLn $ bprint $ encodeDouble 7234.4

A sample GHCi output on Mac x86:

*Main> bprint $ encodeDouble 7234.4
"0x666666666642bc40"

While the code seems to work well, I plan to use it for encoding lot of Double values into ByteString before sending it over IPC. So, I will appreciate pointers on making it faster, if there are any.

It does seem to me that double must be unpacked into Word8, and then packed into ByteString. So, may be the overall algorithm as it is, can't be improved upon much. But, using a more efficient unpack/pack function will probably make a difference, if there were one.

EDIT1:
I just discovered another complication on Mac (GHC 7.0.3) - the code above won't compile in GHC because of this error - I was testing in GHCi so far:

$ ghc -O --make t.hs
[1 of 1] Compiling Main             ( t.hs, t.o )

/var/folders/_q/33htc59519b3xq7y6xv100z40000gp/T/ghc6976_0/ghc6976_0.s:285:0:
    suffix or operands invalid for `movsd'

/var/folders/_q/33htc59519b3xq7y6xv100z40000gp/T/ghc6976_0/ghc6976_0.s:304:0:
    suffix or operands invalid for `movsd'

So, it looks like I have to fall back on FFI (cereal/data-binary-ieee754 package) until this bug is fixed, or until I find a workaround. Looks like related to GHC Ticket 4092. Please correct me if this is a new bug, or a different bug. For now, I can't compile it :(

EDIT2:
Updating the code to use unsafeCoerce fixes the compilation issue. Code below with Criterion benchmark:

{-# LANGUAGE MagicHash #-}
import GHC.Prim
import GHC.Types
import GHC.Word
import Data.Bits (shiftR)
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (ByteString)
import Text.Printf (printf)
import Unsafe.Coerce
import Criterion.Main

--encodeDouble :: Double -> ByteString
encodeDouble  x = pack $ reverse $ unpack64 $ unsafeCoerce x

unpack64 :: Word64 -> [Word8]
unpack64 x = map (fromIntegral.(shiftR x)) [56,48..0]

main = defaultMain [
        bgroup "encodeDouble" [
          bench "78901.234"  $ whnf encodeDouble 78901.234
          , bench "789.01" $ whnf encodeDouble 789.01
          ]
       ]

Criterion Output (truncated):

estimating cost of a clock call...
mean is 46.09080 ns (36 iterations)

benchmarking encodeDouble/78901.234
mean: 218.8732 ns, lb 218.4946 ns, ub 219.3389 ns, ci 0.950
std dev: 2.134809 ns, lb 1.757455 ns, ub 2.568828 ns, ci 0.950

benchmarking encodeDouble/789.01
mean: 219.5382 ns, lb 219.0744 ns, ub 220.1296 ns, ci 0.950
std dev: 2.675674 ns, lb 2.197591 ns, ub 3.451464 ns, ci 0.950

On further analysis, most of the bottleneck seems to be in unpack64. Coercion takes ~6ns. unpack64 takes ~195ns. Unpacking the word64 as a list of word8 is quite expensive here.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

凤舞天涯 2024-12-25 21:01:06

我最近添加了对 IEEE-754 浮点数的支持 cereal,您可以在 中找到 binary 的类似功能href="http://hackage.haskell.org/packages/archive/data-binary-ieee754/0.4.2.1/doc/html/Data-Binary-IEEE754.html" rel="nofollow noreferrer">数据-binary-ieee754。下面是一个使用 cereal 版本将 piByteString 往返的示例:

Prelude Data.Serialize> runGet getFloat64be $ runPut $ putFloat64be pi
Right 3.141592653589793

它使用 ST 数组的技巧来快速完成转换;请参阅这个较早的问题了解更多详情。

更新:哦,我应该知道如何使用我为库贡献的调用...

更新x2:关于编译失败,我认为这不合格作为一个错误。

我没有仔细查看为该特定代码生成的程序集,但 movsd 指令的操作数正在变得混乱。来自英特尔的第 11.4.1.1 节x86手册

MOVSD(移动标量双精度浮点)将 64 位双精度浮点操作数从内存传输到 XMM 寄存器的低四字,反之亦然,或者在 XMM 寄存器之间传输。

在未优化的代码中,您可以看到类似于 movsd LnTH(%rip),%xmm0 的指令,但在 -O 代码中,您会看到类似于 movsd Ln2cJ 的指令(%rip),%rax,其中 %rax 是通用寄存器,而不是 XMM 寄存器。

优化器可能会根据所涉及的数据类型对需要在寄存器之间移动的数据表示做出假设。 unsafeCoerce 和朋友们使这些假设无效,因此当指令选择器认为它正在为 D# 选择正确的操作时,它实际上会发出尝试填充 D#< 的代码。 /code> 很适合 W64#

由于处理这个问题需要优化器放弃许多让它在正常情况下发出更好的代码的假设,所以我倾向于说这不是一个错误,而是一个关于为什么不安全函数的好故事承担买者自负警告。

I recently added support for IEEE-754 floats to cereal, and you can find similar functions for binary in data-binary-ieee754. Here's an example using the cereal version to roundtrip pi to a ByteStringand back:

Prelude Data.Serialize> runGet getFloat64be $ runPut $ putFloat64be pi
Right 3.141592653589793

It uses a trick with ST arrays to do the conversion quickly; see this earlier question for more details.

Update: D'oh, I should know how to use calls I contributed to the library...

Update x2: Regarding the compile failure, I don't think this qualifies as a bug.

I haven't looked too carefully at the generated assembly for this particular code, but the operands to a movsd instruction are getting fouled up. From §11.4.1.1 of the Intel x86 manual:

The MOVSD (move scalar double-precision floating-point) transfers a 64-bit double-precision floating-point operand from memory to the low quadword of an XMM register or vice versa, or between XMM registers.

In the unoptimized code, you have fine instructions like movsd LnTH(%rip),%xmm0, but in the -O code, you see things like movsd Ln2cJ(%rip),%rax, where %rax is a general-purpose register, rather than an XMM register.

The optimizer is likely making assumptions about the representations of data it needs to move between registers based on the type of data involved. unsafeCoerce and friends invalidate those assumptions, so when the instruction selector thinks it's choosing the right operation for a D#, it's actually emitting code that tries to stuff that D# where a W64# would happily fit.

Since handling this would require the optimizer to abandon many of the assumptions that let it emit better code under normal circumstances, I'm inclined to say this is not a bug but rather a good story for why unsafe functions bear a caveat emptor warning.

人生百味 2024-12-25 21:01:06

请注意,文档说,这里使用 unsafeCoerce# 是危险的

将一个未装箱类型转换为相同大小的另一个未装箱类型(但不是浮点类型和整型类型之间的强制转换

关于速度,避免中间列表并直接写入内存可能会更快通过来自 Data.ByteString.InternalunsafeCreate

Note that the use of unsafeCoerce# is dangerous here, the docs say

Casting an unboxed type to another unboxed type of the same size (but not coercions between floating-point and integral types)

Concerning the speed, it may be faster to avoid the intermediate list and directly write to the memory via unsafeCreate from Data.ByteString.Internal.

凯凯我们等你回来 2024-12-25 21:01:06

根据 acfoltzer(谷物源代码)和 Daniel Fischer(unsafeCreate)的建议,我编写了下面的代码,该代码非常适合我的用例,而且速度也很快:

{-#LANGUAGE MagicHash #-}
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (unsafeCreate,ByteString)
import Data.Bits (shiftR)
import GHC.Int (Int64)
import GHC.Prim
import GHC.Types
import GHC.Word
import Unsafe.Coerce
import Criterion.Main
import Foreign

-- | Write a Word64 in little endian format
putWord64le :: Word64 -> Ptr Word8 -> IO()
putWord64le w p = do
  poke p               (fromIntegral (w)           :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w  8) :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
  poke (p `plusPtr` 4) (fromIntegral (shiftR w 32) :: Word8)
  poke (p `plusPtr` 5) (fromIntegral (shiftR w 40) :: Word8)
  poke (p `plusPtr` 6) (fromIntegral (shiftR w 48) :: Word8)
  poke (p `plusPtr` 7) (fromIntegral (shiftR w 56) :: Word8)

{-# INLINE putWord64le #-}

encodeDouble :: Double -> ByteString
encodeDouble x = unsafeCreate 8 (putWord64le $ unsafeCoerce x)

main :: IO ()
main = defaultMain [
        bgroup "encodeDouble" [
          bench "78901.234"  $ whnf encodeDouble 78901.234
          , bench "789.01" $ whnf encodeDouble 789.01
          ]
       ]

标准输出(截断):

estimating cost of a clock call...
mean is 46.80361 ns (35 iterations)
found 5 outliers among 35 samples (14.3%)
  3 (8.6%) high mild
  2 (5.7%) high severe

benchmarking encodeDouble/78901.234
mean: 18.80689 ns, lb 18.73805 ns, ub 18.97247 ns, ci 0.950
std dev: 516.7499 ps, lb 244.8588 ps, ub 1.043685 ns, ci 0.950

benchmarking encodeDouble/789.01
mean: 18.96963 ns, lb 18.90986 ns, ub 19.06127 ns, ci 0.950
std dev: 374.2191 ps, lb 275.3313 ps, ub 614.4281 ps, ci 0.950

从 ~220ns 降至 ~19ns,好的!我在编译时没有做任何花哨的事情。在 GHC7(Mac、x86_64)中只需 -O 标志即可。

现在,尝试找出如何快速使用双打列表!

Following the suggestion of acfoltzer (cereal source code), and Daniel Fischer (unsafeCreate), I wrote the code below that works well for my use case, and is fast too:

{-#LANGUAGE MagicHash #-}
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (unsafeCreate,ByteString)
import Data.Bits (shiftR)
import GHC.Int (Int64)
import GHC.Prim
import GHC.Types
import GHC.Word
import Unsafe.Coerce
import Criterion.Main
import Foreign

-- | Write a Word64 in little endian format
putWord64le :: Word64 -> Ptr Word8 -> IO()
putWord64le w p = do
  poke p               (fromIntegral (w)           :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w  8) :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
  poke (p `plusPtr` 4) (fromIntegral (shiftR w 32) :: Word8)
  poke (p `plusPtr` 5) (fromIntegral (shiftR w 40) :: Word8)
  poke (p `plusPtr` 6) (fromIntegral (shiftR w 48) :: Word8)
  poke (p `plusPtr` 7) (fromIntegral (shiftR w 56) :: Word8)

{-# INLINE putWord64le #-}

encodeDouble :: Double -> ByteString
encodeDouble x = unsafeCreate 8 (putWord64le $ unsafeCoerce x)

main :: IO ()
main = defaultMain [
        bgroup "encodeDouble" [
          bench "78901.234"  $ whnf encodeDouble 78901.234
          , bench "789.01" $ whnf encodeDouble 789.01
          ]
       ]

Criterion output (truncated):

estimating cost of a clock call...
mean is 46.80361 ns (35 iterations)
found 5 outliers among 35 samples (14.3%)
  3 (8.6%) high mild
  2 (5.7%) high severe

benchmarking encodeDouble/78901.234
mean: 18.80689 ns, lb 18.73805 ns, ub 18.97247 ns, ci 0.950
std dev: 516.7499 ps, lb 244.8588 ps, ub 1.043685 ns, ci 0.950

benchmarking encodeDouble/789.01
mean: 18.96963 ns, lb 18.90986 ns, ub 19.06127 ns, ci 0.950
std dev: 374.2191 ps, lb 275.3313 ps, ub 614.4281 ps, ci 0.950

From ~220ns down to ~19ns, nice! I didn't do anything fancy in compilation. Just -O flag will do in GHC7 (Mac, x86_64).

Now, trying to figure out how to do it fast with list of doubles!

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文