Performance considerations for Haskell FFI/C?

Published 2024-11-01 20:45:08 · 1035 characters · 1 view · 0 comments


If using Haskell as a library called from my C program, what is the performance impact of making calls into it? For instance, if I have a problem-world data set of, say, 20 kB, and I want to run something like:

// Go through my 1000 actors and have them make a decision based on
// HaskellCode() function, which is compiled Haskell I'm accessing through
// the FFI.  As an argument, send in the SAME 20kB of data to EACH of these
// function calls, and some actor specific data
// The 20kB constant data defines the environment and the actor specific
// data could be their personality or state
for(i = 0; i < 1000; i++)
   actor[i].decision = HaskellCode(20kB of data here, actor[i].personality);

What's going to happen here - is it going to be possible for me to keep that 20kB of data as a global immutable reference somewhere that is accessed by the Haskell code, or must I create a copy of that data each time through?

The concern is that this data could be larger, much larger - I also hope to write algorithms that act on much larger sets of data, using the same pattern of immutable data being used by several calls of the Haskell code.

Also, I'd like to parallelize this, like a dispatch_apply() GCD or Parallel.ForEach(..) C#. My rationale for parallelization outside of Haskell is that I know I will always be operating on many separate function calls i.e. 1000 actors, so using fine-grained parallelization inside Haskell function is no better than managing it at the C level. Is running FFI Haskell instances 'Thread Safe' and how do I achieve this - do I need to initialize a Haskell instance every time I kick off a parallel run? (Seems slow if I must..) How do I achieve this with good performance?


Comments (4)

云淡风轻 2024-11-08 20:45:08


what is the performance impact of making calls into it

Assuming you start the Haskell runtime up only once (like this), then on my machine, making a function call from C into Haskell, passing an Int back and forth across the boundary, takes about 80,000 cycles (31,000 ns on my Core 2) -- determined experimentally via the rdtsc counter.

is it going to be possible for me to keep that 20kB of data as a global immutable reference somewhere that is accessed by the Haskell code

Yes, that is certainly possible. If the data really is immutable, then you get the same result whether you:

  • thread the data back and forth across the language boundary by marshalling;
  • pass a reference to the data back and forth;
  • or cache it in an IORef on the Haskell side.

Which strategy is best? It depends on the data type. The most idiomatic way would be to pass a reference to the C data back and forth, treating it as a ByteString or Vector on the Haskell side.
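To make that concrete, here is a minimal sketch of the zero-copy approach (my own illustration, not code from the answer): the C side keeps owning the 20 kB buffer, and the exported Haskell function wraps the pointer as a ByteString for the duration of the call. The `decide` function and its scoring logic are hypothetical stand-ins.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Decide where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BSU
import Foreign.C.Types (CChar, CInt (..))
import Foreign.Ptr (Ptr)

-- Wrap the C-owned environment buffer as a ByteString with no copy.
-- unsafePackCStringLen is zero-copy, so this is only valid while the
-- C side keeps the buffer alive and unchanged during the call.
decide :: Ptr CChar -> CInt -> CInt -> IO CInt
decide envPtr envLen personality = do
  env <- BSU.unsafePackCStringLen (envPtr, fromIntegral envLen)
  -- hypothetical decision logic: just fold over the environment bytes
  let score = BS.foldl' (\acc b -> acc + fromIntegral b) (0 :: Int) env
  return (fromIntegral score + personality)

foreign export ccall decide :: Ptr CChar -> CInt -> CInt -> IO CInt
```

On the C side the loop body then becomes something like `actor[i].decision = decide(env, env_len, actor[i].personality);` -- the buffer is passed by pointer each time but never copied.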

I'd like to parallelize this

I'd strongly recommend inverting the control then, and doing the parallelization from the Haskell runtime -- it'll be much more robust, as that path has been heavily tested.

Regarding thread safety, it is apparently safe to make parallel calls to foreign exported functions running in the same runtime -- though I'm fairly sure no one has tried this in order to gain parallelism. Calls into the runtime acquire a capability, which is essentially a lock, so multiple calls may block, reducing your chances for parallelism. In the multicore case (e.g. -N4 or so) your results may be different (multiple capabilities are available); however, this is almost certainly a bad way to improve performance.

Again, making many parallel function calls from Haskell via forkIO is a better-documented, better-tested path, with less overhead than doing the work on the C side, and probably less code in the end.

Just make a call into your Haskell function, which in turn will do the parallelism via many Haskell threads. Easy!
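A sketch of that shape (my own illustration, not Don's code): C makes one call, and Haskell fans the actors out over forkIO threads. `decideOne` is a hypothetical decision function; link the compiled library with the threaded runtime and run with `+RTS -N` so the threads actually spread over cores.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module ParActors where

import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Control.Monad (forM)
import Foreign.C.Types (CInt (..))
import Foreign.Marshal.Array (peekArray, pokeArray)
import Foreign.Ptr (Ptr)

-- Hypothetical pure per-actor decision.
decideOne :: CInt -> CInt
decideOne personality = personality * 2 + 1

-- Split the actors into one chunk per capability and evaluate each
-- chunk on its own Haskell thread, collecting results in order.
decideMany :: [CInt] -> IO [CInt]
decideMany ps = do
  caps <- getNumCapabilities
  let size = max 1 ((length ps + caps - 1) `div` caps)
  vars <- forM (chunksOf size ps) $ \chunk -> do
    v <- newEmptyMVar
    -- evaluate forces each decision on the worker thread itself
    _ <- forkIO (putMVar v =<< mapM (evaluate . decideOne) chunk)
    return v
  concat <$> mapM takeMVar vars

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = take k xs : chunksOf k (drop k xs)

-- Single entry point for C: read the personalities, decide in
-- parallel, and write the decisions into the caller's output array.
decideAll :: Ptr CInt -> Ptr CInt -> CInt -> IO ()
decideAll inPtr outPtr n = do
  ps <- peekArray (fromIntegral n) inPtr
  pokeArray outPtr =<< decideMany ps

foreign export ccall decideAll :: Ptr CInt -> Ptr CInt -> CInt -> IO ()
```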

断肠人 2024-11-08 20:45:08


I use a mix of C and Haskell threads for one of my applications and haven't noticed that much of a performance hit switching between the two. So I crafted a simple benchmark... which is quite a bit faster/cheaper than Don's. This is measuring 10 million iterations on a 2.66GHz i7:

$ ./foo
IO  : 2381952795 nanoseconds total, 238.195279 nanoseconds per, 160000000 value
Pure: 2188546976 nanoseconds total, 218.854698 nanoseconds per, 160000000 value

Compiled with GHC 7.0.3/x86_64 and gcc-4.2.1 on OSX 10.6

ghc -no-hs-main -lstdc++ -O2 -optc-O2 -o foo ForeignExportCost.hs Driver.cpp

Haskell:

{-# LANGUAGE ForeignFunctionInterface #-}

module ForeignExportCost where

import Foreign.C.Types

foreign export ccall simpleFunction :: CInt -> CInt
simpleFunction i = i * i

foreign export ccall simpleFunctionIO :: CInt -> IO CInt
simpleFunctionIO i = return (i * i)

And an OSX C++ app to drive it, should be simple to adjust to Windows or Linux:

#include <stdio.h>
#include <mach/mach_time.h>
#include <mach/kern_return.h>
#include <HsFFI.h>
#include "ForeignExportCost_stub.h"

static const int s_loop = 10000000;

int main(int argc, char** argv) {
    hs_init(&argc, &argv);

    struct mach_timebase_info timebase_info = { };
    kern_return_t err;
    err = mach_timebase_info(&timebase_info);
    if (err != KERN_SUCCESS) {
        fprintf(stderr, "error: %x\n", err);
        return err;
    }

    // timing a function in IO
    uint64_t start = mach_absolute_time();
    HsInt32 val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunctionIO(4);
    }

    // in nanoseconds per http://developer.apple.com/library/mac/#qa/qa1398/_index.html
    uint64_t duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    double duration_per = static_cast<double>(duration) / s_loop;
    printf("IO  : %lld nanoseconds total, %f nanoseconds per, %d value\n", duration, duration_per, val);

    // run the loop again with a pure function
    start = mach_absolute_time();
    val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunction(4);
    }

    duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    duration_per = static_cast<double>(duration) / s_loop;
    printf("Pure: %lld nanoseconds total, %f nanoseconds per, %d value\n", duration, duration_per, val);

    hs_exit();
}
抹茶夏天i‖ 2024-11-08 20:45:08


Disclaimer: I have no experience with the FFI.

But it seems to me that if you want to reuse the 20 Kb of data so you're not passing it every time, then you could simply have a method that takes a list of "personalities", and returns a list of "decisions".

So if you have a function

f :: LotsaData -> Personality -> Decision
f env p = ...

Then why not make a helper function

helper :: LotsaData -> [Personality] -> [Decision]
helper env ps = map (f env) ps

And invoke that? Using this way, though, if you wanted to parallelize, you would need to do it Haskell-side with parallel lists and parallel map.
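A sketch of that parallel variant, assuming the `parallel` package is available; `LotsaData`, `Personality`, `Decision`, and `f` are placeholder stand-ins, since the answer leaves them abstract:

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, using)

type LotsaData = [Int]   -- placeholder for the shared environment
type Personality = Int   -- placeholder
type Decision = Int      -- placeholder

-- Stand-in for the real decision function.
f :: LotsaData -> Personality -> Decision
f env p = sum env + p

-- Same shape as the helper above, but each chunk of 50 results is
-- sparked onto a capability (run with +RTS -N to use the cores).
helperPar :: LotsaData -> [Personality] -> [Decision]
helperPar env ps = map (f env) ps `using` parListChunk 50 rseq
```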

I defer to the experts to explain if/how C arrays can be marshaled into Haskell lists (or similar structure) easily.
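On the marshalling question, a small self-contained sketch: `peekArray` copies a C array into a Haskell list. The `withArray` here merely simulates a C-owned pointer; in real FFI code the `Ptr CInt` would come from the C caller.

```haskell
import Foreign.C.Types (CInt)
import Foreign.Marshal.Array (peekArray, withArray)

main :: IO ()
main =
  -- withArray allocates a temporary C array; it stands in for a
  -- pointer that a real C caller would hand to the exported function.
  withArray ([1, 2, 3, 4] :: [CInt]) $ \ptr -> do
    xs <- peekArray 4 ptr          -- O(n) copy into a Haskell list
    print (map (* 10) xs)          -- prints [10,20,30,40]
```

For larger data, `Data.Vector.Storable` can wrap the foreign buffer without copying, at the usual cost of keeping the buffer alive for the vector's lifetime.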
