F# runs my algorithm slower than Python
Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699
Comments (4)
Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
Moral of the story:
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing the Dictionary with an array makes F# finally run at the speed a compiled language is expected to run at doesn't negate the need for a fix in Dictionary's speed (I hope the F# people at MS are reading this). Other algorithms depend on dictionaries/hashes and can't easily be switched to using arrays; making programs suffer "interpreter speeds" whenever one uses a Dictionary is, arguably, a bug. If, as some have said in the comments, the problem is not with F# but with the .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, one that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that), is to change the way the dictionary is constructed, as sketched below:
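A minimal sketch of the change (the dictionary name and the key/value types here are illustrative guesses, not the original code; the essential part, passing HashIdentity.Structural, is exactly what the answers below recommend):

open System.Collections.Generic

// Before: the default constructor. Tuple keys are still compared structurally,
// but through the general-purpose comparer, i.e. via virtual dispatch on every lookup:
//     let memo = Dictionary<int * int, int64>()

// After: be explicit about structural hashing/equality for the keys:
let memo = Dictionary<int * int, int64>(HashIdentity.Structural)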
This change makes the F# code run 2.7x faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a "computed goto" that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.
As Jon Harrop has pointed out, simply constructing the dictionaries using
Dictionary(HashIdentity.Structural)
gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and it keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.
Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interesting discussion. My code partially fixed the performance issue, but it is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow; I believe there's no significant difference from Python's or other languages' implementations. Also, I don't agree with "you can't translate code 1-to-1 and expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to expect both languages to be about equally fast.
I think the performance issue is related to garbage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in an SO question (Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and an MSDN page (Building tuples):
Of course, as Jon said in another comment, the obvious optimization in your example is to replace the hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, the Python code uses hashtables, not arrays).
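As a rough illustration of that optimization (this is not the post's code; the (file index, remaining capacity) DP state and all names and sizes are assumptions about the fillupDVD-style problem):

// Instead of a Dictionary<int * int, int64> keyed by (fileIndex, remainingCapacity),
// use a rectangular array indexed directly: plain integer indexing, no hashing,
// no collision handling, and a compact contiguous layout.
let numFiles, capacity = 100, 4700                            // illustrative sizes only
let memo = Array2D.create (numFiles + 1) (capacity + 1) (-1L) // -1L marks "not computed yet"

// Lookups and updates become plain array accesses:
memo.[3, 4200] <- 0L
let cached = memo.[3, 4200]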
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
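A hedged reconstruction of the kind of struct meant here (the type and field names are guesses; the actual definition is in the pastebin linked above):

// A small value type carrying the two ints that the tuple held. As a struct it is
// not heap-allocated, so building a key for every lookup no longer allocates,
// and F# generates structural equality and hashing for it by default.
[<Struct>]
type Pair =
    val i : int
    val j : int
    new (i, j) = { i = i; j = j }

// Used as the key type: Dictionary<Pair, int64>(), looked up with Pair(idx, size).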
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i, x)) fileSizes out of the enclosing loop. I believe Python's enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or to use the Seq module, or to use a mutable counter).
Hmm.. if the hashtable is the major bottleneck, then it is probably the hash function itself. I haven't looked at the specific hash function, but one of the most common hash functions is
((a * x + b) % p) % q
The modulus operation % is painfully slow. If p and q are of the form 2^k - 1, the modulus can be computed with an AND, an add and a shift operation, as sketched below.
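A minimal F# sketch of that trick (not code from the answer; the function name is made up):

// Reduce x modulo p = 2^k - 1 using only AND, shift and add.
// Since 2^k is congruent to 1 (mod p), the high bits can be folded onto the low bits
// until the value fits below 2^k; one final adjustment handles the r = p case.
let modMersenne (k: int) (x: uint64) : uint64 =
    let p = (1UL <<< k) - 1UL
    let mutable r = x
    while r > p do
        r <- (r &&& p) + (r >>> k)
    if r = p then 0UL else r

// Example: modMersenne 13 100000UL = 1708UL, i.e. 100000 % 8191.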
Dietzfelbinger's universal hash function h_a : [2^w] -> [2^l]
h_a(x) = floor(((a * x) mod 2^w) / 2^(w-l))
where a is a random odd w-bit seed.
It can be computed by (a * x) >> (w - l), which is orders of magnitude faster than the first hash function. I once had to implement a hash table with a linked list for collision handling; it took 10 minutes to implement and test, we tested it with both functions and analysed the difference in speed. As I remember, the second hash function gave around a 4-10x speed gain, depending on the size of the table.
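A minimal F# sketch of that multiply-shift hash (again not code from the answer; the seed constant is just an arbitrary odd number chosen for illustration):

// Dietzfelbinger-style multiply-shift hash from w = 64 bits down to l bits.
// uint64 multiplication wraps modulo 2^64 for free, so the whole hash is
// one multiply and one shift, with no % operation at all.
let multiplyShiftHash (a: uint64) (l: int) (x: uint64) : uint64 =
    (a * x) >>> (64 - l)

// Example: hash into a table with 2^10 = 1024 buckets.
let seed = 0x9E3779B97F4A7C15UL    // any random odd 64-bit number can serve as the seed a
let bucket = multiplyShiftHash seed 10 42UL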
But the thing to learn here is that if your program's bottleneck is hashtable lookups, the hash function has to be fast too.