F# runs my algorithm slower than Python
Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699
Comments (4)
Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
Moral of the story:
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing the Dictionary with an array makes F# finally run at the speed a compiled language is expected to run at doesn't negate the need for a fix in Dictionary's speed (I hope the F# people at MS are reading this). Other algorithms depend on dictionaries/hashes and can't easily be switched to using arrays; making programs suffer "interpreter speeds" whenever one uses a Dictionary is, arguably, a bug. If, as some have said in the comments, the problem is not with F# but with the .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, one that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that), is to change the way the dictionary is constructed, as sketched below:
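A minimal sketch of the change (the dictionary name and the key/value types here are illustrative guesses, not the original code; the essential part, passing HashIdentity.Structural, is exactly what the answers below recommend):

open System.Collections.Generic

// Before: the default constructor. Tuple keys are still compared structurally,
// but through the general-purpose comparer, i.e. via virtual dispatch on every lookup:
//     let memo = Dictionary<int * int, int64>()

// After: be explicit about structural hashing/equality for the keys:
let memo = Dictionary<int * int, int64>(HashIdentity.Structural)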
This change makes the F# code run 2.7x faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a "computed goto" that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.
As Jon Harrop has pointed out, simply constructing the dictionaries using
Dictionary(HashIdentity.Structural)
gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and it keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.
Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interesting discussion. My code partially fixed the performance issue, but it is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow; I believe there's no significant difference from Python's or other languages' implementations. Also, I don't agree with "you can't translate code 1-to-1 and expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to expect both languages to be about equally fast.
I think the performance issue is related to garbage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in an SO question (Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and an MSDN page (Building tuples):
Of course, as Jon said in another comment, the obvious optimization in your example is to replace the hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, the Python code uses hashtables, not arrays).
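As a rough illustration of that optimization (this is not the post's code; the (file index, remaining capacity) DP state and all names and sizes are assumptions about the fillupDVD-style problem):

// Instead of a Dictionary<int * int, int64> keyed by (fileIndex, remainingCapacity),
// use a rectangular array indexed directly: plain integer indexing, no hashing,
// no collision handling, and a compact contiguous layout.
let numFiles, capacity = 100, 4700                            // illustrative sizes only
let memo = Array2D.create (numFiles + 1) (capacity + 1) (-1L) // -1L marks "not computed yet"

// Lookups and updates become plain array accesses:
memo.[3, 4200] <- 0L
let cached = memo.[3, 4200]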
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
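A hedged reconstruction of the kind of struct meant here (the type and field names are guesses; the actual definition is in the pastebin linked above):

// A small value type carrying the two ints that the tuple held. As a struct it is
// not heap-allocated, so building a key for every lookup no longer allocates,
// and F# generates structural equality and hashing for it by default.
[<Struct>]
type Pair =
    val i : int
    val j : int
    new (i, j) = { i = i; j = j }

// Used as the key type: Dictionary<Pair, int64>(), looked up with Pair(idx, size).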
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i, x)) fileSizes out of the enclosing loop. I believe Python's enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or to use the Seq module, or to use a mutable counter).
Hmm.. if the hashtable is the major bottleneck, then it is probably the hash function itself. I haven't looked at the specific hash function, but one of the most common hash functions is
((a * x + b) % p) % q
The modulus operation % is painfully slow. If p and q are of the form 2^k - 1, the modulus can be computed with an AND, an add and a shift operation, as sketched below.
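A minimal F# sketch of that trick (not code from the answer; the function name is made up):

// Reduce x modulo p = 2^k - 1 using only AND, shift and add.
// Since 2^k is congruent to 1 (mod p), the high bits can be folded onto the low bits
// until the value fits below 2^k; one final adjustment handles the r = p case.
let modMersenne (k: int) (x: uint64) : uint64 =
    let p = (1UL <<< k) - 1UL
    let mutable r = x
    while r > p do
        r <- (r &&& p) + (r >>> k)
    if r = p then 0UL else r

// Example: modMersenne 13 100000UL = 1708UL, i.e. 100000 % 8191.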
Dietzfelbinger's universal hash function h_a : [2^w] -> [2^l]
h_a(x) = floor(((a * x) mod 2^w) / 2^(w-l))
where a is a random odd w-bit seed.
It can be computed by (a * x) >> (w - l), which is orders of magnitude faster than the first hash function. I once had to implement a hash table with a linked list for collision handling; it took 10 minutes to implement and test, we tested it with both functions and analysed the difference in speed. As I remember, the second hash function gave around a 4-10x speed gain, depending on the size of the table.
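A minimal F# sketch of that multiply-shift hash (again not code from the answer; the seed constant is just an arbitrary odd number chosen for illustration):

// Dietzfelbinger-style multiply-shift hash from w = 64 bits down to l bits.
// uint64 multiplication wraps modulo 2^64 for free, so the whole hash is
// one multiply and one shift, with no % operation at all.
let multiplyShiftHash (a: uint64) (l: int) (x: uint64) : uint64 =
    (a * x) >>> (64 - l)

// Example: hash into a table with 2^10 = 1024 buckets.
let seed = 0x9E3779B97F4A7C15UL    // any random odd 64-bit number can serve as the seed a
let bucket = multiplyShiftHash seed 10 42UL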
But the thing to learn here is that if your program's bottleneck is hashtable lookups, the hash function has to be fast too.