Why are my Scala futures not more efficient?

Posted on 2024-09-17 09:44:02

I'm running this Scala code on a 32-bit quad-core Core2 system:

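// Sum i, i+s, i+2s, ... up to 500000000.
def job(i: Int, s: Int): Long = {
  val r = (i to 500000000 by s).map(_.toLong).foldLeft(0L)(_ + _)
  println("Job " + i + " done")
  r
}

import scala.actors.Future
import scala.actors.Futures._

val JOBS = 4

// Start one future per job; job i sums every JOBS-th integer starting at i.
val jobs = (0 until JOBS).toList.map(i => future { job(i, JOBS) })
println("Running...")
// Applying a future blocks until its result is available.
val results = jobs.map(f => f())
println(results.foldLeft(0L)(_ + _))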

(Yes, I do know there are much more efficient ways to sum a series of integers; it's just to give the CPU something to do).

Depending on what I set JOBS to, the code runs in the following times:

JOBS=1 : 31.99user 0.84system 0:28.87elapsed 113%CPU
JOBS=2 : 27.71user 1.12system 0:14.74elapsed 195%CPU
JOBS=3 : 33.19user 0.39system 0:13.02elapsed 257%CPU
JOBS=4 : 49.08user 8.46system 0:22.71elapsed 253%CPU

I'm surprised that this doesn't really scale well beyond 2 futures "in play". I write a lot of multithreaded C++ code and have no doubt I'd get good scaling up to 4 cores and see >390% CPU utilisation if I coded this sort of thing with Intel's TBB or boost::threads (it would be considerably more verbose, of course).

So: what's going on, and how can I get the scaling to 4 cores I'd expect to see? Is this limited by something in Scala or the JVM? It occurs to me I don't actually know "where" Scala's futures run... is a thread spawned per future, or does "Futures" provide a thread pool dedicated to running them?

[I'm using the Scala 2.7.7 packages from Debian/Squeeze on a Lenny system with sun-java6 (6-20-0lennny1).]

Update:

As suggested in Rex's answer, I recoded to avoid object creation.

def job(i:Long,s:Long):Long = {
  var t=0L
  var v=i
  while (v<=10000000000L) {
    t+=v
    v+=s
  }
  println("Job "+i+" done")
  t
}
// Rest as above...

This was so much faster that I had to significantly increase the iteration count to make it run for any appreciable time! Results are:

JOBS=1: 28.39user 0.06system 0:29.25elapsed 97%CPU
JOBS=2: 28.46user 0.04system 0:14.95elapsed 190%CPU
JOBS=3: 24.66user 0.06system 0:10.26elapsed 240%CPU
JOBS=4: 28.32user 0.12system 0:07.85elapsed 362%CPU

which is much more like what I'd hope to see (although the 3 jobs case is a little odd, with one task consistently completing a couple of seconds before the other two).
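
To put numbers on the scaling, the elapsed times above can be turned into speedup and parallel-efficiency figures. This is only a small arithmetic sketch; the object and method names (`Speedup`, `speedup`, `efficiency`) are made up for illustration, and the values are copied from the timings above.

```scala
object Speedup {
  // Elapsed wall-clock seconds from the timings above, keyed by JOBS.
  val elapsed: Map[Int, Double] = Map(1 -> 29.25, 2 -> 14.95, 3 -> 10.26, 4 -> 7.85)

  // Speedup relative to the single-job run.
  def speedup(jobs: Int): Double = elapsed(1) / elapsed(jobs)

  // Parallel efficiency: speedup divided by the number of jobs.
  def efficiency(jobs: Int): Double = speedup(jobs) / jobs
}
```

With these figures, `Speedup.speedup(4)` comes out at roughly 3.7, i.e. a little over 90% efficiency on four cores.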

Pushing it a bit further, on a quad-core hyperthreaded i7 the latter version with JOBS=8 achieves a 4.4x speedup vs JOBS=1, with 571% CPU usage.

Comments (2)

离去的眼神 2024-09-24 09:44:10

Try

(i to 500000000 by s).view.map(_.toLong).foldLeft(0L)(_+_)

Applying view is supposed (as I understand it) to avoid repeated iteration and object creation by providing simple wrappers.

Note also that you can use reduceLeft(_+_) instead of fold.
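
As a rough sketch of the difference, the two variants can be put side by side. This assumes Scala 2.13 collection semantics rather than the 2.7.7 used in the question, and the names `ViewSum`, `strictSum` and `viewSum` are invented for illustration.

```scala
// Hypothetical helper names; assumes Scala 2.13 or later.
object ViewSum {
  // Strict: map builds a complete intermediate collection before foldLeft runs.
  def strictSum(start: Int, end: Int, step: Int): Long =
    (start to end by step).map(_.toLong).foldLeft(0L)(_ + _)

  // Lazy: the view applies toLong element by element as foldLeft consumes it,
  // so no intermediate collection is materialised.
  def viewSum(start: Int, end: Int, step: Int): Long =
    (start to end by step).view.map(_.toLong).foldLeft(0L)(_ + _)
}
```

Both return the same result; only the amount of intermediate allocation should differ. One caveat on the reduceLeft suggestion: unlike foldLeft with a start value, reduceLeft throws on an empty collection.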

鸠书 2024-09-24 09:44:09

My guess is that the garbage collector is doing more work than the addition itself. So you're limited by what the garbage collector can manage. Try running the test again with something that doesn't create any objects (e.g. use a while loop instead of the range/map/fold). You can also play with the parallel GC options if your real application will hit the GC this heavily.
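
A minimal sketch of the allocation-free approach, assuming a current Scala version: it uses `scala.concurrent.Future` rather than the `scala.actors.Futures` from the question, and an explicit fixed-size thread pool so that it is clear where the futures run. The names `StridedSum`, `job` and `parallelSum` are illustrative and not from the original post.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object StridedSum {
  // Sums start, start+step, ... up to and including limit using only
  // primitive longs, so the loop itself allocates no objects.
  def job(start: Long, step: Long, limit: Long): Long = {
    var total = 0L
    var v = start
    while (v <= limit) {
      total += v
      v += step
    }
    total
  }

  // Runs `jobs` strided partial sums on a fixed pool of `jobs` threads
  // and adds the partial results together.
  def parallelSum(jobs: Int, limit: Long): Long = {
    val pool = Executors.newFixedThreadPool(jobs)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures =
        (0 until jobs).toList.map(i => Future(job(i.toLong, jobs.toLong, limit)))
      Await.result(Future.sequence(futures), Duration.Inf).sum
    } finally pool.shutdown()
  }
}
```

For example, `StridedSum.parallelSum(4, 10000000000L)` would split the work four ways. For the GC side of the suggestion, HotSpot's parallel collector is enabled with the JVM flag `-XX:+UseParallelGC`; whether it helps depends on the workload.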
