Confusing runtime behavior of Scala parallel collections

Posted on 2024-12-13 08:29:50


Edit: My sample size was too small. When I ran it against the real data on 8 CPUs, I saw a 7.2x speed increase. Not too shabby for adding 4 characters to my code ;)

I am currently in the process of trying to 'sell' management on the benefits of using Scala, especially when it comes to scaling with CPUs. To that end, I created a simple test application that does a bunch of vector math, and was a bit surprised to find that the runtime was not noticeably better on my quad-core machine. Interestingly enough, I found that the runtime is worst the first time you go through the collection and gets better on subsequent calls. Is there something lazy in the parallel collections that causes this, or am I just doing this wrong? It should be noted that I come from the C++/C# world, so it's entirely possible that I have messed up my configuration somehow. Regardless, here's my setup:

IntelliJ Scala Plugin

Scala 2.9.1.final

Windows 7 64 bit, Quad-Core Processor (no hyperthreading)

import util.Random

  // simple Vector3D class with final x, y, z components, a 'length', and a '-' function
  class Vector3D(val x:Double,  val y:Double, val z:Double)
  {
    def length = math.sqrt(x*x+y*y+z*z)
    def -(rhs : Vector3D ) = new Vector3D(x - rhs.x, y - rhs.y, z - rhs.z)
  }

object MainClass {

  def main(args : Array[String]) =
  {
    println("Available CPU's: " + Runtime.getRuntime.availableProcessors())
    println("Parallelism Degree set to: " + collection.parallel.ForkJoinTasks.defaultForkJoinPool.getParallelism);
    // my position
    val myPos = new Vector3D(0,0,0);

    val r = new Random(0);

    // define a function nextRand that gets us a random between 0 and 100
    def nextRand = r.nextDouble() * 100;

    // make 10 million random targets
    val targets = (0 until 10000000).map(_ => new Vector3D(nextRand, nextRand, nextRand)).toArray
    // take the .par hit before we start profiling
    val parTargets = targets.par

    println("Created " + targets.length + " vectors")

    // define a range function
    val rangeFunc : (Vector3D => Double) = (targetPos) => (targetPos - myPos).length

    // we'll select ones that are <50
    val within50 : (Vector3D => Boolean) = (targetPos) => rangeFunc(targetPos) < 50

    // time it sequentially
    val startTime_sequential = System.currentTimeMillis()
    val numTargetsInRange_sequential = targets.filter(within50)
    val endTime_sequential = System.currentTimeMillis()
    println("Sequential (ms): " + (endTime_sequential - startTime_sequential))

    // do the parallel version 10 times
    for(i <- 1 to 10)
    {

      val startTime_par = System.currentTimeMillis()
      val numTargetsInRange_parallel = parTargets.filter(within50)
      val endTime_par = System.currentTimeMillis()

      val ms = endTime_par - startTime_par;
      println("Iteration[" + i + "] Executed in " + ms + " ms")
    }
  }
}

The output of this program is:

Available CPU's: 4
Parallelism Degree set to: 4
Created 10000000 vectors
Sequential (ms): 216
Iteration[1] Executed in 227 ms
Iteration[2] Executed in 253 ms
Iteration[3] Executed in 76 ms
Iteration[4] Executed in 78 ms
Iteration[5] Executed in 77 ms
Iteration[6] Executed in 80 ms
Iteration[7] Executed in 78 ms
Iteration[8] Executed in 78 ms
Iteration[9] Executed in 79 ms
Iteration[10] Executed in 82 ms

So what's going on here? The first 2 times we do the filter it's slower, and then things speed up? I understand that there is inherently a parallelism startup cost; I'm just trying to figure out where it makes sense to express the parallelism in my application, and specifically I want to be able to show Management a program that runs 3-4 times faster on a quad-core box. Is this just not a good problem?

Ideas?


就像说晚安 2024-12-20 08:29:50


You have the micro-benchmark disease. You are most likely benchmarking the JIT compile phase. You'll need to warm up your JIT with a pre-run first.

Probably the best idea is to use a micro-benchmarking framework like http://code.google.com/p/caliper/ which handles all that for you.

Edit: There is a nice SBT Template for Caliper benchmarking Scala projects, as referenced from this blog post
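A minimal hand-rolled warm-up harness (a sketch only, not Caliper; the helper name and warm-up count are my own) could look like:

```scala
object WarmupTimer {
  // run `body` a few times so the JIT compiles the hot path,
  // then time one final run and return (result, elapsed ms)
  def timeAfterWarmup[A](warmups: Int)(body: => A): (A, Long) = {
    var i = 0
    while (i < warmups) { body; i += 1 } // warm-up runs; results discarded
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

With this, the first timed measurement already reflects compiled code; a real framework like Caliper automates the same idea and adds statistics, forking, and GC control on top.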

九歌凝 2024-12-20 08:29:50


Things do speed up, but that has nothing to do with parallel versus sequential; you are not comparing apples to apples. The JVM has a JIT (just-in-time) compiler that compiles some bytecode only after the code has been used a certain number of times. So what you see in the first iterations is slower execution of code that is not yet JIT-compiled, plus the time taken by the ongoing JIT compilation itself. Removing the .par so that it's all sequential, here is what I see on my machine (10x fewer iterations because I'm using an older machine):

Sequential (ms): 312
Iteration[1] Executed in 117 ms
Iteration[2] Executed in 112 ms
Iteration[3] Executed in 112 ms
Iteration[4] Executed in 112 ms
Iteration[5] Executed in 114 ms
Iteration[6] Executed in 113 ms
Iteration[7] Executed in 113 ms
Iteration[8] Executed in 117 ms
Iteration[9] Executed in 113 ms
Iteration[10] Executed in 111 ms

But it's all sequential! You can see what the JVM does in terms of JIT by using the JVM flag -XX:+PrintCompilation (set it in JAVA_OPTS, or use the -J-XX:+PrintCompilation scala option). In the first iterations you will see a large number of JVM print statements showing what is being JIT-compiled, and then it stabilizes later on.
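For example (a hypothetical invocation; MainClass is the object from the question's code and must already be compiled):

```shell
# pass the JIT-logging flag through the scala runner for one run
scala -J-XX:+PrintCompilation MainClass

# or set it once for every JVM started from this shell
export JAVA_OPTS=-XX:+PrintCompilation
```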

So to compare apples to apples, you first run without the .par, then add the .par and run the same program. On my dual core, when using .par I get:

Sequential (ms): 329
Iteration[1] Executed in 197 ms
Iteration[2] Executed in 60 ms
Iteration[3] Executed in 57 ms
Iteration[4] Executed in 58 ms
Iteration[5] Executed in 59 ms
Iteration[6] Executed in 73 ms
Iteration[7] Executed in 56 ms
Iteration[8] Executed in 60 ms
Iteration[9] Executed in 58 ms
Iteration[10] Executed in 57 ms

So more or less a 2x speedup once it's stable.

On a related note, the other thing you want to be careful with is boxing and unboxing, especially if you are comparing to plain Java. The Scala library's higher-order functions like filter box and unbox primitive types, and this is typically the source of initial disappointment for those who convert code from Java to Scala.

Though it does not apply in this case, since the for is outside the timing, there is also some cost to using for instead of while, but the 2.9.1 compiler should do a decent job when the -optimize scalac flag is used.
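To illustrate the boxing point (a sketch with made-up names): a hand-rolled while loop over a primitive array does the same counting work as a filter, but without going through a generic Function1 or boxing each Double:

```scala
object NoBoxingSketch {
  // count elements below `limit` with a plain while loop:
  // no closure dispatch, no boxing of the Double values
  def countWithin(xs: Array[Double], limit: Double): Int = {
    var i = 0
    var n = 0
    while (i < xs.length) {
      if (xs(i) < limit) n += 1
      i += 1
    }
    n
  }
}
```

Comparing this against xs.filter(_ < limit).length under a profiler is a quick way to see how much of a sequential baseline is boxing overhead rather than the math itself.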

闻呓 2024-12-20 08:29:50


Next to the JIT optimizations mentioned before, a key concept you need to evaluate is whether your problem lends itself to parallelization: there is an inherent cost of splitting, thread coordination, and joining that weighs against the advantage of doing things in parallel. Scala hides this complexity from you, but you do need to know when to apply it to get good results.

In your case, although you are performing a huge number of operations, each operation on its own is almost trivial for the CPU. To see the parallel collections in action, try an operation that is heavy on a per-element basis.

For a similar Scala presentation, I used a simple (inefficient) algorithm to calculate whether a number is prime:
def isPrime(x:Int) = (2 to x/2).forall(y=>x%y!=0)

Then I used the same logic you presented to determine which numbers in a collection are prime:

val col = 1 to 1000000
col.filter(isPrime(_))  // sequential
col.par.filter(isPrime(_)) // parallel

The CPU behaviour really showed the difference between the two:

[Figure: CPU usage, prime numbers: sequential vs parallel]

The time was about 3.5x better for the parallel collection on a 4-core CPU.

姜生凉生 2024-12-20 08:29:50


How about

val numTargetsInRange_sequential = parTargets.filter(within50)

?

Also, you will probably get more impressive results with a map rather than a filter operation.
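A sketch of that suggestion (the Vec3 class and all names here are mine, not from the question): a map does a fixed amount of work per element and produces one output per input, so parallel speedups are easier to demonstrate than with a filter, whose output size depends on the predicate:

```scala
object MapVsFilterSketch {
  case class Vec3(x: Double, y: Double, z: Double) {
    def length: Double = math.sqrt(x * x + y * y + z * z)
  }

  // map: one output per input, uniform work per element
  def ranges(targets: Seq[Vec3]): Seq[Double] = targets.map(_.length)

  // filter: output size (and copying cost) depends on how many pass
  def within(targets: Seq[Vec3], limit: Double): Seq[Vec3] =
    targets.filter(_.length < limit)
}
```

On a parallel collection the same calls apply; the map variant keeps every worker equally busy, which tends to show the cleanest scaling numbers.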
