Scala speed when using the get() method on hash tables? (Are temporary Option() objects generated?)

Posted 2024-12-09 21:48:02

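I am converting some code to Scala. It sits in an inner loop over a very large amount of data, so it needs to be fast. It looks up keys in a hash table and computes probabilities, and it has to do different things depending on whether a key is found. Using the "standard" idiom, the code would look like this: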

counts.get(word) match {
  case None => {
    WordDist.overall_word_probs.get(word) match {
      case None => (unseen_mass*WordDist.globally_unseen_word_prob
                    / WordDist.num_unseen_word_types)
      case Some(owprob) => unseen_mass * owprob / overall_unseen_mass
    }
  }
  case Some(wordcount) => wordcount.toDouble/total_tokens*(1.0 - unseen_mass)
}

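But I am concerned that code of this sort will be very slow, because all these temporary Some() objects are created and then garbage-collected. The Scala2e book claims that a smart JVM "might" optimize them away so that the code does the right thing efficiency-wise, but does this actually happen with Sun's JVM? Does anyone know?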


七月上 2024-12-16 21:48:02

This may happen if you enable escape analysis in the JVM with:

-XX:+DoEscapeAnalysis

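on JRE 1.6. Essentially, it should detect objects that do not escape the method's activation frame and either allocate them on the stack or reclaim them as soon as they are no longer needed.

One thing you could do is micro-benchmark your code using the scala.testing.Benchmark trait. Extend it with a singleton object, implement the run method, then compile and run it. It calls the run method multiple times and measures the execution times.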
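As a rough illustration of that kind of measurement, here is a minimal hand-rolled timing loop. It is a sketch under assumptions, not the scala.testing.Benchmark trait itself: the object name OptionLookupBench, the helper time, and the sample data are all made up for the example, and it uses plain System.nanoTime.

```scala
object OptionLookupBench {
  // Hypothetical data standing in for the question's `counts` table.
  val counts: Map[String, Int] = (0 until 1000).map(i => s"w$i" -> (i + 1)).toMap
  // Half of these keys are present in `counts`, half are misses.
  val words: Array[String] = (0 until 2000).map(i => s"w$i").toArray

  // The lookup under test: pattern match on the Option returned by get.
  def lookupSum(): Long = {
    var sum = 0L
    var i = 0
    while (i < words.length) {
      counts.get(words(i)) match {
        case Some(c) => sum += c
        case None    => sum -= 1
      }
      i += 1
    }
    sum
  }

  // Runs `body` `reps` times and returns the elapsed nanoseconds of each run.
  def time(reps: Int)(body: => Long): Seq[Long] =
    (1 to reps).map { _ =>
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }

  def main(args: Array[String]): Unit = {
    time(20)(lookupSum()) // warm-up runs so the JIT has a chance to compile the loop
    val timings = time(10)(lookupSum())
    println(timings.map(_ / 1000).mkString("microseconds per run: ", ", ", ""))
  }
}
```

Running it once with and once without -XX:+DoEscapeAnalysis gives a first impression of whether the flag matters for this loop. Timings from a loop this simple are noisy, so treat them as a hint rather than a result.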


↙厌世 2024-12-16 21:48:02

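Yes, Some objects will be created (None is a singleton). Unless, of course, the JVM elides that, which depends on many factors, including whether the JVM thinks the code is called often enough to be worth optimizing.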
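The difference between the two cases can be seen with reference equality. This is only a small sketch; the object name OptionIdentity and the sample map are invented for the example.

```scala
object OptionIdentity {
  // Hypothetical one-entry table.
  val counts: Map[String, Int] = Map("the" -> 3)

  // Every failed lookup hands back the same None object.
  val missesShareNone: Boolean = counts.get("zzz") eq counts.get("yyy")

  // Two Some values with the same content are equal, but they are distinct objects.
  val a = Some(3)
  val b = Some(3)
  val equalButDistinct: Boolean = a == b && !(a eq b)
}
```

Both flags evaluate to true: misses cost no allocation at the language level, while each Some is a separate object unless the JVM manages to optimize it away.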

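Anyway, that code is not really the standard idiom. There's even a meme about it: an experienced Scala developer once wrote code like this, and another replied, "What's this? Amateur hour? Flatmap that sh*t!"

Anyway, here's how I'd rewrite it: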

( counts 
  get word
  map (_.toDouble / total_tokens * (1.0 - unseen_mass))
  getOrElse (
    WordDist.overall_word_probs
    get word
    map (unseen_mass * _ / overall_unseen_mass)
    getOrElse (unseen_mass * WordDist.globally_unseen_word_prob
                / WordDist.num_unseen_word_types)
  )
)

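You can then refactor this: both getOrElse arguments could be split out into separate methods with descriptive names. Since they just return a value and take no input, they should be pretty fast.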

Now, we call just two methods on Option here: map and getOrElse. Here is the beginning of their implementation:

@inline final def map
@inline final def getOrElse

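Because the parameter to getOrElse is passed by name, an anonymous function is created for it. And, of course, the argument to map is also a function. Other than that, the chances of these methods being inlined are pretty good.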
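The by-name behaviour is easy to observe: the fallback expression is wrapped up and only evaluated when the Option is empty. A minimal sketch, where ByNameDemo, fallback, and the counter are hypothetical names used only for illustration:

```scala
object ByNameDemo {
  var fallbackEvaluations = 0

  // Hypothetical stand-in for an expensive fallback computation.
  def fallback(): Double = {
    fallbackEvaluations += 1
    0.001
  }

  val hit: Option[Double]  = Some(0.5)
  val miss: Option[Double] = None

  val fromHit: Double = hit.getOrElse(fallback())   // fallback() is not evaluated here
  val afterHit: Int = fallbackEvaluations
  val fromMiss: Double = miss.getOrElse(fallback()) // fallback() is evaluated once here
  val afterMiss: Int = fallbackEvaluations
}
```

So in the rewritten lookup, the second hash table is only consulted when the first lookup misses, just as in the pattern-matching version.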

So, here's the refactored code, though I don't know enough about the domain to give good names.

def knownWordsFrequency = counts get word map computeKnownFrequency
def computeKnownFrequency = 
  (_: Int).toDouble / total_tokens * (1.0 - unseen_mass)

def probableWordsFrequency = (
  WordDist.overall_word_probs 
  get word 
  map computeProbableFrequency
)
def computeProbableFrequency = unseen_mass * (_: Double) / overall_unseen_mass

def unknownFrequency = (unseen_mass * WordDist.globally_unseen_word_prob
  / WordDist.num_unseen_word_types)

def estimatedWordsFrequency = probableWordsFrequency getOrElse unknownFrequency

knownWordsFrequency getOrElse estimatedWordsFrequency
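
To check that the combinator style computes the same thing as the pattern match from the question, both can be put side by side in a self-contained form. Everything concrete here is an assumption for the sake of a runnable sketch: the numeric values, the sample words, and the wrapper object WordProb are invented, and WordDist is reduced to the three members the snippets use.

```scala
object WordDist {
  // Hypothetical values; the real ones would come from the corpus.
  val overall_word_probs: Map[String, Double] = Map("the" -> 0.05, "cat" -> 0.01)
  val globally_unseen_word_prob = 0.001
  val num_unseen_word_types = 100
}

object WordProb {
  // Hypothetical per-document state.
  val counts: Map[String, Int] = Map("the" -> 3)
  val total_tokens = 10
  val unseen_mass = 0.2
  val overall_unseen_mass = 0.5

  // The pattern-matching version from the question.
  def viaMatch(word: String): Double =
    counts.get(word) match {
      case Some(wordcount) => wordcount.toDouble / total_tokens * (1.0 - unseen_mass)
      case None =>
        WordDist.overall_word_probs.get(word) match {
          case Some(owprob) => unseen_mass * owprob / overall_unseen_mass
          case None =>
            unseen_mass * WordDist.globally_unseen_word_prob / WordDist.num_unseen_word_types
        }
    }

  // The map/getOrElse version from this answer.
  def viaCombinators(word: String): Double =
    counts.get(word)
      .map(_.toDouble / total_tokens * (1.0 - unseen_mass))
      .getOrElse(
        WordDist.overall_word_probs.get(word)
          .map(unseen_mass * _ / overall_unseen_mass)
          .getOrElse(
            unseen_mass * WordDist.globally_unseen_word_prob / WordDist.num_unseen_word_types))
}
```

With these sample tables, "the" is found in counts, "cat" only in overall_word_probs, and any other word falls through to the globally unseen case, so the three branches are all exercised. Which version is faster is a separate question that only a measurement on the real data can answer.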
