Scala: groupBy (identity) of List elements

Posted on 2024-10-03 10:08:54

I am developing an application that builds pairs of words from (tokenised) text and computes the number of times each pair occurs (it is fine if same-word pairs occur multiple times, as this will be evened out later in the algorithm).

When I use

elements groupBy()

I want to group by the elements' content itself, so I wrote the following:

def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where key is a pair of words and the 
 *  value is number of
 * times this pair
 * occurs in the passed array
 */
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))


  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}

Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:

grouppedPairs = table groupBy (x => x)

This produces what I want. However, I still feel that I am clearly missing something and that there should be an easier way of doing it. Any ideas at all, dear all?

Also, if you could help me improve the pair extraction part, that would help a lot too. It looks very imperative, C++-ish right now. Many thanks in advance!

壹場煙雨 2024-10-10 10:08:54

I'd suggest this:

def producePairs(words: Array[String]): Map[(String,String), Double] = {
    val table = for(w1 <- words; w2 <- words) yield (w1,w2)
    val grouppedPairs = table.groupBy(identity)
    val size = grouppedPairs.size.toDouble
    grouppedPairs.mapValues(_.length / size)
}

The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
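One caveat for newer compilers: on Scala 2.13 and later, mapValues is deprecated and returns a lazy MapView rather than a Map, so the snippet above may not type-check against its declared return type there. A minimal sketch of the same idea that builds a strict Map with plain map instead (the name producePairsStrict is made up for illustration and is not part of the answer):

```scala
// Sketch only: same logic as the answer above, but using `map` so the result
// is a strict Map on both older and newer Scala versions.
// `producePairsStrict` is a hypothetical name, not from the original answer.
def producePairsStrict(words: Array[String]): Map[(String, String), Double] = {
  // Cartesian product of the words with themselves.
  val table = for (w1 <- words; w2 <- words) yield (w1, w2)
  // Group equal pairs together; `identity` is the predefined x => x.
  val groupedPairs = table.groupBy(identity)
  // As in the question, normalise by the number of distinct pairs.
  val size = groupedPairs.size.toDouble
  groupedPairs.map { case (pair, occurrences) => pair -> (occurrences.length / size) }
}
```

For example, producePairsStrict(Array("a", "b", "a")) builds nine pairs, four of them distinct, and the pair ("a", "a") occurs four times, so its value is 4 / 4.0.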

A君 2024-10-10 10:08:54

You are creating a list of pairs of all words against all words by iterating over words twice, whereas I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs   = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}

Another approach would be to fold over the list of pairs, summing the counts as you go. I am not sure that this is more efficient, though:

def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
     m + (p -> (m.getOrElse(p, 0) + 1))
  }
}

I see you are returning a relative number (Double). For simplicity I have just counted the occurrences, so you still need to do the final division. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum to 1.0.
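The final division can be sketched as follows. This is only an illustration under two assumptions: the input is taken as a Seq[String] rather than an Array, and the name adjacentPairFrequencies is invented here. It also guards against inputs with fewer than two words, where sliding(2) produces a single short window and arr(1) would fail.

```scala
// Sketch: relative frequency of each adjacent word pair.
// `adjacentPairFrequencies` is a hypothetical name used for illustration.
def adjacentPairFrequencies(words: Seq[String]): Map[(String, String), Double] = {
  // Keep only full two-element windows; a one-word input yields a short
  // window that the pattern below simply skips.
  val pairs = words.sliding(2).collect { case Seq(a, b) => (a, b) }.toList
  // Total number of pairs: words.size - 1 when there are at least two words.
  val total = pairs.size.toDouble
  pairs
    .groupBy(identity)
    .map { case (pair, occurrences) => pair -> (occurrences.size / total) }
}
```

With Seq("a", "b", "a", "b") there are three adjacent pairs, two of which are ("a", "b"), so the frequencies are 2/3 and 1/3 and add up to 1.0.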

空‖城人不在 2024-10-10 10:08:54

An alternative approach that is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):

def producePairs[T <% Traversable[String]](words: T): Map[(String,String), Double] = {
  val counts = words.groupBy(identity).map{case (w, ws) => (w -> ws.size)}
  val size = (counts.size * counts.size).toDouble
  for(w1 <- counts; w2 <- counts) yield {
      ((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
  }
}
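The view bound (<%) and Traversable used above are deprecated in recent Scala versions. A rough equivalent that accepts a plain Iterable[String] might look like this; the name producePairsFromCounts is hypothetical, and the normalisation by the number of distinct pairs simply mirrors the question's code:

```scala
// Sketch: count each distinct word once, then combine the counts.
// `producePairsFromCounts` is a hypothetical name used for illustration.
def producePairsFromCounts(words: Iterable[String]): Map[(String, String), Double] = {
  // Occurrences of every distinct word.
  val counts: Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => w -> ws.size }
  // Number of distinct pairs, matching grouppedPairs.size in the question.
  val size = (counts.size * counts.size).toDouble
  // The pair (w1, w2) occurs count(w1) * count(w2) times in the full product.
  for ((w1, c1) <- counts; (w2, c2) <- counts)
    yield (w1, w2) -> ((c1 * c2) / size)
}
```

It relies on the fact that, in the full all-against-all product, the pair (w1, w2) appears exactly count(w1) * count(w2) times, so the pairs never have to be materialised one by one.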
