Scala: groupBy (identity) of List elements
I am developing an application that builds pairs of words from (tokenised) text and counts the number of times each pair occurs (it is fine if same-word pairs occur multiple times, since that gets evened out later in the algorithm).
When I use
elements groupBy()
I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where key is a pair of words and the
 * value is number of times this pair occurs in the passed array
 */
def producePairs(words: Array[String]): Map[(String,String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))
  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:
grouppedPairs = table groupBy (x => x)
This way it produces what I want. However, I still feel that I am clearly missing something and that there should be an easier way of doing it. Any ideas at all, dear all?
Also, if you could help me improve the pair extraction part, that would help a lot too; it looks very imperative and C++-ish right now. Many thanks in advance!
Comments (3)
I'd suggest this:
The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
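The code that accompanied this suggestion is not preserved here, so the following is only a minimal sketch of what a for comprehension combined with identity might look like. The object name AllPairs is hypothetical; the normalisation (dividing by the number of distinct pairs) is copied from the question, not a recommendation.

```scala
object AllPairs {
  /** Relative count of every ordered pair (w1, w2) drawn from `words`. */
  def producePairs(words: Array[String]): Map[(String, String), Double] = {
    val ws = words.toList
    // The for comprehension replaces the nested foreach and the mutable list.
    val pairs = for (w1 <- ws; w2 <- ws) yield (w1, w2)
    // identity is the predefined, generic equivalent of self / (x => x).
    val grouped = pairs.groupBy(identity)
    // Same normalisation as in the question: number of distinct pairs.
    val size = grouped.size.toDouble
    grouped.map { case (pair, occurrences) => pair -> occurrences.length / size }
  }
}
```

A plain map over the grouped result is used instead of mapValues, since mapValues behaves differently across Scala versions.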
You are creating a list of pairs of all words against all words by iterating over words twice, whereas I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.
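As a rough illustration of the sliding idea (the original snippet is not preserved, so the names NeighbourPairs, neighbours and countNeighbours are hypothetical), neighbouring pairs could be extracted and counted like this:

```scala
object NeighbourPairs {
  /** Adjacent pairs of `words`, in order of appearance. */
  def neighbours(words: Array[String]): List[(String, String)] =
    // sliding(2) yields windows of two consecutive words; the pattern match
    // skips the single short window produced for a one-word input.
    words.sliding(2).collect { case Array(a, b) => (a, b) }.toList

  /** Number of occurrences of each adjacent pair. */
  def countNeighbours(words: Array[String]): Map[(String, String), Int] =
    neighbours(words).groupBy(identity).map { case (pair, occ) => pair -> occ.size }
}
```

For an input of n words this produces n - 1 pairs instead of n * n.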
Another approach would be to fold the list of pairs by summing them up. I am not sure, though, that this is more efficient:
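A fold over the pairs might look roughly like the sketch below; FoldCounts and countPairs are hypothetical names, and this is one possible reading of "summing them up", not necessarily the code the answer originally showed.

```scala
object FoldCounts {
  /** Counts occurrences of each pair in a single pass, without groupBy. */
  def countPairs(pairs: Seq[(String, String)]): Map[(String, String), Int] =
    pairs.foldLeft(Map.empty[(String, String), Int]) { (counts, pair) =>
      // Increment the running count for this pair, starting from 0.
      counts.updated(pair, counts.getOrElse(pair, 0) + 1)
    }
}
```

Unlike groupBy, this never builds the intermediate lists of equal pairs; whether that is faster in practice would need measuring.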
I see you are returning a relative number (Double). For simplicity I have just counted the occurrences, so you need to do the final division yourself. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum to 1.0.
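The final division could be sketched as follows, assuming the counts come from neighbouring pairs so that their sum equals words.size - 1. RelativeFrequencies and relative are hypothetical names.

```scala
object RelativeFrequencies {
  /** Converts absolute pair counts into frequencies that sum to 1.0. */
  def relative(counts: Map[(String, String), Int]): Map[(String, String), Double] = {
    // Total number of pairs; for neighbouring pairs this is words.size - 1.
    val total = counts.values.sum.toDouble
    counts.map { case (pair, n) => pair -> n / total }
  }
}
```

Dividing by the number of unique pairs instead, as the question does, generally gives values that do not sum to 1.0.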
An alternative approach, which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
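The code for this answer is not preserved, so what follows is a guess at one way to reach that complexity. It assumes the all-against-all pairing from the question: if w1 occurs c1 times and w2 occurs c2 times, the pair (w1, w2) occurs c1 * c2 times, so it is enough to count single words first. UniqueWordPairs and pairCounts are hypothetical names.

```scala
object UniqueWordPairs {
  /** Counts of all ordered pairs, computed from per-word counts only. */
  def pairCounts(words: Array[String]): Map[(String, String), Int] = {
    // One pass over the words to count each distinct word.
    val wordCounts = words.groupBy(identity).toList.map { case (w, occ) => (w, occ.length) }
    // The nested loop now runs over distinct words only.
    val pairs = for ((w1, c1) <- wordCounts; (w2, c2) <- wordCounts)
      yield ((w1, w2), c1 * c2)
    pairs.toMap
  }
}
```

This does not apply to neighbouring pairs, where the count of a pair depends on word order and not just on word frequencies.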