组合词频数据列表

发布于 2024-12-11 17:40:44 字数 1167 浏览 3 评论 0原文

这似乎应该是一个显而易见的问题，但列表上的教程和文档尚未发布。其中许多问题源于我的文本文件的巨大大小（数百 MB）以及我试图将它们归结为我的系统可管理的内容。因此，我正在分段进行工作，现在正在尝试合并结果。

我有多个词频列表（大约 40 个）。这些列表可以通过 Import[ ] 获取，也可以作为在 Mathematica 中生成的变量。每个列表如下所示，并且是使用 Tally[ ] 和 Sort[ ] 命令生成的：

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347}, {"是", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"屠宰场", 1}, {"减少", 1}, {"减少", 1}, {"减少", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

以下是第二个文件的示例：

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"到", 12726}, {"a", 12635}, {"在", 11141}, {"la", 10739}, {“et”，9016}，{“les”，8675}，{“le”，7748}，<<101032>>， {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"放弃", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

我想将它们组合起来，以便频率数据聚合：即，如果第二个文件有 30,419 次出现 'the' 并且是加入第一个文件后，它应该返回 72,635 次出现（当我浏览整个集合时，依此类推）。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

嗼ふ静 2024-12-18 17:40:44

听起来您需要 GatherBy。

假设您的两个列表分别命名为 data1 和 data2，然后使用

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

这可以轻松推广到任意数量的列表，而不仅仅是两个。

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2, then use

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

This easily generalizes to any number of lists, not just two.

回复收藏 0 原文

好听的两个字的网名 2024-12-18 17:40:44

尝试使用哈希表，就像这样。首先进行设置：

ClearAll[freq];
freq[_] = 0;

现在例如 freq["safas"] 返回 0。接下来，如果列表定义为

lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 
    16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 
    14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 
    5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 
    1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 
    1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 
    1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 
    16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 
    11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 
    7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 
    1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 
    1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};

您可以运行此

Scan[(freq[#[[1]]] += #[[2]]) &, lst1]

命令，然后运行下一个列表

freq["the"]
(*
42216
*)

下一个列表，

Scan[(freq[#[[1]]] += #[[2]]) &, lst2]

，然后运行

freq["the"]
72635

然后仍然运行

freq["safas"]
(*
0
*)

Try using a hash table, like this. First set things up:

ClearAll[freq];
freq[_] = 0;

Now eg freq["safas"] returns 0. Next, if the lists are defined as

lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 
    16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 
    14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 
    5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 
    1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 
    1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 
    1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 
    16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 
    11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 
    7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 
    1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 
    1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};

you may run this

Scan[(freq[#[[1]]] += #[[2]]) &, lst1]

after which eg

freq["the"]
(*
42216
*)

and then the next list

Scan[(freq[#[[1]]] += #[[2]]) &, lst2]

after which eg

freq["the"]
72635

while still

freq["safas"]
(*
0
*)

回复收藏 0 原文

凉薄对峙 2024-12-18 17:40:44

这是一个直接的 Sow/Reap 函数：

Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]

这是 acl 方法的简洁形式：

Module[{c},
  c[_] = 0;

  c[#] += #2 & @@@ data1~Join~data2;

  {#[[1, 1]], #2} & @@@ Most@DownValues@c
]

这似乎比我系统上的 Szabolcs 代码快一点：

data1 ~Join~ data2 ~GatherBy~ First /.
  {{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}

Here is a direct Sow/Reap function:

Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]

Here is a concise form of acl's method:

Module[{c},
  c[_] = 0;

  c[#] += #2 & @@@ data1~Join~data2;

  {#[[1, 1]], #2} & @@@ Most@DownValues@c
]

This appears to be a bit faster than Szabolcs code on my system:

data1 ~Join~ data2 ~GatherBy~ First /.
  {{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}

回复收藏 0 原文

寄风 2024-12-18 17:40:44

有句老话说：“如果你只有一把锤子，那么一切都会变成钉子。”所以，这是我的锤子：SelectEquivalents 。

使用 SelectEquivalents 可以更快地完成此操作：

SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]

按顺序，第一个参数显然只是连接列表，第二个参数是它们的分组依据（在本例中是第一个元素），第三个参数去掉字符串，只留下计数，第四个参数将其与字符串一起放回原处，作为 #1 ，将列表中的计数作为 #2 。

There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents.

This can be done a little quicker using SelectEquivalents:

SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]

In order, the first param is obviously just the joined lists, the second one is what they're grouped by (in this case the first element), the third param strips off the string leaving just the count, and the fourth param puts it back together with the string as #1 and the counts in a list as #2.

回复收藏 0 原文

￡噩梦荏苒 2024-12-18 17:40:44

尝试ReplaceRepeated。

加入列表。然后使用

//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}

Try ReplaceRepeated.

Join the lists. Then use

//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}

回复收藏 0 原文

~没有更多了~

关于作者

傲世九天

暂无简介

文章

26 人气

关注发私信

友情链接

文江博客

组合词频数据列表

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（5）

关于作者

相关话题

热门标签

推荐作者

忆悲凉

hgfg1645

qq_qLPLYi

戏舞

殊姿

﹂绝世的画

友情链接

组合词频数据列表

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（5）

关于作者

相关话题

热门标签

推荐作者

忆悲凉

hgfg1645

qq_qLPLYi

戏舞

殊姿

﹂绝世的画

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。