Combining lists of word frequency data

Posted on 2024-12-11 17:40:44

This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results.

I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands:

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347},
{"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir",
1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1},
{"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1},
{"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

Here is an example of the second file:

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739},
{"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>,
{"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated",
1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1},
{"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).

Comments (5)

嗼ふ静 2024-12-18 17:40:44

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2, then use

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

This easily generalizes to any number of lists, not just two.
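
As a minimal sketch of that generalization, assume the ~40 lists have been collected into one list of lists. The name `allData` is hypothetical, and the two short lists below are stand-ins taken from the sample data in the question:

```mathematica
(* Stand-in data; allData is a hypothetical name for the list of all tally lists *)
allData = {
   {{"the", 42216}, {"of", 24903}, {"was", 6030}},
   {{"the", 30419}, {"of", 16262}, {"et", 9016}}
   };

(* Join @@ allData puts every {word, count} pair into one flat list;
   GatherBy groups the pairs by word; Total sums the counts in each group *)
combined = {#[[1, 1]], Total[#[[All, 2]]]} & /@ 
   GatherBy[Join @@ allData, First];

(* Re-sort by descending count, as in the original lists *)
merged = SortBy[combined, -Last[#] &]
(* {{"the", 72635}, {"of", 41165}, {"et", 9016}, {"was", 6030}} *)
```

`GatherBy` keeps groups in order of first appearance, so the final `SortBy` is only needed if the descending-frequency order matters.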

好听的两个字的网名 2024-12-18 17:40:44

Try using a hash table, like this. First set things up:

ClearAll[freq];
freq[_] = 0;

Now, for example, freq["safas"] returns 0. Next, if the lists are defined as

lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 
    16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 
    14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 
    5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 
    1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 
    1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 
    1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 
    16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 
    11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 
    7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 
    1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 
    1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};

you may run this

Scan[(freq[#[[1]]] += #[[2]]) &, lst1]

after which, for example,

freq["the"]
(*
42216
*)

and then the next list

Scan[(freq[#[[1]]] += #[[2]]) &, lst2]

after which, for example,

freq["the"]
(*
72635
*)

while still

freq["safas"]
(*
0
*)
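
To accumulate all of the lists rather than two, and to turn the totals back into a list of pairs, something like the following sketch might work. The names `allData` and `words` are hypothetical, and the short lists are stand-ins for the real data:

```mathematica
(* Stand-in data; allData is a hypothetical name for the collection of tally lists *)
allData = {
   {{"the", 42216}, {"of", 24903}, {"was", 6030}},
   {{"the", 30419}, {"of", 16262}, {"et", 9016}}
   };

ClearAll[freq];
freq[_] = 0;   (* default count for any word not seen yet *)

(* Add each list's counts to the running totals, one list at a time *)
Do[Scan[(freq[#[[1]]] += #[[2]]) &, lst], {lst, allData}];

(* Collect the distinct words, then read the totals back out of freq *)
words = Union @@ (#[[All, 1]] & /@ allData);
merged = SortBy[{#, freq[#]} & /@ words, -Last[#] &]
(* {{"the", 72635}, {"of", 41165}, {"et", 9016}, {"was", 6030}} *)
```

Because the totals live in `freq`, each list could also be imported inside the loop and discarded afterwards, which may help when the files are too large to hold in memory together.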

凉薄对峙 2024-12-18 17:40:44

Here is a direct Sow/Reap function:

Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]

Here is a concise form of acl's method:

Module[{c},
  c[_] = 0;

  c[#] += #2 & @@@ data1~Join~data2;

  {#[[1, 1]], #2} & @@@ Most@DownValues@c
]

This appears to be a bit faster than Szabolcs's code on my system:

data1 ~Join~ data2 ~GatherBy~ First /.
  {{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}
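
Note that the two replacement rules in the last snippet cover groups of one or two entries, which is what joining two lists produces. With more lists, a word can appear more than twice. A possible generalization, offered only as a sketch, uses a repeated pattern. The name `allData` is hypothetical, and the third list contains made-up counts purely for illustration:

```mathematica
(* Stand-in data; the third list holds invented counts for illustration only *)
allData = {
   {{"the", 42216}, {"of", 24903}},
   {{"the", 30419}, {"et", 9016}},
   {{"the", 100}, {"of", 5}}
   };

(* Each group is a list of one or more {word, count} pairs sharing the same word;
   replace every group (level 1) by the word and the total of its counts *)
merged = Replace[
  GatherBy[Join @@ allData, First],
  group : {{word_, _} ..} :> {word, Total[group[[All, 2]]]},
  {1}]
(* {{"the", 72735}, {"of", 24908}, {"et", 9016}} *)
```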

寄风 2024-12-18 17:40:44

There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents.

This can be done a little quicker using SelectEquivalents:

SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]

In order, the first param is obviously just the joined lists, the second one is what they're grouped by (in this case the first element), the third param strips off the string leaving just the count, and the fourth param puts it back together with the string as #1 and the counts in a list as #2.
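
`SelectEquivalents` is not a built-in function, so it has to be defined before the line above will run. The definition below is a hypothetical minimal version, written only to match the four-argument order described in this answer; the actual definition the answer relies on may differ:

```mathematica
(* Hypothetical minimal definition, NOT a built-in and not necessarily
   the one the answer refers to. Arguments: the list, a function giving
   the grouping tag, a function transforming each element, and a function
   combining each tag with the list of transformed elements. *)
ClearAll[SelectEquivalents];
SelectEquivalents[list_List, tag_, transform_, combine_] :=
  Reap[Sow[transform[#], {tag[#]}] & /@ list, _, combine][[2]];

(* Stand-in data taken from the sample lists in the question *)
data1 = {{"the", 42216}, {"of", 24903}, {"was", 6030}};
data2 = {{"the", 30419}, {"of", 16262}, {"et", 9016}};

merged = SelectEquivalents[data1~Join~data2, #[[1]] &, #[[2]] &, {#1, Total[#2]} &]
(* {{"the", 72635}, {"of", 41165}, {"was", 6030}, {"et", 9016}} *)
```

Under this definition the function is a thin wrapper around `Sow` and `Reap`, so the results come back in order of each word's first appearance.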

£噩梦荏苒 2024-12-18 17:40:44

Try ReplaceRepeated.

Join the lists. Then use

//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}
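
A small worked example may make the rule easier to follow. The lists below are made-up stand-ins, and the sketch uses `:>` instead of `->` so that `c1 + c2` is evaluated only after a match is found:

```mathematica
(* Made-up stand-in data for illustration *)
data1 = {{"a", 1}, {"b", 2}};
data2 = {{"a", 3}, {"c", 4}};

(* Repeatedly find two entries with the same word anywhere in the list,
   drop both, and append a single entry holding the summed count *)
merged = Join[data1, data2] //. 
   {f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} :> {f1, f2, f3, {a, c1 + c2}}
(* {{"b", 2}, {"c", 4}, {"a", 4}} *)
```

The merged entry moves to the end of the list, so the result usually needs re-sorting. Patterns with several `___` sequences can also become slow on long lists, so with tens of thousands of entries it is probably worth timing this on a sample first.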