Combining lists of word frequency data
This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results.
I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands:
{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347},
{"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir",
1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1},
{"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1},
{"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}
Here is an example of the second file:
{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739},
{"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>,
{"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated",
1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1},
{"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}
I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).
It sounds like you need GatherBy. Suppose your two lists are named data1 and data2: join them, gather the pairs by word, and total the counts within each group. This easily generalizes to any number of lists, not just two.
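A minimal sketch of a GatherBy approach, not necessarily the exact code the answer had in mind. Here data1 and data2 are short stand-ins built from the counts shown in the question:

```mathematica
(* short sample lists in the same {word, count} format as the question *)
data1 = {{"the", 42216}, {"of", 24903}, {"aaa", 1}};
data2 = {{"the", 30419}, {"of", 16262}, {"et", 9016}};

(* group the pairs that share a word, then total the counts in each group *)
combined = {#[[1, 1]], Total[#[[All, 2]]]} & /@
   GatherBy[Join[data1, data2], First];

(* restore the descending-frequency order used in the question *)
SortBy[combined, -Last[#] &]
(* {{"the", 72635}, {"of", 41165}, {"et", 9016}, {"aaa", 1}} *)
```

For all 40 lists, pass every list to Join, or Join @@ listOfLists if they are held in one variable (listOfLists is a hypothetical name).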
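Try using a hash table, like this. First set things up so that looking up a word that has not been seen yet, e.g. freq["safas"], returns 0. Next, with the lists defined, add the counts from the first list into the table, and then do the same for the next list. After each pass the stored counts reflect everything added so far, while unseen words still return 0.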
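A minimal sketch of the hash-table idea using definitions on a symbol, assuming the table is called freq as in the answer. The sample lists data1 and data2 are short stand-ins taken from the question's counts, and the update line is one possible way to write the accumulation:

```mathematica
ClearAll[freq];
freq[_] = 0;  (* default: any word not yet stored has count 0 *)

data1 = {{"the", 42216}, {"of", 24903}, {"aaa", 1}};
data2 = {{"the", 30419}, {"of", 16262}, {"et", 9016}};

(* add each count from the first list to the stored value for its word *)
(freq[#1] += #2) & @@@ data1;
freq["the"]    (* 42216 *)

(* the same update for the next list accumulates on top *)
(freq[#1] += #2) & @@@ data2;
freq["the"]    (* 72635 *)
freq["safas"]  (* 0, unseen words still fall through to the default *)
```

Because only one list needs to be in memory at a time, this style suits the case where each list is loaded with Import, processed, and then discarded.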
Here is a direct Sow/Reap approach: sow each count tagged with its word, then reap and total the counts per tag. acl's method can also be written more concisely. This appears to be a bit faster than Szabolcs's code on my system.
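A minimal sketch of a Sow/Reap function for this task. The name combine is hypothetical, the sample lists are stand-ins from the question, and no timing claim is made for this version:

```mathematica
(* combine is a hypothetical name; it accepts any number of {word, count} lists *)
combine[lists__List] :=
  Reap[
    (Sow[#2, #1] &) @@@ Join[lists],  (* sow each count, tagged with its word *)
    _,                                (* collect every tag *)
    {#1, Total[#2]} &                 (* per tag: {word, summed counts} *)
  ][[2]]

data1 = {{"the", 42216}, {"of", 24903}, {"aaa", 1}};
data2 = {{"the", 30419}, {"of", 16262}, {"et", 9016}};

combine[data1, data2]
(* {{"the", 72635}, {"of", 41165}, {"aaa", 1}, {"et", 9016}} *)
```

The result keeps words in order of first appearance, so a final sort by count is still needed to match the layout in the question.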
There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents, which can do this a little quicker. In order, the first parameter is just the joined lists, the second is what they're grouped by (in this case the first element), the third strips off the string leaving just the count, and the fourth puts it back together with the string as #1 and the counts in a list as #2.
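SelectEquivalents is not a built-in function and its definition is not shown here. As a hedged alternative, the same group-by-word, keep-the-count, recombine pattern can be sketched with the built-in GroupBy, assuming Mathematica 10 or later. This swaps in a different function and is not the answer's own code:

```mathematica
data1 = {{"the", 42216}, {"of", 24903}, {"aaa", 1}};
data2 = {{"the", 30419}, {"of", 16262}, {"et", 9016}};

(* group by the word (First), keep only the count (Last), total each group *)
assoc = GroupBy[Join[data1, data2], First -> Last, Total];

(* convert the resulting association back to {word, count} pairs *)
List @@@ Normal[assoc]
(* {{"the", 72635}, {"of", 41165}, {"aaa", 1}, {"et", 9016}} *)
```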
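Try ReplaceRepeated. Join the lists, then repeatedly apply a rule that merges two entries for the same word into a single entry with the summed count.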
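A minimal sketch of what such a ReplaceRepeated rule could look like. The rule below is an assumption, since the answer's own code is not shown, and the sample lists are stand-ins from the question:

```mathematica
data1 = {{"the", 42216}, {"of", 24903}, {"aaa", 1}};
data2 = {{"the", 30419}, {"of", 16262}, {"et", 9016}};

joined = Join[data1, data2];

(* find two entries with the same word w anywhere in the list and merge them;
   //. reapplies the rule until no duplicate words remain *)
joined //. {a___, {w_, m_}, b___, {w_, n_}, c___} :> {a, {w, m + n}, b, c}
(* {{"the", 72635}, {"of", 41165}, {"aaa", 1}, {"et", 9016}} *)
```

Each application rescans the list with sequence patterns, so this is likely to be very slow on lists with tens of thousands of entries; it is best suited to small inputs.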