Finding the power set (all unique combinations) of a vector of strings

Posted 2024-11-28 06:53:38

I am trying to find all of the unique groupings of a vector of 39 items. Here is the code I have:

x <- c("Dominion","progress","scarolina","tampa","tva","TminKTYS",
       "TmaxKTYS","TminKBNA","TmaxKBNA","TminKMEM","TmaxKMEM",
       "TminKCRW","TmaxKCRW","TminKROA","TmaxKROA","TminKCLT",
       "TmaxKCLT","TminKCHS","TmaxKCHS","TminKATL","TmaxKATL",
       "TminKCMH","TmaxKCMH","TminKJAX","TmaxKJAX","TminKLTH",
       "TmaxKLTH","TminKMCO","TmaxKMCO","TminKMIA","TmaxKMIA",
       "TminKPTA","TmaxKTPA","TminKPNS","TmaxKPNS","TminKLEX",
       "TmaxKLEX","TminKSDF","TmaxKSDF")

# Generate a list with the combinations  
zz <- sapply(seq_along(x), function(y) combn(x,y))
# Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z)))) 

However, this code causes my computer to run out of memory. Is there a better way to do this? I realize I have a large list. Thanks.

Comments (2)

一曲琵琶半遮面シ 2024-12-05 06:53:38

To enumerate all unique subsets, you are in effect generating every binary vector whose length equals the cardinality of the original set of items. With 39 items, that means every binary vector of length 39. Each element of a vector says, yes or no, whether the corresponding item is in that subset.
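The binary-vector view can be sketched in R on a tiny example. This is only an illustration: `subset_from_mask` and `items` are hypothetical names that do not appear in the question, and the sketch uses plain arithmetic rather than R's `bitw*` functions because a 39-bit mask does not fit in a 32-bit integer.

```r
# Hypothetical helper: decode the subset of `items` encoded by the number `mask`.
# Bit i of `mask` (counting from 0) being 1 means items[i + 1] is included.
subset_from_mask <- function(items, mask) {
  place_values <- 2^(seq_along(items) - 1)      # 1, 2, 4, ...
  included <- (mask %/% place_values) %% 2 == 1
  items[included]
}

items <- c("a", "b", "c")
n <- length(items)

# Masks 1 .. 2^n - 1 cover every non-empty subset exactly once.
subsets <- lapply(seq_len(2^n - 1), function(m) subset_from_mask(items, m))
```

Decoding one mask at a time like this lets you visit subsets without holding all of them in memory, although visiting all 2^39 - 1 of them would still take a very long time.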

Since there are 39 items, and each one is either in or not in a given subset, there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, leaves 2^39 - 1 possible subsets.
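As a quick sanity check of that count, the sizes of the matrices that `combn(x, k)` would return can be computed with `choose()` without building anything. The variable names here are illustrative only.

```r
n <- 39

# Number of non-empty subsets, counted two ways.
total_nonempty <- 2^n - 1
by_size <- choose(n, 1:n)      # how many subsets there are of each size k

print(total_nonempty)
print(sum(by_size))            # should agree with total_nonempty

# The single largest combn() result on its own: columns for the worst k.
worst_k <- which.max(by_size)
print(c(k = worst_k, columns = by_size[worst_k]))
```

Even the single worst call in the question's loop would need a character matrix with tens of billions of columns, which is why memory runs out long before the loop finishes.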

That is, as @joran said, about 550 billion vectors (2^39 = 549,755,813,888). Even in the most compact representation, one bit per item with no strings, you would need about 550 billion * 39 bits to hold all of the subsets. I don't think you want to store this: that's about 2.68E12 bytes, or roughly 2.7 TB. If you insist on storing the character strings, you're likely to be in the many tens of terabytes.
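The storage estimate can be reproduced with a few lines of arithmetic. The character-vector figure is a rough sketch under stated assumptions, not a measurement: it assumes 8 bytes per element pointer on a 64-bit build of R, an average subset size of about 19.5 items, and it ignores per-vector headers entirely.

```r
n_items   <- 39
n_subsets <- 2^n_items - 1

# Bit-packed representation: one bit per item per subset.
packed_bytes <- n_subsets * n_items / 8
packed_tb    <- packed_bytes / 1e12     # decimal terabytes, about 2.68

# Assumption: storing each subset as a character vector costs at least
# 8 bytes per element (a pointer), with about n_items / 2 elements on average.
assumed_bytes_per_element <- 8
char_tb <- n_subsets * (n_items / 2) * assumed_bytes_per_element / 1e12

print(packed_tb)
print(char_tb)
```

Under those assumptions the character representation comes out in the high tens of terabytes before any overhead is counted.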

It's certainly feasible to buy a system that can support this, but it would not be very cost-effective.

At a meta-level, it is very likely, as @JD said, that this is not the path you really need to take. I recommend posting a new question; perhaps it can be refined here or on the statistics-related SE site.

安静被遗忘 2024-12-05 06:53:38

You might try using expand.grid. Its help page describes it as follows:

Create a data frame from all combinations of the supplied vectors or
factors. See the description of the return value for precise details
of the way this is done.
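A minimal sketch of how expand.grid could be used for this, on a three-item example rather than the 39-item vector from the question. The names `items`, `grid` and `subsets` are illustrative. Note that expand.grid builds every row in memory, so with 39 items it would run into the same 2^39-row limit.

```r
items <- c("a", "b", "c")

# One logical column per item; each row is one in/out pattern.
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(items)))
names(grid) <- items

# Drop the all-FALSE row, which corresponds to the empty set.
grid <- grid[rowSums(grid) > 0, ]

# Turn each row back into the subset of item names it selects.
subsets <- lapply(seq_len(nrow(grid)), function(i) items[unlist(grid[i, ])])
```

For small inputs this gives one row per non-empty subset, and `subsets` holds the same information as a list of character vectors.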
