Finding unique values from a list

Suppose you have a list of values

x <- list(a=c(1,2,3), b = c(2,3,4), c=c(4,5,6))

I would like to find unique values from all list elements combined. So far, the following code did the trick

unique(unlist(x))

Does anyone know a more efficient way? I have a hefty list with a lot of values and would appreciate any speed-up.

故乡的云 2024-10-03 07:08:43

This solution suggested by Marek is the best answer to the original Q. See below for a discussion of other approaches and why Marek's is the most useful.

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Discussion

A faster solution is to compute unique() on the components of your x first and then do a final unique() on those results. This will only work if the components of the list have the same number of unique values, as they do in both examples below. E.g.:

First your version, then my double unique approach:

> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6

We have to call unique.default because sapply simplifies the result to a matrix here, and unique has a matrix method that keeps one margin fixed (it removes duplicate rows rather than duplicate values); calling the default method instead is fine because a matrix can be treated as a plain vector.
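As a minimal sketch of that distinction, using the x from the question (the results are described in the comments):

x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))
m <- sapply(x, unique)  ## a 3 x 3 matrix, since every component has 3 unique values
unique(m)               ## dispatches to the matrix method: drops duplicate rows, still a matrix
unique.default(m)       ## treats the matrix as a plain vector: 1 2 3 4 5 6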

Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to set the use.names argument of unlist to FALSE, which results in a faster solution than the double-unique version above. For the simple x of Roman's post we get

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Marek's solution will work even when the number of unique elements differs between components.
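For instance, with a made-up list whose components have 2, 2 and 3 unique values respectively (y is purely illustrative, not from the original question):

y <- list(a = c(1, 2, 2), b = c(2, 3), c = c(4, 5, 6, 6))  ## uneven numbers of unique values
unique(unlist(y, use.names = FALSE))                       ## 1 2 3 4 5 6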

Here is a larger example with some timings of all three methods:

## Create a large list (1000 components of length 1000 each)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE), 
                                ncol = 1000)))

Here are the results for the three approaches using DF:

> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
   user  system elapsed 
  12.884   0.077  12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
   user  system elapsed 
  0.648   0.000   0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
   user  system elapsed 
  0.510   0.000   0.512

Which shows that the double-unique approach (applying unique() to the individual components and then unique() to those smaller sets of unique values) is a lot quicker than Roman's original, but this speed-up is purely due to the names on the list DF. If we tell unlist not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly, and is quicker than the work-around, it is the preferred solution.
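A quick way to see where that overhead comes from (a sketch with the small x from the question): unlist() constructs a name for every element unless told not to.

x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))
names(unlist(x))                     ## "a1" "a2" "a3" "b1" ... built element by element
names(unlist(x, use.names = FALSE))  ## NULL: no names are built, hence the speed-up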

The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, the double unique solution will fail.
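As a sketch of that failure mode, again with a made-up list whose components have differing numbers of unique values:

y <- list(a = c(1, 2, 2), b = c(2, 3), c = c(4, 5, 6, 6))
str(sapply(y, unique))             ## per-component results have lengths 2, 2 and 3, so this stays a list
unique.default(sapply(y, unique))  ## a list of those vectors, not the flat 1 2 3 4 5 6 we want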
