How can I remove all duplicates so that none are left in a data frame?

Posted on 2025-01-16 16:27:16

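There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.

I have a data frame with 10 rows and 50 columns, where some of the rows are completely identical. If I use unique on it, I get one row per (let's say) "type", but what I actually want is to keep only those rows that appear exactly once. Does anyone know how I can achieve this?

I could look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows), where this gets a bit tricky.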

内心荒芜 2025-01-23 16:27:16

This will extract the rows which appear only once (assuming your data frame is named df):

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]

How it works: the function duplicated flags a row if an identical row has already appeared, scanning from the first row down, so the first occurrence is never flagged. With the argument fromLast = TRUE, the scan starts at the last row instead, so the last occurrence is never flagged.

Both boolean results are combined with | (logical 'or') into a new vector that flags every row appearing more than once. Negating it with ! gives a boolean vector that marks the rows appearing only once.
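
To make the two scans concrete, here is a minimal sketch on a small made-up data frame. The column names and values are illustrative assumptions, not the asker's data; each intermediate logical vector is stored so it can be inspected.

```r
# Toy data frame (hypothetical values): rows 1, 2 and 5 are identical.
df <- data.frame(
  a = c(1, 1, 2, 3, 1),
  b = c("x", "x", "y", "z", "x")
)

# TRUE for every repeat of a row already seen, scanning top to bottom.
dup_forward <- duplicated(df)

# TRUE for every repeat of a row already seen, scanning bottom to top.
dup_backward <- duplicated(df, fromLast = TRUE)

# A row occurs more than once if either scan flags it.
occurs_many <- dup_forward | dup_backward

# Keep only the rows that occur exactly once.
singles <- df[!occurs_many, ]
```

Here dup_forward is FALSE TRUE FALSE FALSE TRUE and dup_backward is TRUE TRUE FALSE FALSE FALSE, so only rows 3 and 4 remain in singles, with their original row names.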

请别遗忘我 2025-01-23 16:27:16

A possibility involving dplyr could be:

df %>%
 group_by_all() %>%
 filter(n() == 1)

Or:

df %>%
 group_by_all() %>%
 filter(!any(row_number() > 1))

Since dplyr 1.0.0, which superseded scoped verbs such as group_by_all() in favor of across(), the preferable way would be:

df %>%
    group_by(across(everything())) %>%
    filter(n() == 1)
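
As a small self-contained sketch of the grouping approach, the following assumes dplyr 1.0.0 or later is installed and uses a made-up data frame (the values are illustrative only). The trailing ungroup() is an addition not present in the answer; without it the result stays grouped by every column, which can surprise later steps.

```r
library(dplyr)

# Toy data frame (hypothetical values): rows 1 and 2 are identical.
df <- data.frame(
  a = c(1, 1, 2, 3),
  b = c(1, 1, 2, 3),
  d = c(1, 1, 2, 4)
)

singles <- df %>%
  group_by(across(everything())) %>%  # one group per distinct row
  filter(n() == 1) %>%                # keep groups containing a single row
  ungroup()                           # drop the grouping from the result
```

Note that the result is a tibble rather than a plain data frame, and the original row names are not carried over.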

七分※倦醒 2025-01-23 16:27:16

An approach using vctrs::vec_duplicate_detect

Original example

library(vctrs)

vec <- c(1, 2, 2, 3, 4, 3, 2)

vec[!vec_duplicate_detect(vec)]
[1] 1 4

On a data.frame

df
  a b d
1 1 1 1
2 1 1 1
3 2 2 2
4 3 3 4

df[!vec_duplicate_detect(df),]
  a b d
3 2 2 2
4 3 3 4

Benchmark

length(vec)
[1] 175120

library(microbenchmark)

microbenchmark(
  base = {vec[!(duplicated(vec) | duplicated(vec, fromLast = TRUE))]},
  vctrs = {vec[!vec_duplicate_detect(vec)]})
Unit: milliseconds
  expr       min        lq     mean   median       uq      max neval
  base 12.241369 14.408094 16.70000 16.94082 17.26830 26.69546   100
 vctrs  7.526593  9.701161 11.43675 10.80420 11.64395 19.80494   100
