How can I remove all duplicates so that none are left in a data frame?
There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.
I have this data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per - let's say - "type", but what I actually want is to get only those rows which only appear once. Does anyone know how I can achieve this?
I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.
Rows which appear only once can be extracted with `df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]` (assuming your data frame is named `df`).

How it works: The function `duplicated` tests whether a row appears at least for the second time, starting at the first row. If the argument `fromLast = TRUE` is used, the function starts at the last row. Both boolean results are combined with `|` (logical 'or') into a new vector which indicates all rows appearing more than once. The result of this is negated using `!`, thereby creating a boolean vector indicating rows appearing only once.
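As a minimal sketch of the `duplicated` approach described above, the following runs it step by step on a small made-up data frame. The column names `id` and `type` and their values are illustrative assumptions, not data from the question.

```r
# Hypothetical sample data: rows 1-2 are identical, and rows 4-6 are identical
df <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  type = c("a", "a", "b", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# TRUE for the second and later occurrences, scanning from the first row down
dup_forward <- duplicated(df)

# TRUE for every occurrence except the last one, scanning from the last row up
dup_backward <- duplicated(df, fromLast = TRUE)

# A row is kept only if neither scan flags it, i.e. it occurs exactly once
singles <- df[!(dup_forward | dup_backward), , drop = FALSE]
print(singles)
```

With this sample data only the row with `id` 2 and `type` "b" should remain, since every other row has at least one exact copy.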
Another possibility involves `dplyr`. There is more than one way to write it, and a different form is preferable since `dplyr` 1.0.0.
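One way such a `dplyr` solution might look is sketched below. It assumes the idea is to group by every column and keep only the groups that contain a single row; this is an illustration under that assumption, not necessarily the exact code of the answer. The sample data and the names `df` and `singles` are made up.

```r
library(dplyr)

# Hypothetical sample data: rows 1-2 are identical, and rows 4-6 are identical
df <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  type = c("a", "a", "b", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# dplyr >= 1.0.0: group by all columns, so each group is one distinct row pattern,
# then keep only the groups of size one
singles <- df %>%
  group_by(across(everything())) %>%
  filter(n() == 1) %>%
  ungroup()

# With older dplyr versions, group_by_all() can stand in for
# group_by(across(everything())); it is superseded in dplyr 1.0.0 and later.
print(singles)
```

The final `ungroup()` is a design choice so that later operations on `singles` are not silently performed per group.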
Another approach uses `vctrs::vec_duplicate_detect`. It was applied to the original example and to a data.frame, and benchmarked.
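A minimal sketch of how `vctrs::vec_duplicate_detect` could be used for this task, on a plain vector and on a data frame. The sample values and the names `x`, `df`, `x_singles` and `df_singles` are illustrative assumptions, and no timings are claimed here.

```r
library(vctrs)

# Hypothetical vector: 1 and 3 occur more than once, 2 occurs once
x <- c(1, 1, 2, 3, 3, 3)

# vec_duplicate_detect() flags every occurrence of a duplicated value,
# not just the second and later ones, so a single negation is enough
x_singles <- x[!vec_duplicate_detect(x)]

# Hypothetical data frame: vctrs treats each row as one element
df <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  type = c("a", "a", "b", "c", "c", "c"),
  stringsAsFactors = FALSE
)
df_singles <- df[!vec_duplicate_detect(df), , drop = FALSE]

print(x_singles)
print(df_singles)
```

Because all copies are flagged in one pass, this avoids the two `duplicated` calls needed in base R.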