Subsetting a dataset based on occurrences of a value in a column in R
I have a dataset (ds) with two columns. Each number in "match" appears either once or twice. "status" is a binary variable. Some values form pairs: for example, the value 12 in match appears twice, once with status 1 and once with status 0. However, there are also observations whose match value has no partner; in this dataset those are 3, 8, 33 and 17.
match status
12 1
3 1
5 0
8 1
33 0
5 1
12 0
17 0
What I want to do is create a new dataset that contains only the paired observations (that is, rows whose match value appears twice). In my example, it would be
match status
12 1
12 0
5 0
5 1
The status variable in the final dataset would be split 50/50, because each value in match (for example 12) has one observation where status = 0 and one where status = 1.
The actual dataset I'm working with has over 50k observations, so I cannot just search and filter by each number. What I tried is:
numbers <- table(ds$match)
numbers <- as.data.frame(numbers)
numbers <- numbers[numbers$Freq == 2,]
numbers <- numbers$Var1
ds$keep <- ifelse(numbers %in% ds$match, 1, 0)
Here I get the error "replacement has 23005 rows, data has 39021". If I could get around this error, I think I could just run
ds <- filter(ds, ds$keep == 1)
to get the dataset that I want. This was my most promising approach. I tried a few other things, but it always came down to the status variable not being 50/50, so I couldn't manage to exclude all observations without a pair. Does someone have an idea how I could fix my code, or is there a quicker or smoother solution? Thanks in advance for any help!

You can also do something like this:
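One way this could look, as a minimal base R sketch built on the example data from the question (the names `counts`, `paired`, `pairs_only` and `pairs_alt` are illustrative assumptions). The error in the question comes from `numbers %in% ds$match`, which returns one value per element of `numbers` rather than one per row of `ds`; putting `ds$match` on the left of `%in%` gives a logical vector with one entry per row, which can then be used to subset directly.

```r
# Example data from the question
ds <- data.frame(
  match  = c(12, 3, 5, 8, 33, 5, 12, 17),
  status = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# Count how often each match value occurs
counts <- table(ds$match)

# Keep the values that occur exactly twice
# (the names of a table are character strings)
paired <- names(counts)[counts == 2]

# ds$match goes on the LEFT of %in%, so the result has one
# TRUE/FALSE per row of ds and can index the rows directly
pairs_only <- ds[ds$match %in% paired, ]

# Alternative without the intermediate table: ave() returns,
# for every row, the size of the group that row belongs to
pairs_alt <- ds[ave(ds$match, ds$match, FUN = length) == 2, ]

# Optional: sort so that partners sit next to each other
pairs_only <- pairs_only[order(pairs_only$match), ]
```

Both variants keep only the rows whose match value appears twice, so no separate `keep` column or `filter()` call should be needed. This assumes, as stated in the question, that a pair always consists of one row with status 0 and one with status 1; the code only checks the count, not the status values.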