Remove rows in an R data frame using 2 linked conditions (a combination of values) from another data frame
I have two different datasets in R (MRE below).
One contains one log per module visit (ModuleViews) and the other (PageViews) logs each specific page visit within a module visit.
The moduleid column contains the same module codes, and the session_id column contains the same session codes, in both datasets.
I processed the PageViews dataset and I would now like to update the ModuleViews dataset accordingly.
For this I need R to match on both moduleid AND session_id in each row, because within one session (e.g. 25) a user can visit several modules (in session 25: modules 1697, 1698 and 1755).
In this case my processing removed all page views of session 25, module 1697.
I now want to remove this row (and all other rows) from the ModuleViews dataset where the combination of moduleid and session_id no longer appears in the PageViews dataset.
I tried the following 3 ways:
ModuleViews <- subset(ModuleViews, ModuleViews$session_id %in% PageViews$session_id &
ModuleViews$moduleid %in% PageViews$moduleid)
ModuleViews <- ModuleViews[(ModuleViews$session_id %in% PageViews$session_id) &&
(ModuleViews$moduleid %in% PageViews$moduleid),]
ModuleViews$moduleid <- ifelse((ModuleViews$session_id %in% PageViews$session_id) &
(ModuleViews$moduleid %in% PageViews$moduleid), ModuleViews$moduleid, NA)
But these do not look at the two columns in combination, only separately, which leaves session 25, module 1697 in the output.
I tried these both with %in% as well as ==, but with == I get a length error (obviously due to the different dataset lengths):
Error: Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
x Input has size 220099 but subscript r has size 2024529.
How can I make it check both conditions together for each row?
TIA!
ModuleViews:
structure(list(session_id = c(19L, 19L, 24L, 25L, 25L, 25L, 28L
), moduleid = c(397L, 902L, 690L, 1697L, 1698L, 1755L, 1271L),
numslidesread = c(1L, 1L, 31L, 2L, 31L, 44L, 3L), totalsecondsspent = c(5L,
13L, 5829L, 10955L, 6942L, 9725L, 667L)), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
PageViews:
structure(list(session_id = c(19L, 19L, 24L, 24L, 24L, 24L, 24L,
24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L,
24L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L,
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L,
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L,
25L), slideitem_id = c(19974L, 53092L, 37143L, 37004L, 37061L,
37055L, 37061L, 37062L, 37073L, 37079L, 37079L, 37080L, 37097L,
37124L, 37131L, 37136L, 37138L, 37143L, 37143L, 37144L, 37145L,
37170L, 65628L, 37191L, 37192L, 85817L, 85818L, 85819L, 85820L,
85821L, 85821L, 85822L, 85823L, 85824L, 85825L, 85826L, 85827L,
85828L, 85829L, 85828L, 85829L, 85830L, 85831L, 85832L, 85833L,
85834L, 85835L, 85836L, 85837L, 85838L, 85839L, 85840L, 85841L,
85842L, 85624L, 85234L, 85235L, 85607L, 85614L, 85619L), moduleid = c(397L,
902L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L,
690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L,
690L, 690L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L,
1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L,
1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L,
1698L, 1698L, 1698L, 1698L, 1755L, 1755L, 1755L, 1755L, 1755L,
1755L), secondsspentonslide = c(5L, 13L, 154L, 9L, 5L, 9L, 248L,
17L, 385L, 209L, 364L, 61L, 81L, 175L, 45L, 352L, 23L, 216L,
35L, 227L, 80L, 375L, 7L, 3L, 3L, 21L, 8L, 43L, 211L, 61L, 37L,
58L, 50L, 96L, 67L, 36L, 21L, 11L, 3L, 7L, 96L, 66L, 9L, 79L,
180L, 144L, 127L, 168L, 22L, 49L, 22L, 51L, 127L, 33L, 19L, 5L,
25L, 73L, 7L, 15L)), row.names = c(NA, -60L), class = c("tbl_df",
"tbl", "data.frame"))
If I'm understanding correctly, you want an inner_join on both session_id and moduleid. This can be done with dplyr, or with base R for the same result.
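A minimal sketch of what such a join might look like, using a cut-down copy of the data from the question (fewer PageViews rows, same key columns). The distinct() step and the names keys, kept_dplyr and kept_base are assumptions added for illustration: reducing PageViews to unique (session_id, moduleid) pairs first keeps the join from repeating each ModuleViews row once per page view.

```r
library(dplyr)

# Cut-down copies of the question's data; only the key columns matter here.
ModuleViews <- data.frame(
  session_id    = c(19L, 19L, 24L, 25L, 25L, 25L, 28L),
  moduleid      = c(397L, 902L, 690L, 1697L, 1698L, 1755L, 1271L),
  numslidesread = c(1L, 1L, 31L, 2L, 31L, 44L, 3L)
)
PageViews <- data.frame(
  session_id = c(19L, 19L, 24L, 24L, 25L, 25L, 25L),
  moduleid   = c(397L, 902L, 690L, 690L, 1698L, 1698L, 1755L)
)

# One row per (session_id, moduleid) pair still present in PageViews.
keys <- distinct(PageViews, session_id, moduleid)

# dplyr: inner join on both columns at once, so the pair must match as a unit.
kept_dplyr <- inner_join(ModuleViews, keys, by = c("session_id", "moduleid"))

# Base R: merge() does an inner join by default (all = FALSE).
kept_base <- merge(ModuleViews, keys, by = c("session_id", "moduleid"))

print(kept_dplyr)
```

With this data, both results drop session 25 / module 1697 and session 28 / module 1271, since neither pair occurs in PageViews. If only filtering is needed, dplyr's semi_join(ModuleViews, PageViews, by = c("session_id", "moduleid")) may be a simpler option, because it keeps matching rows of ModuleViews without adding columns or duplicating rows.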