Remove rows in an R data frame using 2 linked conditions (a combination of values) from another data frame

Posted 2025-01-12 05:23:23

I have two different datasets in R (MRE below).

One contains one log per module visit (ModuleViews) and the other (PageViews) logs each specific page visit within a module visit.

The moduleid column contains the same module codes, and the session_id column contains the same session codes, in both datasets.

I processed the PageViews dataset and I would now like to update the ModuleViews dataset accordingly.

For this I need R to match on both moduleid AND session_id in each row, because within one session (e.g. 25) a user can visit several modules (session 25 covers modules 1697, 1698 and 1755).

In this case my processing removed all page views of session 25, module 1697.

I now want to remove this row (and all other rows) from the ModuleViews dataset whose combination of moduleid and session_id no longer appears in the PageViews dataset.

I tried the following 3 ways:

ModuleViews <- subset(ModuleViews, ModuleViews$session_id %in% PageViews$session_id & 
                         ModuleViews$moduleid %in% PageViews$moduleid)

ModuleViews <- ModuleViews[(ModuleViews$session_id %in% PageViews$session_id) && 
                         (ModuleViews$moduleid %in% PageViews$moduleid),]

ModuleViews$moduleid <- ifelse((ModuleViews$session_id %in% PageViews$session_id) & 
                         (ModuleViews$moduleid %in% PageViews$moduleid), ModuleViews$moduleid, NA) 

But it does not look at both the columns in combination, but rather separately, leaving session 25 module 1697 in the output.

I tried these with both %in% and ==, but with == I get a length error (obviously due to the different dataset lengths):

Error: Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
x Input has size 220099 but subscript r has size 2024529.

How can I make it check both conditions per row?

TIA!

ModuleViews:

structure(list(session_id = c(19L, 19L, 24L, 25L, 25L, 25L, 28L
), moduleid = c(397L, 902L, 690L, 1697L, 1698L, 1755L, 1271L), 
    numslidesread = c(1L, 1L, 31L, 2L, 31L, 44L, 3L), totalsecondsspent = c(5L, 
    13L, 5829L, 10955L, 6942L, 9725L, 667L)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

PageViews:

structure(list(session_id = c(19L, 19L, 24L, 24L, 24L, 24L, 24L, 
24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 
24L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L), slideitem_id = c(19974L, 53092L, 37143L, 37004L, 37061L, 
37055L, 37061L, 37062L, 37073L, 37079L, 37079L, 37080L, 37097L, 
37124L, 37131L, 37136L, 37138L, 37143L, 37143L, 37144L, 37145L, 
37170L, 65628L, 37191L, 37192L, 85817L, 85818L, 85819L, 85820L, 
85821L, 85821L, 85822L, 85823L, 85824L, 85825L, 85826L, 85827L, 
85828L, 85829L, 85828L, 85829L, 85830L, 85831L, 85832L, 85833L, 
85834L, 85835L, 85836L, 85837L, 85838L, 85839L, 85840L, 85841L, 
85842L, 85624L, 85234L, 85235L, 85607L, 85614L, 85619L), moduleid = c(397L, 
902L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 
690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 690L, 
690L, 690L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 
1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 
1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 1698L, 
1698L, 1698L, 1698L, 1698L, 1755L, 1755L, 1755L, 1755L, 1755L, 
1755L), secondsspentonslide = c(5L, 13L, 154L, 9L, 5L, 9L, 248L, 
17L, 385L, 209L, 364L, 61L, 81L, 175L, 45L, 352L, 23L, 216L, 
35L, 227L, 80L, 375L, 7L, 3L, 3L, 21L, 8L, 43L, 211L, 61L, 37L, 
58L, 50L, 96L, 67L, 36L, 21L, 11L, 3L, 7L, 96L, 66L, 9L, 79L, 
180L, 144L, 127L, 168L, 22L, 49L, 22L, 51L, 127L, 33L, 19L, 5L, 
25L, 73L, 7L, 15L)), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

Comments (1)

难得心□动 2025-01-19 05:23:23

If I'm understanding correctly, you want an inner_join. With dplyr:

library(dplyr)
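# distinct() reduces PageViews to one row per (session_id, moduleid) pair,
# so the join cannot duplicate ModuleViews rows; `by` names both keys explicitly
result = ModuleViews %>%
  inner_join(distinct(PageViews, session_id, moduleid),
             by = c("session_id", "moduleid"))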

result
# # A tibble: 5 × 4
#   session_id moduleid numslidesread totalsecondsspent
#        <int>    <int>         <int>             <int>
# 1         19      397             1                 5
# 2         19      902             1                13
# 3         24      690            31              5829
# 4         25     1698            31              6942
# 5         25     1755            44              9725
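
Since the goal is only to filter ModuleViews, a filtering join may express the intent a little more directly. The following is a minimal sketch, not the original answer's method: it uses dplyr's semi_join() and anti_join(), and it rebuilds the data inline so it runs on its own. ModuleViews is copied from the question; PageViews is a condensed stand-in (an assumption for brevity) that keeps only the two key columns and a few repeated rows per pair.

```r
library(dplyr)

# ModuleViews as in the question's MRE
ModuleViews <- tibble(
  session_id        = c(19L, 19L, 24L, 25L, 25L, 25L, 28L),
  moduleid          = c(397L, 902L, 690L, 1697L, 1698L, 1755L, 1271L),
  numslidesread     = c(1L, 1L, 31L, 2L, 31L, 44L, 3L),
  totalsecondsspent = c(5L, 13L, 5829L, 10955L, 6942L, 9725L, 667L)
)

# Condensed stand-in for PageViews (assumption): same distinct pairs as the
# MRE, repeated a few times to mimic several page views per module visit
PageViews <- tibble(
  session_id = rep(c(19L, 19L, 24L, 25L, 25L), times = c(1, 1, 3, 4, 2)),
  moduleid   = rep(c(397L, 902L, 690L, 1698L, 1755L), times = c(1, 1, 3, 4, 2))
)

# semi_join() keeps ModuleViews rows that have at least one match in PageViews
# on BOTH keys; it never duplicates rows and never adds columns
kept <- semi_join(ModuleViews, PageViews, by = c("session_id", "moduleid"))

# anti_join() is the complement: the rows that would be removed
dropped <- anti_join(ModuleViews, PageViews, by = c("session_id", "moduleid"))
```

Because semi_join() only tests for the existence of a match, no distinct() step is needed, and inspecting `dropped` is a quick way to check which module visits are about to be removed.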

Or with base R for the same result:

result = merge(
  ModuleViews,
  unique(PageViews[c("session_id", "moduleid")])
)
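
As for why the attempts in the question leave session 25 / module 1697 in place: each %in% tests one column against the whole other column, so a row passes as soon as its session occurs anywhere in PageViews and its module occurs anywhere in PageViews, even if never together. The sketch below illustrates this in base R and shows one possible pair-wise test using a combined key. Note the assumptions: PageViews is condensed to its key columns, and its last row (session 30, module 1697) is hypothetical, added only so that module 1697 exists under a different session, as presumably happens in the full data. The helper `key()` is likewise just an illustrative name.

```r
# ModuleViews as in the question's MRE (plain data.frame here)
ModuleViews <- data.frame(
  session_id        = c(19L, 19L, 24L, 25L, 25L, 25L, 28L),
  moduleid          = c(397L, 902L, 690L, 1697L, 1698L, 1755L, 1271L),
  numslidesread     = c(1L, 1L, 31L, 2L, 31L, 44L, 3L),
  totalsecondsspent = c(5L, 13L, 5829L, 10955L, 6942L, 9725L, 667L)
)

# Condensed PageViews; the LAST row is hypothetical (not in the MRE)
PageViews <- data.frame(
  session_id = c(19L, 19L, 24L, 24L, 25L, 25L, 25L, 30L),
  moduleid   = c(397L, 902L, 690L, 690L, 1698L, 1698L, 1755L, 1697L)
)

# Column-wise test: each column is checked independently of the other
separate <- ModuleViews$session_id %in% PageViews$session_id &
  ModuleViews$moduleid %in% PageViews$moduleid

# Pair-wise test: build one key per row from both columns, then compare keys
key <- function(df) paste(df$session_id, df$moduleid, sep = "_")
paired <- key(ModuleViews) %in% key(PageViews)

# Rows the column-wise test keeps although the pair never occurs in PageViews
wrongly_kept <- ModuleViews[separate & !paired, c("session_id", "moduleid")]

# Filtered result using the pair-wise test
result <- ModuleViews[paired, ]
```

Here `wrongly_kept` contains exactly the session 25 / module 1697 row. The join-based versions above do the same pair-wise matching without building string keys, which is usually the tidier choice on large data.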