Conditional string concatenation within the same column in R
I am new to R and have a very large irregular column in a data frame like this:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations
Page one: characters
Page two: index
BOOK III: General Principles
BOOK III: General Principles
Page one: invitation
I need to concatenate this column to look like this:
section
BOOK I: Introduction
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation
Basically, the goal is to take the value of the heading row above (identified by a condition, for example a regular expression) and prepend it to each of the rows below it, switching to the new heading whenever one appears, but I really don't know how to do it.
Thanks in advance.
Comments (5)
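半步萧音过轻尘 2025-01-23 21:34:42
Using data.table:
library(data.table)
library(magrittr)  # provides %>%, which data.table does not export
# copy non-page rows into `header`; page rows stay NA
setDT(x)[!grepl("^Page", section), header := section] %>%
  # carry the last header forward over the NA rows
  .[, header := zoo::na.locf(header)] %>%
  # prefix each page row with its header
  .[section != header, header := paste0(header, " / ", section)] %>%
  .[, .(section = header)] %>%
  .[]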
1: BOOK I: Introduction
2: BOOK I: Introduction / Page one: presentation
3: BOOK I: Introduction / Page two: acknowledgments
4: MAGAZINE II: Considerations
5: MAGAZINE II: Considerations / Page one: characters
6: MAGAZINE II: Considerations / Page two: index
7: BOOK III: General Principles
8: BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
清风挽心 2025-01-23 21:34:42
A rolling join can do this. In data.table:
library(data.table)
setDT(x)  # x starts out as a data.frame; := requires a data.table
# add a row-number column to join on
x[, row := .I]
# pick out just the title rows. It looks like these start with either "BOOK" or "MAGAZINE"
books_magazines <- x[grepl("^BOOK|^MAGAZINE", section),
                     .(row, book_magazine = section)]
# join the 2 tables, using a rolling join to add the title row to subsequent rows
both_cols <- books_magazines[x, on = .(row), roll = TRUE]
# concatenate the 2 columns together where necessary, leave it alone if it's the title row
result <- both_cols[, .(
  section_string = fifelse(book_magazine == section,
                           book_magazine,
                           sprintf("%s / %s", book_magazine, section))
)]
This gives:
> result$section_string
[1] "BOOK I: Introduction"
[2] "BOOK I: Introduction / Page one: presentation"
[3] "BOOK I: Introduction / Page two: acknowledgments"
[4] "MAGAZINE II: Considerations"
[5] "MAGAZINE II: Considerations / Page one: characters"
[6] "MAGAZINE II: Considerations / Page two: index"
[7] "BOOK III: General Principles"
[8] "BOOK III: General Principles"
[9] "BOOK III: General Principles / Page one: invitation"
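As a minimal sketch of what roll = TRUE does in the join above, the toy example below uses hypothetical tables (titles, all_rows) and a hypothetical column (title) that are not part of the original answer. Rows of all_rows without an exact match on row pick up the most recent earlier title, which is the behaviour the answer relies on.

```r
library(data.table)

# Hypothetical lookup table: a title starts at row 1 and another at row 4
titles <- data.table(row = c(1L, 4L), title = c("A", "B"))

# Hypothetical full table: every row number from 1 to 6
all_rows <- data.table(row = 1:6)

# Rolling join: rows with no exact match take the last title at or before them
filled <- titles[all_rows, on = .(row), roll = TRUE]

filled$title
#> [1] "A" "A" "A" "B" "B" "B"
```

Without roll = TRUE, rows 2, 3, 5 and 6 would get NA for title, because they have no exact match in titles.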
柠栀 2025-01-23 21:34:42
You can do this:
unlist(lapply(split(x$section, cumsum(grepl('^[A-Z]{3}', x$section))),
              function(y) {
                if (length(y) == 1) return(y)
                else c(y[1], paste(y[1], y[-1], sep = " / "))
              }), use.names = FALSE)
#> [1] "BOOK I: Introduction"
#> [2] "BOOK I: Introduction / Page one: presentation"
#> [3] "BOOK I: Introduction / Page two: acknowledgments"
#> [4] "MAGAZINE II: Considerations"
#> [5] "MAGAZINE II: Considerations / Page one: characters"
#> [6] "MAGAZINE II: Considerations / Page two: index"
#> [7] "BOOK III: General Principles"
#> [8] "BOOK III: General Principles"
#> [9] "BOOK III: General Principles / Page one: invitation"
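To see why the split() call above groups the rows correctly, it may help to look at the intermediate grouping vector. The sketch below rebuilds the question's data as a plain character vector; the names is_title and group_id are illustrative and do not appear in the answer.

```r
section <- c("BOOK I: Introduction", "Page one: presentation",
             "Page two: acknowledgments", "MAGAZINE II: Considerations",
             "Page one: characters", "Page two: index",
             "BOOK III: General Principles", "BOOK III: General Principles",
             "Page one: invitation")

# TRUE for rows that start with three capital letters (the title rows)
is_title <- grepl("^[A-Z]{3}", section)

# Each TRUE bumps the running total, so a title and its pages share one id
group_id <- cumsum(is_title)

group_id
#> [1] 1 1 1 2 2 2 3 4 4
```

The repeated "BOOK III" title gets its own group (3) of length one, which is why the answer's function returns single-element groups unchanged.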
一梦浮鱼 2025-01-23 21:34:42
A slightly simpler data.table approach:
library(data.table)
setDT(x)
x[, g := cumsum(grepl('(BOOK|MAGAZINE)', section))]
x[, section := ifelse(seq_along(section) == 1,
                      section, paste(section[1], section, sep = ' / ')), by = .(g)]
x[, g := NULL]
The output is:
> x
section
1: BOOK I: Introduction
2: BOOK I: Introduction / Page one: presentation
3: BOOK I: Introduction / Page two: acknowledgments
4: MAGAZINE II: Considerations
5: MAGAZINE II: Considerations / Page one: characters
6: MAGAZINE II: Considerations / Page two: index
7: BOOK III: General Principles
8: BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
Here is one method:
Created on 2022-03-25 by the reprex package (v2.0.1)
Here, we first determine if the section is a section title or a page title and save that as TRUE or FALSE.
Then, we label the pages belonging to a section by using cumsum() (cumulative sum). When we add up TRUE and FALSE values, TRUE (here, sections) become 1 and increment the cumulative sum, but FALSE (here, pages) become 0 and don't increment the cumulative sum, so all of the pages within a specific section receive the same value.
Lastly, we make a new section variable, this time using group_by() and if_else() to conditionally set the value. If isSection is TRUE, we just keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section from the group with the existing value of section, separated by " / ".
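The steps described above can be sketched with dplyr roughly as follows. This is a reconstruction under assumptions, not the answerer's original code: isSection, group_by() and if_else() come from the explanation, while the regular expression used to detect title rows and the helper column name grp are guesses made for illustration.

```r
library(dplyr)

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation",
                            "Page two: acknowledgments", "MAGAZINE II: Considerations",
                            "Page one: characters", "Page two: index",
                            "BOOK III: General Principles", "BOOK III: General Principles",
                            "Page one: invitation"))

result <- x %>%
  mutate(
    # assumed pattern: title rows start with "BOOK" or "MAGAZINE"
    isSection = grepl("^(BOOK|MAGAZINE)", section),
    # hypothetical helper column: running count of titles seen so far
    grp = cumsum(isSection)
  ) %>%
  group_by(grp) %>%
  # keep titles as they are; prefix pages with the first value in their group
  mutate(section = if_else(isSection,
                           section,
                           paste(first(section), section, sep = " / "))) %>%
  ungroup() %>%
  select(section)

result$section[2]
#> [1] "BOOK I: Introduction / Page one: presentation"
```

Because each title row starts a new group, first(section) within a group is always that group's title, so page rows never pick up a title from a different section.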