Conditional string concatenation within the same column in R

Posted on 01-16 21:34

I am new to R and have a very large irregular column in a data frame like this:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations 
Page one: characters
Page two: index
BOOK III: General Principles
BOOK III: General Principles
Page one: invitation

I need to concatenate this column to look like this:

section
BOOK I: Introduction 
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation

Basically, the goal is to pick up the value of a heading row based on a condition (probably a regular expression), carry it down, and concatenate it with each row below it until the next heading appears, but I really don't know how to do it.

Thanks in advance.

差↓一点笑了 2025-01-23 21:34:42

Here is one method:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

x <- dplyr::mutate(x,
  isSection = stringr::str_starts(section, "Page", negate = TRUE),
  sectionNum = cumsum(isSection)
) |> 
  dplyr::group_by(sectionNum) |> 
  dplyr::mutate(newSection = dplyr::if_else(
    condition = isSection, 
    true = section, 
    false = paste(dplyr::first(section), section, sep = " / ")
  )) |>
  dplyr::ungroup()

x
#> # A tibble: 9 × 4
#>   section                      isSection sectionNum newSection                  
#>   <chr>                        <lgl>          <int> <chr>                       
#> 1 BOOK I: Introduction         TRUE               1 BOOK I: Introduction        
#> 2 Page one: presentation       FALSE              1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments    FALSE              1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations  TRUE               2 MAGAZINE II: Considerations 
#> 5 Page one: characters         FALSE              2 MAGAZINE II: Considerations…
#> 6 Page two: index              FALSE              2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE               3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE               4 BOOK III: General Principles
#> 9 Page one: invitation         FALSE              4 BOOK III: General Principle…

Created on 2022-03-25 by the reprex package (v2.0.1)

Here, we first determine whether each value of section is a section title or a page title and save that as TRUE or FALSE in isSection.

Then, we label the pages belonging to a section by using cumsum() (cumulative sum). When TRUE and FALSE values are summed, each TRUE (here, a section title) counts as 1 and increments the cumulative sum, while each FALSE (here, a page) counts as 0 and leaves it unchanged, so all of the pages within a specific section receive the same value as their section title.
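To make the cumsum() step concrete, here is a minimal base R sketch on a toy logical vector. The names is_heading and group_id are made up for illustration and are not part of the answer's code.

```r
# Toy vector: TRUE marks a heading row, FALSE marks a page row.
is_heading <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

# Logical values are coerced to 1/0 when summed, so the running total
# only increases on heading rows. Each page row therefore inherits the
# number of the heading above it.
group_id <- cumsum(is_heading)

print(group_id)
#> [1] 1 1 1 2 2 3 4 4
```

Two consecutive headings (positions 6 and 7) get different group numbers, which is why the repeated "BOOK III" rows end up in separate groups in the output above.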

Lastly, we make a new section variable, this time using group_by() and if_else() to conditionally set the value. If isSection is TRUE, we just keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section from the group with the existing value of section, separated by " / ".
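The same three steps can also be sketched in base R without dplyr or stringr. This is only an illustration of the logic described above, not the answer's own code. It assumes that page rows start with "Page" and that the first row is a heading; the variable names are hypothetical.

```r
section <- c("BOOK I: Introduction",
             "Page one: presentation",
             "MAGAZINE II: Considerations",
             "Page one: characters")

# Step 1: flag heading rows (assumption: anything not starting with "Page").
is_heading <- !startsWith(section, "Page")

# Step 2: running count of headings gives each row its group number.
group_id <- cumsum(is_heading)

# Step 3: look up the heading for each row by its group number, then
# prepend it to page rows only. This indexing relies on the first row
# being a heading, so that group_id never contains 0.
heading <- section[is_heading][group_id]
new_section <- ifelse(is_heading, section,
                      paste(heading, section, sep = " / "))

print(new_section)
```

Indexing the vector of headings by group_id plays the role of dplyr::first(section) within each group.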

半步萧音过轻尘 2025-01-23 21:34:42

Using data.table:

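library(data.table)
library(magrittr)  # provides the %>% pipe used below

# 1. copy heading rows (anything not starting with "Page") into a new column
setDT(x)[!grepl("^Page", section), header := section] %>%
  # 2. fill the NA gaps with the last heading seen above
  .[, header := zoo::na.locf(header)] %>%
  # 3. on page rows, append the page title to its heading
  .[section != header, header := paste0(header, " / ", section)] %>%
  # 4. keep only the combined column, renamed to section, and print it
  .[, .(section = header)] %>%
  .[]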

1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
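The key step in this answer is zoo::na.locf(), which carries the last non-missing value forward. As a rough illustration of that idea only (not how zoo implements it), the same fill can be sketched in base R. The sketch assumes the first element is not NA, and the names header, not_na, idx and filled are chosen for this example.

```r
header <- c("BOOK I: Introduction", NA, NA,
            "MAGAZINE II: Considerations", NA)

# Position of each non-NA value, and 0 where the value is missing.
not_na <- !is.na(header)
pos <- seq_along(header) * not_na

# The running maximum gives, for every row, the index of the most
# recent non-NA value at or above it.
idx <- cummax(pos)

# Index back into the vector to fill the gaps.
# (Assumes header[1] is not NA; an index of 0 would drop the element.)
filled <- header[idx]

print(filled)
```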

清风挽心 2025-01-23 21:34:42

A rolling join could achieve this. In data.table:


library( data.table )

# convert x to a data.table by reference (needed before := can be used)
setDT( x )

# add a row column for joining by reference
x[ , row := .I ]

# pick out just the title rows. It looks like these start with either "BOOK" or "MAGAZINE"
books_magazines <- x[ grepl("^BOOK|^MAGAZINE", section),
                      .(row, book_magazine = section) ]

# join the 2 tables, using a rolling join to add the title row to subsequent rows
both_cols <- books_magazines[ x, on = .(row), roll = TRUE ]

# concatenate the 2 columns together where necessary, leave it alone if it's the title row
result <- both_cols[ , .(
    section_string = fifelse( book_magazine == section,
                              book_magazine,
                              sprintf("%s / %s", book_magazine, section) )
) ]

This gives:

> result$section_string

[1] "BOOK I: Introduction"                               
[2] "BOOK I: Introduction / Page one: presentation"      
[3] "BOOK I: Introduction / Page two: acknowledgments"   
[4] "MAGAZINE II: Considerations"                        
[5] "MAGAZINE II: Considerations / Page one: characters" 
[6] "MAGAZINE II: Considerations / Page two: index"      
[7] "BOOK III: General Principles"                       
[8] "BOOK III: General Principles"                       
[9] "BOOK III: General Principles / Page one: invitation"

柠栀 2025-01-23 21:34:42

You can do:

unlist(lapply(split(x$section, cumsum(grepl('^[A-Z]{3}', x$section))), 
              function(y) {
                  if(length(y) == 1) return(y)
                  else c(y[1], paste(y[1], y[-1], sep = " / "))
                }), use.names = FALSE)
#> [1] "BOOK I: Introduction"                               
#> [2] "BOOK I: Introduction / Page one: presentation"      
#> [3] "BOOK I: Introduction / Page two: acknowledgments"   
#> [4] "MAGAZINE II: Considerations"                        
#> [5] "MAGAZINE II: Considerations / Page one: characters" 
#> [6] "MAGAZINE II: Considerations / Page two: index"      
#> [7] "BOOK III: General Principles"                       
#> [8] "BOOK III: General Principles"                       
#> [9] "BOOK III: General Principles / Page one: invitation"

一梦浮鱼 2025-01-23 21:34:42

A slightly simpler data.table approach:

library(data.table)
setDT(x)

# anchor the pattern so only rows that start with BOOK or MAGAZINE open a new group
x[, g := cumsum(grepl('^(BOOK|MAGAZINE)', section))]
x[, section := ifelse(seq_along(section) == 1,
    section, paste(section[1], section, sep = ' / ')), by = .(g)]
x[, g := NULL]

The output is:

> x
                                               section
1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
