An efficient way to split a large string in R

Posted on 2025-01-17 02:57:59

I have a huge string (> 500 MB); it is actually an entire book collection in one. I have some meta information in another data frame, e.g. page numbers, (different) authors and titles. I am trying to detect the title strings in the huge string and split it by title. I assume titles are unique.

The data looks like this:

mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"

# a dataframe of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c( "Lorem", "maecenas"))
mydf

  page    title
1    1    Lorem
2    2 maecenas

mygoal <- mydf  # text that comes after the title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal 

  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus

How can I split the string, in the most efficient way, so that everything between the first and second titles becomes the first text element, everything between the second and third titles becomes the second text element, and so on?


自我难过 2025-01-24 02:57:59

If you want to do this in a piped tidyverse style, you could try stringr::str_extract with a regex:

library(dplyr)
library(stringr)
library(glue)

mydf |>  
  mutate(next_title = lead(title, default = "$")) |> 
  mutate(text = str_extract(mystring, glue::glue("(?<={title}\\s?)(.*)(?={next_title})"))) |> 
  select(-next_title)

Yielding:

  page    title                                      text
1    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
2    2 maecenas          habitasse ultrices aenean tempus

If performance is a concern, a similar approach with data.table would be:

library(data.table)
library(stringr)
library(glue)

mydt <- setDT(mydf)

mydt[, next_title := shift(title, fill = "$", type = "lead")][
  ,text := str_extract(..mystring, glue_data(.SD,"(?<={title}\\s?)(.*)(?={next_title})"))][,
    !("next_title")]

Resulting in:

   page    title                                      text
1:    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
2:    2 maecenas          habitasse ultrices aenean tempus

EDIT

Added options for better performance:

Generally, str_split or str_split_fixed will be a faster way to go than str_extract.

The problem with str_split is that a regex with many alternatives joined by | will also slow down the process, so another solution is to first replace all the titles in the string with some fixed marker string, and then split on that marker. Another thing you can do to speed up the splitting is to use str_split_fixed and specify in advance how many pieces to produce.

# create named character vector for str_replace_all function
split_at <- rep("@@",nrow(mydf))
names(split_at) <- mydf$title
mystring <- str_replace_all(mystring, split_at)

# use fixed() in str_split so the marker is matched literally
mydf$text <- str_split(mystring,fixed("@@ "))[[1]][-1]

# Alternative (maybe faster) define number of splits by nrow
mydf$text <- str_split_fixed(mystring,fixed("@@ "), n = nrow(mydf)+1)[,-1]


## using str_split_fixed in data.table
mydt <- setDT(mydf)
mydt[, text := 
       str_split_fixed(mystring,fixed("@@ "), nrow(mydt)+1)[,-1]]
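A possible variation on the replace-then-split idea, sketched in base R only. It is not part of the answer above; it assumes that titles should be matched literally rather than as regular expressions (which matters if a title contains characters such as . or parentheses), and it checks that the marker does not already occur in the text. The names placeholder, marked and texts are made up for this illustration.

```r
# Toy data from the question
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
titles <- c("Lorem", "maecenas")

# Hypothetical marker; it must not already occur anywhere in the text
placeholder <- "@@"
stopifnot(!grepl(placeholder, mystring, fixed = TRUE))

# Replace the first occurrence of each title literally (fixed = TRUE, no regex)
marked <- mystring
for (ttl in titles) {
  marked <- sub(ttl, placeholder, marked, fixed = TRUE)
}

# Split on the literal marker; the first piece is whatever precedes the first title
pieces <- strsplit(marked, placeholder, fixed = TRUE)[[1]][-1]
texts <- trimws(pieces)
texts
```

Note that each sub() call copies the whole string, so with a 500 MB input and many titles this loop may be slow or memory hungry; it is meant to show the literal-matching idea, not to be the fastest option.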
生活了然无味 2025-01-24 02:57:59

We could use strsplit, with the titles collapsed into a single alternation pattern:

mygoal$text <- trimws(strsplit(mystring,
      paste(mydf$title, collapse = "|"))[[1]][-1])

-output

> mygoal
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
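Since the titles are assumed to be unique, another option is to avoid regex alternation entirely: find where each title starts with a literal search and cut the string at those positions. The following is only a sketch under that assumption, using base R; the helper vectors starts, text_from and text_to are names invented for this example.

```r
# Toy data from the question
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"),
                   stringsAsFactors = FALSE)

# Position of the first literal occurrence of each title (fixed = TRUE, no regex)
starts <- vapply(mydf$title,
                 function(ttl) as.integer(regexpr(ttl, mystring, fixed = TRUE)),
                 integer(1), USE.NAMES = FALSE)
stopifnot(all(starts > 0))  # regexpr returns -1 when a title is not found

# Each text runs from just after its title to just before the next title
text_from <- starts + nchar(mydf$title)
ord <- order(starts)
text_to <- integer(length(starts))
text_to[ord] <- c(starts[ord][-1] - 1L, nchar(mystring))

mydf$text <- trimws(substring(mystring, text_from, text_to))
mydf
```

This needs one scan of the string per title, so whether it beats a single strsplit call on real data would have to be measured; its main advantage is that titles containing regex metacharacters need no escaping.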