Efficient way to split a large string in R
I have a huge string (> 500MB); it is actually an entire book collection in one string. I have some meta information in another data frame, e.g. page numbers, (different) authors and titles. I am trying to detect the title strings in the huge string and split it by title. I assume titles are unique.
The data looks like this:
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
# a data frame of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"))
mydf
  page    title
1    1    Lorem
2    2 maecenas
mygoal <- mydf # text that comes after the title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
How can I split the string, in the most efficient way, so that everything between the first and second titles becomes the first text element, everything between the second and third titles becomes the second text element, and so on?
In case you want to do the operation in a piped tidyverse way, you could try using stringr::str_extract with some regex:
Yielding:
If performance is a concern, a similar approach with data.table would be:
Resulting in:
EDIT
Added for better performance options:
Generally, str_split or str_split_fixed will be a faster way to go than str_extract. The problem with str_split is that a regex with many alternations (pipes) will also slow down the process, so another solution would be to first replace all the titles in the string with some fixed character string, and then split on that string. Another thing you can do to speed up the splitting is to use str_split_fixed and specify in advance how many pieces to split into.
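The "replace the titles with a fixed string, then split" idea can be sketched roughly as follows. This is not the answerer's original code: it swaps in base R's gsub and strsplit with fixed = TRUE for the stringr functions so that it runs without extra packages. The marker string is a hypothetical choice and is assumed never to occur in the text; the sketch also assumes that the titles appear in the string in the same order as in mydf and nowhere else in the body text.

```r
# Toy data from the question.
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"),
                   stringsAsFactors = FALSE)

# Hypothetical marker; assumed never to occur in the real text.
marker <- "<<SPLIT>>"

# Replace every title with the marker using literal (non-regex) matching.
marked <- mystring
for (title in mydf$title) {
  marked <- gsub(title, marker, marked, fixed = TRUE)
}

# Split on the literal marker. The first piece is whatever precedes the
# first title (an empty string here), so it is dropped.
pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]
texts <- trimws(pieces[-1])

# Guard: one text per title (fails if a title is missing, repeated,
# or followed by no text at the very end of the string).
stopifnot(length(texts) == nrow(mydf))

mygoal <- mydf
mygoal$text <- texts
mygoal
```

Each gsub call copies the whole string, so with a 500MB input and many titles the replacement loop may itself become the bottleneck; it would be worth timing on a sample before committing to it.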
We could use strsplit
-output
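The code for this answer did not survive, so the following is only a guess at what a strsplit-based solution might look like, not the answerer's own. It builds one regex that alternates over all titles and splits on it in a single pass. Wrapping each title in \Q...\E (with perl = TRUE) is an added assumption so that regex metacharacters in titles are matched literally; it also assumes titles occur in the string in the same order as in mydf.

```r
# Toy data from the question.
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"),
                   stringsAsFactors = FALSE)

# One alternation pattern over all titles, each quoted literally.
pattern <- paste0("\\Q", mydf$title, "\\E", collapse = "|")

# Split once on any title. The first piece is the text before the
# first title (an empty string here), so it is dropped.
pieces <- strsplit(mystring, pattern, perl = TRUE)[[1]]

mygoal <- mydf
mygoal$text <- trimws(pieces[-1])
mygoal
```

As the other answer points out, a pattern with very many alternatives can be slow on a string this large, so this form is probably best suited to a moderate number of titles.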