Chopping a string into a vector of fixed width character elements

Posted on 2024-08-21 14:13:19

I have an object containing a text string:

x <- "xxyyxyxy"

and I want to split that into a vector with each element containing two letters:

[1] "xx" "yy" "xy" "xy"

It seems like strsplit should be my ticket, but since I have no regular expression foo, I can't figure out how to make this function chop the string up into chunks the way I want. How should I do this?

Comments (13)

兔姬 2024-08-28 14:13:19

Using substring is the best approach:

substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))

But here's a solution with plyr:

library("plyr")
laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))

三寸金莲 2024-08-28 14:13:19

Here is a fast solution that splits the string into characters, then pastes together the even elements and the odd elements.

x <- "xxyyxyxy"
sst <- strsplit(x, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
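
The subsetting works because R recycles the logical index c(TRUE, FALSE) along sst, which picks the characters at odd positions, while c(FALSE, TRUE) picks those at even positions. The following is a minimal sketch of the same split-then-paste idea for an arbitrary chunk width. The helper name chunk_chars and its argument n are hypothetical and not part of the original answer.

```r
# Hypothetical helper: split into single characters, then glue them back
# together in groups of n. A shorter final chunk is kept as-is.
chunk_chars <- function(x, n = 2L) {
  chars <- strsplit(x, "")[[1]]
  # Group id for each character: 1, 1, 2, 2, ... when n = 2
  group <- ceiling(seq_along(chars) / n)
  unname(vapply(split(chars, group), paste, character(1), collapse = ""))
}

x <- "xxyyxyxy"
sst <- strsplit(x, "")[[1]]
sst[c(TRUE, FALSE)]   # characters at odd positions
sst[c(FALSE, TRUE)]   # characters at even positions

chunk_chars(x, 2)
chunk_chars("abcde", 2)
```

This version trades some speed for generality, so it is not expected to match the two-character timings reported in the benchmarks.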

Benchmark Setup:

library(microbenchmark)

GSee <- function(x) {
  sst <- strsplit(x, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

Shane1 <- function(x) {
  substring(x, seq(1,nchar(x),2), seq(2,nchar(x),2))
}

library("plyr")
Shane2 <- function(x) {
  laply(seq(1,nchar(x),2), function(i) substr(x, i, i+1))
}

seth <- function(x) {
  strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]
}

geoffjentry <- function(x) {
  idx <- 1:nchar(x)  
  odds <- idx[(idx %% 2) == 1]  
  evens <- idx[(idx %% 2) == 0]  
  substring(x, odds, evens)  
}

drewconway <- function(x) {
  c<-strsplit(x,"")[[1]]
  sapply(seq(2,nchar(x),by=2),function(y) paste(c[y-1],c[y],sep=""))
}

KenWilliams <- function(x) {
  n <- 2
  sapply(seq(1,nchar(x),by=n), function(xx) substr(x, xx, xx+n-1))
}

RichardScriven <- function(x) {
  regmatches(x, gregexpr("(.{2})", x))[[1]]
}

Benchmark 1:

x <- "xxyyxyxy"

microbenchmark(
  GSee(x),
  Shane1(x),
  Shane2(x),
  seth(x),
  geoffjentry(x),
  drewconway(x),
  KenWilliams(x),
  RichardScriven(x)
)

# Unit: microseconds
#               expr      min        lq    median        uq      max neval
#            GSee(x)    8.032   12.7460   13.4800   14.1430   17.600   100
#          Shane1(x)   74.520   80.0025   84.8210   88.1385  102.246   100
#          Shane2(x) 1271.156 1288.7185 1316.6205 1358.5220 3839.300   100
#            seth(x)   36.318   43.3710   45.3270   47.5960   67.536   100
#     geoffjentry(x)    9.150   13.5500   15.3655   16.3080   41.066   100
#      drewconway(x)   92.329   98.1255  102.2115  105.6335  115.027   100
#     KenWilliams(x)   77.802   83.0395   87.4400   92.1540  163.705   100
#  RichardScriven(x)   55.034   63.1360   65.7545   68.4785  108.043   100

Benchmark 2:

Now, with bigger data.

x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace=TRUE), collapse="")

microbenchmark(
  GSee(x),
  Shane1(x),
  Shane2(x),
  seth(x),
  geoffjentry(x),
  drewconway(x),
  KenWilliams(x),
  RichardScriven(x),
  times=3
)

# Unit: milliseconds
#               expr          min            lq       median            uq          max neval
#            GSee(x)    29.029226    31.3162690    33.603312    35.7046155    37.805919     3
#          Shane1(x) 11754.522290 11866.0042600 11977.486230 12065.3277955 12153.169361     3
#          Shane2(x) 13246.723591 13279.2927180 13311.861845 13371.2202695 13430.578694     3
#            seth(x)    86.668439    89.6322615    92.596084    92.8162885    93.036493     3
#     geoffjentry(x) 11670.845728 11681.3830375 11691.920347 11965.3890110 12238.857675     3
#      drewconway(x)   384.863713   438.7293075   492.594902   515.5538020   538.512702     3
#     KenWilliams(x) 12213.514508 12277.5285215 12341.542535 12403.2315015 12464.920468     3
#  RichardScriven(x) 11549.934241 11730.5723030 11911.210365 11989.4930080 12067.775651     3

原谅我要高飞 2024-08-28 14:13:19

How about

strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]

Basically, add a separator (here " ") after every two characters, and then use strsplit to split on it.
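
To make the two steps visible, here is a short sketch that stores the intermediate string before splitting it. The variable name spaced is only for illustration. The approach assumes the input contains no spaces and only alphanumeric characters, since the pattern matches [[:alnum:]] only.

```r
x <- "xxyyxyxy"

# Step 1: append a space after every run of two alphanumeric characters
spaced <- gsub("([[:alnum:]]{2})", "\\1 ", x)
spaced

# Step 2: split on the inserted space; strsplit() drops the empty piece
# that would otherwise follow the trailing separator
chunks <- strsplit(spaced, " ")[[1]]
chunks
```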

魔法少女 2024-08-28 14:13:19

strsplit is going to be problematic. Look at a regexp like this:

strsplit(z, '[[:alnum:]]{2}')  

It will split at the right points, but nothing is left, because strsplit discards the text that the pattern matches.

You could use substring & friends

z <- 'xxyyxyxy'  
idx <- 1:nchar(z)  
odds <- idx[(idx %% 2) == 1]  
evens <- idx[(idx %% 2) == 0]  
substring(z, odds, evens)  
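
A small sketch contrasting the two calls may help here. It only reuses the functions from this answer, and the variable names res and chunks are illustrative.

```r
z <- "xxyyxyxy"

# The pattern matches every two-character chunk, and strsplit() throws
# away whatever the pattern matches, so only empty strings remain
res <- strsplit(z, "[[:alnum:]]{2}")[[1]]
res
nchar(res)

# substring() with odd start positions and even stop positions keeps
# the chunks themselves
idx <- seq_len(nchar(z))
chunks <- substring(z, idx[idx %% 2 == 1], idx[idx %% 2 == 0])
chunks
```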

小帐篷 2024-08-28 14:13:19

Here's one way, but not using regexen:

a <- "xxyyxyxy"
n <- 2
sapply(seq(1,nchar(a),by=n), function(x) substr(a, x, x+n-1))

☆獨立☆ 2024-08-28 14:13:19

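A word of caution with substring: if the string length is not a multiple of the requested chunk length n, the stop positions run out one element early and the final chunk is lost. Adding +(n-1) to the upper limit of the second sequence fixes this: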

substring(x,seq(1,nchar(x),n),seq(n,nchar(x)+n-1,n)) 
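
As a worked example, consider a five-character string with n = 2. This is a sketch, and the input "abcde" is chosen only for illustration.

```r
x <- "abcde"   # five characters, not a multiple of n
n <- 2

# Start positions are 1, 3, 5, but the unadjusted stop positions are only
# 2, 4. substring() recycles the shorter vector, so the final element is
# requested as substring(x, 5, 2) and comes out empty.
seq(1, nchar(x), n)
seq(n, nchar(x), n)

# With + (n - 1) the stop positions are 2, 4, 6, and substring() simply
# stops at the end of the string for the last one
chunks <- substring(x, seq(1, nchar(x), n), seq(n, nchar(x) + n - 1, n))
chunks
```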

篱下浅笙歌 2024-08-28 14:13:19

Total hack, JD, but it gets it done

x <- "xxyyxyxy"
c<-strsplit(x,"")[[1]]
sapply(seq(2,nchar(x),by=2),function(y) paste(c[y-1],c[y],sep=""))
[1] "xx" "yy" "xy" "xy"

趁年轻赶紧闹 2024-08-28 14:13:19

A helper function:

fixed_split <- function(text, n) {
  strsplit(text, paste0("(?<=.{",n,"})"), perl=TRUE)
}

fixed_split(x, 2)
[[1]]
[1] "xx" "yy" "xy" "xy"

墨洒年华 2024-08-28 14:13:19

Using C++ one can be even faster. Comparing with GSee's version:

GSee <- function(x) {
  sst <- strsplit(x, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

rstub <- Rcpp::cppFunction( code = '
CharacterVector strsplit2(const std::string& hex) {
  unsigned int length = hex.length()/2;
  CharacterVector res(length);
  for (unsigned int i = 0; i < length; ++i) {
    res(i) = hex.substr(2*i, 2);
  }
  return res;
}')

x <- "xxyyxyxy"
all.equal(GSee(x), rstub(x))
#> [1] TRUE
microbenchmark::microbenchmark(GSee(x), rstub(x))
#> Unit: microseconds
#>      expr   min     lq      mean median     uq       max neval
#>   GSee(x) 4.272 4.4575  41.74284 4.5855 4.7105  3702.289   100
#>  rstub(x) 1.710 1.8990 139.40519 2.0665 2.1250 13722.075   100

set.seed(42)
x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace = TRUE), collapse = "")
all.equal(GSee(x), rstub(x))
#> [1] TRUE
microbenchmark::microbenchmark(GSee(x), rstub(x))
#> Unit: milliseconds
#>      expr       min        lq      mean    median       uq       max neval
#>   GSee(x) 17.931801 18.431504 19.282877 18.738836 19.47943 27.191390   100
#>  rstub(x)  3.197587  3.261109  3.404973  3.341099  3.45852  4.872195   100

梦里泪两行 2024-08-28 14:13:19

Well, I used the following pseudo-code to fulfill this task:

  1. Insert a special sequence at each chunk of length n.
  2. Split the string by said sequence.

In code, I did

chopS <- function( text, chunk_len = 2, seqn)
{
    # Specify select and replace patterns
    pattern <- paste("(.{",chunk_len,"})", sep = "")
    replace <- paste("\\1", seqn, sep = "")

    # Insert sequence with replaced pattern, then split by the sequence
    interp_text <- gsub( pattern, replace, text)
    strsplit( interp_text, seqn)
}

This returns a list with the split vector inside, though, not a vector.
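
If a plain character vector is wanted for a single input string, the list can be unwrapped with [[1]]. The following is a minimal sketch of the insert-then-split idea under that assumption. The name chop_vec and the default separator "|" are hypothetical, and the separator is assumed not to occur in the input.

```r
# Sketch of the insert-then-split approach, returning a character vector.
# chop_vec and its default sep are illustrative, not from the answer above.
chop_vec <- function(text, chunk_len = 2, sep = "|") {
  pattern <- paste0("(.{", chunk_len, "})")
  # Append the separator after every chunk of chunk_len characters
  marked <- gsub(pattern, paste0("\\1", sep), text)
  # fixed = TRUE treats sep literally; [[1]] unwraps the one-element list
  strsplit(marked, sep, fixed = TRUE)[[1]]
}

chop_vec("xxyyxyxy")
chop_vec("abcde", chunk_len = 2)
```

For a vector of several strings, dropping the [[1]] keeps one list element per input, which matches what the original function returns.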

ゃ懵逼小萝莉 2024-08-28 14:13:19

From my testing, the code below is faster than the previous methods that were benchmarked. stri_sub is pretty fast, and seq.int is better than seq. It's also easy to change the size of the chunks by changing all the 2Ls to something else.

library(stringi)

split_line <- function(x) {
  row_length <- stri_length(x)
  stri_sub(x, seq.int(1L, row_length, 2L), seq.int(2L, row_length, 2L))
}

I didn't notice a difference when string chunks were 2 characters long, but for bigger chunks this is slightly better.

split_line <- function(x) {
  stri_sub(x, seq.int(1L, stri_length(x), 109L), length = 109L)
}

走过海棠暮 2024-08-28 14:13:19

I set out looking for a vectorised solution to this, in order to avoid
lapply()ing one of the single string solutions across long vectors. Failing
to find an existing solution, I somehow fell down a rabbit hole of
painstakingly writing one in C. It ended up hilariously complicated compared
to the many one-line R solutions shown here (no thanks to me deciding to also
want to handle Unicode strings to match the R versions), but I thought I’d
share the result, in case it somehow someday helps somebody. Here’s what
eventually became of that:

#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>

// Find the width (in bytes) of a UTF-8 character, given its first byte
size_t utf8charw(char b) {
  if (b == 0x00) return 0;
  if ((b & 0x80) == 0x00) return 1;
  if ((b & 0xe0) == 0xc0) return 2;
  if ((b & 0xf0) == 0xe0) return 3;
  if ((b & 0xf8) == 0xf0) return 4;
  return 1; // Really an invalid character, but move on
}

// Find the number of UTF-8 characters in a string
size_t utf8nchar(const char* str) {
  size_t nchar = 0;
  while (*str != '\0') {
    str += utf8charw(*str); nchar++;
  }
  return nchar;
}

SEXP C_str_chunk(SEXP x, SEXP size_) {
  // Allocate a list to store the result
  R_xlen_t n = Rf_xlength(x);
  SEXP result = PROTECT(Rf_allocVector(VECSXP, n));

  int size = Rf_asInteger(size_);

  for (R_xlen_t i = 0; i < n; i++) {
    const char* str = Rf_translateCharUTF8(STRING_ELT(x, i));

    // Figure out number of chunks
    size_t nchar = utf8nchar(str);
    size_t nchnk = nchar / size + (nchar % size != 0);
    SEXP chunks = PROTECT(Rf_allocVector(STRSXP, nchnk));

    for (size_t j = 0, nbytes = 0; j < nchnk; j++, str += nbytes) {
      // Find size of next chunk in bytes
      nbytes = 0;
      for (int cp = 0; cp < size; cp++) {
        nbytes += utf8charw(str[nbytes]);
      }
      
      // Assign to chunks vector as an R string
      SET_STRING_ELT(chunks, j, Rf_mkCharLenCE(str, nbytes, CE_UTF8));
    }

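    SET_VECTOR_ELT(result, i, chunks);
    // chunks is now reachable from result, so release it here instead of
    // letting the protection stack grow with the length of x
    UNPROTECT(1);
  }

  // Clean up: only result is still protected
  UNPROTECT(1);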

  return result;
}
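
The byte-width bookkeeping in the C code exists because a UTF-8 character can occupy more than one byte, while the R functions used elsewhere in this thread count characters. A small R sketch of that difference follows. The example string is an assumption chosen for illustration and does not come from the original post.

```r
# "h\u00e9llo" contains one character that takes two bytes in UTF-8
s <- enc2utf8("h\u00e9llo")

n_chars <- nchar(s, type = "chars")   # number of characters
n_bytes <- nchar(s, type = "bytes")   # number of bytes in the encoding
n_chars
n_bytes

# substring() works in characters, so a chunk never cuts a multi-byte
# character in half; the + 1 covers the odd-length tail
chunks <- substring(s, seq(1, n_chars, 2), seq(2, n_chars + 1, 2))
chunks
```

A C implementation has to reproduce this character-level behaviour by hand, which is what utf8charw and utf8nchar are for.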

I then put this monstrosity into a file called str_chunk.c, and compiled with R CMD SHLIB str_chunk.c.
To try it out, we need some set-up on the R side:

str_chunk <- function(x, n) {
  .Call("C_str_chunk", x, as.integer(n))
}

# The (currently) accepted answer
str_chunk_one <- function(x, n) {
  substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n))
}

dyn.load("str_chunk.dll")

So what we’ve achieved with the C version is to take a vector input and return a list, with one character vector of chunks per input string:

str_chunk(rep("0123456789AB", 2), 2)
#> [[1]]
#> [1] "01" "23" "45" "67" "89" "AB"
#> 
#> [[2]]
#> [1] "01" "23" "45" "67" "89" "AB"

Now off we go with benchmarking.

We start off strong with a 200x improvement for a long(ish) vector of
short strings:

x <- rep("0123456789AB", 1000)
microbenchmark::microbenchmark(
  accepted = lapply(x, str_chunk_one, 2),
  str_chunk(x, 2)
) |> print(unit = "relative")
#> Unit: relative
#>             expr      min       lq     mean  median       uq      max neval
#>         accepted 229.5826 216.8246 182.5449 203.785 182.3662 25.88823   100
#>  str_chunk(x, 2)   1.0000   1.0000   1.0000   1.000   1.0000  1.00000   100

… which then shrinks to a distinctly less impressive 3x improvement for
large strings.

x <- rep(strrep("0123456789AB", 1000), 10)
microbenchmark::microbenchmark(
  accepted = lapply(x, str_chunk_one, 2),
  str_chunk(x, 2)
) |> print(unit = "relative")
#> Unit: relative
#>             expr     min       lq     mean   median       uq      max neval
#>         accepted 2.77981 2.802641 3.304573 2.787173 2.846268 13.62319   100
#>  str_chunk(x, 2) 1.00000 1.000000 1.000000 1.000000 1.000000  1.00000   100

dyn.unload("str_chunk.dll")

So, was it worth it? Well, absolutely not, considering how long it took to
actually get it working properly. But if this were in a package, it would have
saved quite a lot of time in my use-case (short strings, long vectors).

紫竹語嫣☆ 2024-08-28 14:13:19

Here is one option using stringi::stri_sub(). Try:

x <- "xxyyxyxy"
stringi::stri_sub(x, seq(1, stringi::stri_length(x), by = 2), length = 2)
# [1] "xx" "yy" "xy" "xy"