How to vectorize R strsplit?

Posted on 2024-09-06 01:43:13

When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process - that is, the function produces the correct element in the list for each of the elements of the input?

For example, to count the lengths of words in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, I would like something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant element of the input vector.

苄①跕圉湢 2024-09-13 01:43:13

In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead:

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
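
As a hedged sketch of the same idea using only base R: vapply states the expected result type, so no as.numeric conversion is needed, and lengths() (available in base R 3.2.0 and later) returns the length of every list element in one call. The variable names pieces, len_vapply, len_lengths and last_char are illustrative and not part of the original answer.

```r
words <- c("a", "quick", "brown", "fox")
pieces <- strsplit(words, "")   # list with one character vector per word

# vapply declares the result type up front, so it returns an integer vector
# directly and raises an error if any element is not a single integer
len_vapply <- vapply(pieces, length, integer(1))

# lengths() returns the length of every element of a list
len_lengths <- lengths(pieces)

# The same pattern covers other per-element summaries, e.g. the last character.
# Assumption: no empty strings in the input (an empty piece would make the
# function return character(0), which vapply rejects).
last_char <- vapply(pieces, function(p) p[length(p)], character(1))
```

The pattern is to call strsplit once on the whole vector and then apply the per-element function to the resulting list, rather than calling strsplit inside the loop for each element.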

Or else use an l*ply family function from plyr. For instance:

> library(plyr)
> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that I have all the words, we can do our counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03

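The vectorized nchar (0.04 s elapsed) and the lapply version (0.8 s) are considerably faster than the original sapply version (2.73 s), while the plyr functions were slower in this run (17.30 s for laply, 8.03 s for ldply). All solutions return the same answer (as seen in the summary output).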

Apparently the latest version of plyr is faster (this is using a slightly older version).
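
For readers who want to repeat the comparison without downloading the text, here is a minimal sketch that uses a synthetic word vector instead of Ulysses. The sample size, the seed and the names sample_words, r1, r2 and r3 are assumptions made for illustration; timings will differ by machine and R version, so none are shown.

```r
set.seed(1)  # make the synthetic sample reproducible

# Hypothetical stand-in for the Ulysses word vector:
# 10,000 random lowercase "words" of 1 to 10 letters each
sample_words <- vapply(
  sample(1:10, 1e4, replace = TRUE),
  function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
  character(1)
)

# Original approach: strsplit is called once per word inside sapply
t_sapply <- system.time(
  r1 <- sapply(sample_words, function(x) length(strsplit(x, "")[[1]]),
               USE.NAMES = FALSE)
)

# strsplit called once on the whole vector, then lapply over the list
t_lapply <- system.time(
  r2 <- as.numeric(lapply(strsplit(sample_words, ""), length))
)

# Fully vectorized function
t_nchar <- system.time(r3 <- nchar(sample_words))

# All three approaches should agree on the counts
stopifnot(all(r1 == r2), all(r2 == r3))

# Elapsed seconds for each approach
print(c(sapply = t_sapply[["elapsed"]],
        lapply = t_lapply[["elapsed"]],
        nchar  = t_nchar[["elapsed"]]))
```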
