How to vectorize R strsplit?
When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process, so that the function produces the correct element in the list for each element of the input?
For example, to count the lengths of words in a character vector:
> words <- c("a", "quick", "brown", "fox")
> length(strsplit(words, ""))
[1] 4  # The number of words (length of the list)
> length(strsplit(words, "")[[1]])
[1] 1  # The length of the first word only
> sapply(words, function(x) length(strsplit(x, "")[[1]]))
    a quick brown   fox
    1     5     5     3
# Success, but potentially very slow
Ideally, something like length(strsplit(words,"")[[.]]) where . is interpreted as being the relevant part of the input vector.
Comments (1)
In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead.
More generally, take advantage of the fact that strsplit returns a list and use lapply.
Or else use an l*ply family function from plyr.
Edit:
In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:
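The loading code from the answer is not preserved in this copy. As a rough sketch of how a word vector could be built from raw text, the snippet below splits two lines holding the novel's opening sentence. The sample text, the names text_lines and all_words, and the splitting pattern are all assumptions made for illustration, not the answer's own code.

```r
# Stand-in text (assumption): the opening sentence of Ulysses, split over two lines.
text_lines <- c(
  "Stately, plump Buck Mulligan came from the stairhead,",
  "bearing a bowl of lather on which a mirror and a razor lay crossed."
)

# strsplit() is vectorized over its input and returns one list element per line.
# Splitting on runs of non-letters removes whitespace and punctuation.
tokens <- unlist(strsplit(text_lines, "[^[:alpha:]]+"))

# Drop empty strings, which appear when a line begins with a separator.
all_words <- tokens[nzchar(tokens)]

head(all_words)
```

Flattening with unlist() is what turns the per-line list into a single character vector that vectorized functions such as nchar can consume directly.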
Now that I have all the words, we can do our counts:
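The timing code is likewise missing from this copy. A minimal sketch of such a comparison, assuming a synthetic word vector in place of the real text, could look like the following. The all_words vector and its size are hypothetical, and the timings depend on the machine and R version, so no figures are quoted.

```r
# Hypothetical stand-in for the full word list: a small vector repeated many times.
all_words <- rep(c("a", "quick", "brown", "fox"), times = 25000)

# system.time() evaluates the expression and reports timings in seconds.
t_nchar <- system.time(n1 <- nchar(all_words))
t_lapply <- system.time(n2 <- unlist(lapply(strsplit(all_words, ""), length)))
t_sapply <- system.time(
  n3 <- sapply(all_words, function(x) length(strsplit(x, "")[[1]]),
               USE.NAMES = FALSE)
)

# Elapsed time for each approach.
print(c(nchar  = t_nchar[["elapsed"]],
        lapply = t_lapply[["elapsed"]],
        sapply = t_sapply[["elapsed"]]))

# Summary of the counts, to compare results across approaches.
summary(n1)
```

The sapply version calls strsplit once per word, while the other two call it at most once for the whole vector, which is the likely source of the difference.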
The vectorized function and lapply are considerably faster than the original sapply version. All solutions return the same answer (as seen by the summary output).
Apparently the latest version of plyr is faster (this is using a slightly older version).
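The approaches named in this answer can be sketched side by side on the vector from the question. This is an illustrative reconstruction, not the answer's original code. The lengths() call is an addition not mentioned in the answer (it is part of base R from version 3.2.0), and the plyr call runs only if that package happens to be installed.

```r
words <- c("a", "quick", "brown", "fox")

# 1. Fully vectorized: nchar() handles the whole character vector at once.
n_vectorized <- nchar(words)

# 2. strsplit() returns a list with one element per input string;
#    lapply() measures each element and unlist() flattens the result.
n_lapply <- unlist(lapply(strsplit(words, ""), length))

# 3. Addition (not in the answer): lengths() does the same without lapply().
n_lengths <- lengths(strsplit(words, ""))

# 4. The question's sapply() version; unname() drops the word labels.
n_sapply <- unname(sapply(words, function(x) length(strsplit(x, "")[[1]])))

# 5. plyr's laply(), guarded so the script still runs without plyr.
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::laply(strsplit(words, ""), length))
}

# The base R approaches agree on the character counts.
stopifnot(identical(n_vectorized, n_lapply),
          identical(n_vectorized, n_lengths),
          identical(n_vectorized, n_sapply))
print(n_vectorized)
```

When only character counts are needed, nchar avoids building the intermediate list altogether. The list-based variants are useful when the split pieces themselves are needed afterwards.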