Multithreading in R
For a project of mine I need to unshorten a long list of URLs (300k+).
I attach the code below:
# This is the function through which I extract a URL and check whether it
# redirects towards Twitter or to another external website.
exturl <- function(tweet){
  text_tw <- tweet
  locunshort2 <- NA # starting value is NA
  indtext <- which(substr(unlist(strsplit(text_tw, " ")), 1, 4) == "http")
  if (length(indtext) > 0){
    for (i in indtext){
      skip_to_next <- FALSE
      url <- unlist(strsplit(text_tw, " "))[i]
      tryCatch({
        r <- httr::HEAD(url, timeout(30)) # open the link
        print(r$url)
        if (str_detect(r$url, "twitter") == FALSE){ # check whether the URL simply redirects towards Twitter or not
          locunshort2 <- TRUE # if the resolved link is not Twitter, set it to TRUE
        }
      }, error = function(e){
        skip_to_next <- TRUE
        locunshort2 <- 'Error' # set it to 'Error' if we could not open the URL
      })
      if (skip_to_next){ next }
    }
  }
  if (is.na(locunshort2)){ # if the variable is neither TRUE nor 'Error', it is FALSE (i.e. all links redirected to Twitter)
    locunshort2 <- FALSE
  }
  return(locunshort2)
}
I tried to use the parLapply function as follows:
library(parallel) # for detectCores(), makeCluster(), parLapply()

numCores <- detectCores()
cl <- makeCluster(numCores)
df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)
The df$url column is just an initialized empty column, whereas df$text holds the tweets I am analyzing (raw, unprocessed strings of text).
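One detail worth spelling out about this setup (a general fact about the parallel package, not something from my original attempt): makeCluster starts fresh R sessions, so packages attached in the main session are not attached on the workers. A minimal sketch of attaching them explicitly would be:

library(parallel)

cl <- makeCluster(detectCores())
# each worker is a fresh R session: attach the packages exturl relies on
# (httr for the bare timeout() call, stringr for str_detect())
clusterEvalQ(cl, {
  library(httr)
  library(stringr)
})
df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)

(parLapply ships the exturl function itself to the workers automatically; only the packages it calls into need attaching there.)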
I have two issues with this code:
1. The parLapply function does not seem to work as intended: if I run it on my first 10 observations I only get FALSE values, whereas with a standard lapply I also get many TRUE values.
2. Even if we manage to fix the point above, the process with parLapply is very slow. For a similar project in Python I managed to open multiple threads (I could process 1000 observations at a time through the ThreadPool function). Since I am forced to use R, I was wondering whether a similar alternative exists (I looked for one without success); one possible direction is sketched below.
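One direction that might be the closest analogue (untested here, and assuming the future.apply package is installed; the worker count is illustrative) is a multisession plan, which also works on Windows:

library(future.apply)

plan(multisession, workers = 8) # illustrative worker count
res <- future_lapply(df$text, exturl,
                     future.packages = c("httr", "stringr"))
plan(sequential)

future_lapply ships exturl to the workers automatically, and the future.packages argument attaches httr and stringr there.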
Thank you for your support.
P.S.: I am working on Windows
UPDATE
tweet is supposed to be a string of text, usually with a URL inside it.
Example:
"Ukraine-Russia talks end and delegations will return home for consultations, officials say, as explosions are heard near Kyiv. Follow live updates:
https://cnn.it/3ps1tBJ"
The function exturl finds the position of each URL and unshortens it; if there are multiple URLs it unshortens them one at a time. The end goal is a final value of TRUE if a URL redirects towards an external website, and FALSE if it just points at another tweet (I'm fairly confident exturl itself works, and as I said it behaves correctly with lapply, but with parLapply something no longer works).
The dataset itself has just a column containing these strings.
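For illustration, a serial call on the example tweet above (this is the lapply-style usage that works; the expected result is an assumption based on cnn.it resolving to a cnn.com address):

library(httr)    # provides the bare timeout() call used inside exturl
library(stringr) # provides str_detect()

tweets <- c("Ukraine-Russia talks end and delegations will return home for consultations, officials say, as explosions are heard near Kyiv. Follow live updates: https://cnn.it/3ps1tBJ")
unlist(lapply(tweets, exturl))
# expected: TRUE, assuming https://cnn.it/3ps1tBJ unshortens to a cnn.com
# URL (no "twitter" in the resolved address)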
UPDATE 2:
I tried to run a much more simplified version of the exturl function that just extracts a list of unshortened URLs, and I even ran it on a dataframe with just one observation (so technically parLapply isn't really needed). No success; I'm convinced something in parLapply is going wrong.
Below is the simplified exturl:
ext_url <- function(tweet){
  text_tw <- tweet
  full_url <- list() # initialize an empty list to hold the URLs
  indtext <- which(substr(unlist(strsplit(text_tw, " ")), 1, 4) == "http")
  if (length(indtext) > 0){
    for (i in indtext){
      skip_to_next <- FALSE
      url <- unlist(strsplit(text_tw, " "))[i]
      tryCatch({
        r <- httr::HEAD(url, timeout(5)) # open the link
        full_url <- append(full_url, r$url)
      }, error = function(e){
        skip_to_next <- TRUE
        full_url <- append(full_url, 'Error')
      })
      if (skip_to_next){ next }
    }
  }
  return(full_url)
}
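For concreteness, the single-observation comparison described above would look roughly like this (a reconstruction, with df standing for the one-row dataframe):

# serial version: works as expected
lapply(df$text[1], ext_url)

# parallel version on the very same observation: this is where it fails
library(parallel)
cl <- makeCluster(2)
parLapply(cl, df$text[1], ext_url)
stopCluster(cl)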
Comments (2)
I sort of solved the problem by switching to the foreach package. It is not as fast as what I could achieve with Python, yet it is much faster than the simple lapply. Why parLapply gave me such problems remains to be understood, but as far as I am concerned, problem solved!
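A minimal sketch of that kind of foreach setup (a reconstruction, assuming the doParallel backend; the .packages argument attaches httr and stringr on each worker):

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(tw = df$text, .combine = c,
               .packages = c("httr", "stringr")) %dopar% exturl(tw)
stopCluster(cl)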
Not sure if this is an R problem: doesn't your code throw 300k HTTP requests, unthrottled(?), at the receiving end, from multiple threads (the parallelized version) but a single IP? If so, the server might deny what it perhaps only just accepts from the slower lapply version.