Multithreading in R

Published 2025-01-10 18:54:53

For a project of mine I need to unshorten a long list of URLs (300k+).

I attach the code below:

# This is the function through which I extract the URL(s) from a tweet and
# check whether they redirect back to Twitter or to an external website.
exturl <- function(tweet) {
  text_tw <- tweet
  locunshort2 <- NA # starting value is NA
  words <- unlist(strsplit(text_tw, " "))
  indtext <- which(substr(words, 1, 4) == "http")
  if (length(indtext) > 0) {
    for (i in indtext) {
      url <- words[i]
      tryCatch({
        r <- httr::HEAD(url, httr::timeout(30)) # follow the link
        print(r$url)
        if (!stringr::str_detect(r$url, "twitter")) { # does the resolved URL leave Twitter?
          locunshort2 <- TRUE # the link does not point back to Twitter
        }
      }, error = function(e) {
        # <<- is needed here: a plain <- would only modify a handler-local copy
        locunshort2 <<- "Error" # mark tweets whose URL could not be opened
      })
    }
  }
  if (is.na(locunshort2)) { # neither TRUE nor "Error": all links redirected to Twitter
    locunshort2 <- FALSE
  }
  return(locunshort2)
}
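A note on the tryCatch() pattern used above: the main expression is evaluated in the calling frame, but handler functions get their own scope, so a plain `<-` inside an error handler silently modifies a handler-local copy that is discarded. Besides `<<-`, the more idiomatic alternative is to let the handler's value become tryCatch()'s return value. A minimal base-R sketch (safe_check and its inputs are illustrative, not from the original code):

```r
# tryCatch() returns the value of whichever branch ran, so the handler
# can simply return a sentinel instead of assigning to an outer variable.
safe_check <- function(x) {
  tryCatch({
    if (x < 0) stop("negative input") # stands in for a failed HTTP request
    TRUE                              # stands in for a successful check
  }, error = function(e) "Error")     # handler's value = tryCatch()'s value
}

res <- c(safe_check(1), safe_check(-1))
res  # "TRUE" "Error"  (c() coerces the logical to character)
```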

I tried to use the parLapply function as follows:

library(parallel)

numCores <- detectCores()
cl <- makeCluster(numCores)
df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)

The df$url column is just an initialized empty column, whereas df$text holds the tweets I am analyzing (raw, unprocessed strings of text).

I have two issues with this code:

  1. The parLapply function does not seem to work as intended: if I run it on my first 10 observations I only get FALSE values, whereas with a standard lapply I also get many TRUE values.

  2. Even if we fix the point above, parLapply is very slow. For a similar project in Python I managed to open multiple threads (I could process 1000 observations at a time through the ThreadPool function). Since I am forced to use R, I wonder whether a similar alternative exists (I looked for one without success).
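One likely explanation for issue 1 (a hypothesis, not confirmed in the question): parLapply() workers are fresh R sessions that see neither your global environment nor your attached packages, so bare calls like str_detect() or timeout() fail on the workers, the tryCatch() swallows the error, and every tweet falls through to FALSE. The fix is to load the needed packages on each worker with clusterEvalQ(). A minimal sketch using only the base packages 'parallel' and 'tools':

```r
library(parallel)
library(tools)  # attaches toTitleCase() in the master session only

cl <- makeCluster(2)
# Workers start as clean R sessions: without the next line, the bare
# toTitleCase() call below fails with "could not find function".
clusterEvalQ(cl, library(tools))
res <- unlist(parLapply(cl, c("hello world", "good day"),
                        function(s) toTitleCase(s)))
stopCluster(cl)
res  # "Hello World" "Good Day"
```

For the original exturl, the equivalent would be `clusterEvalQ(cl, { library(httr); library(stringr) })` before the parLapply call (or fully namespace-qualifying every call, as in `httr::timeout`).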

Thank you very much for your support.

P.S.: I am working on Windows.

UPDATE

tweet is supposed to be a string of text that usually contains a URL.

Example:

"Ukraine-Russia talks end and delegations will return home for consultations, officials say, as explosions are heard near Kyiv. Follow live updates:
https://cnn.it/3ps1tBJ"

The function exturl finds the position of the URL and unshortens it. If there are multiple URLs, it unshortens them one at a time. The end goal is to return TRUE if the URL redirects to an external website or just points at another tweet. (I am fairly confident exturl itself works: as I said, it behaves correctly with lapply, but something breaks when using parLapply.)

The dataset itself has just one column containing these strings.

UPDATE 2

I tried running a much more simplified version of the exturl function that just extracts a list of unshortened URLs, and I even ran it on a data frame with a single observation (so technically parLapply isn't really needed). No success; I'm convinced something in parLapply is messing things up.

Below is the simplified exturl:

ext_url <- function(tweet) {
  text_tw <- tweet
  full_url <- list() # initialize an empty list to hold the URLs
  words <- unlist(strsplit(text_tw, " "))
  indtext <- which(substr(words, 1, 4) == "http")
  if (length(indtext) > 0) {
    for (i in indtext) {
      url <- words[i]
      tryCatch({
        r <- httr::HEAD(url, httr::timeout(5)) # follow the link
        full_url <- append(full_url, r$url)
      }, error = function(e) {
        # <<- is needed here: a plain <- would only modify a handler-local copy
        full_url <<- append(full_url, "Error")
      })
    }
  }
  return(full_url)
}



峩卟喜欢 2025-01-17 18:54:53

I sort of solved the problem by switching to the foreach package:

df$url <- foreach(d = iter(df$text, by = 'row'), .combine = rbind,
                  .packages = c("dplyr", "stringr", "httr")) %dopar% {
  exturl(d)
}

It is not as fast as what I could achieve with Python, but it is much faster than a plain lapply. Why parLapply gave me such problems remains to be understood, but as far as I am concerned, problem solved!
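For completeness, %dopar% only runs in parallel once a backend has been registered; otherwise foreach falls back to sequential execution with a warning. A minimal sketch of the setup, assuming the doParallel package is installed (the answer above does not show this step):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)  # without this, %dopar% runs sequentially with a warning
res <- foreach(x = 1:4, .combine = c) %dopar% x^2
stopCluster(cl)
res  # 1 4 9 16
```

Note that the `.packages` argument in the answer's foreach call plays the same role as clusterEvalQ() for parLapply: it loads the listed packages on each worker.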

I sort of solved the problem by switching to the foreach package

df$url <- foreach(d = iter(df$text, by = 'row'), .combine = rbind, .packages = c("dplyr","stringr","httr")) %dopar% {
      exturl(d)
    }

It is not as fast as what I could achieve with Python, yet it is much faster than using the simple lapply. It remains to be understood why parLapply gave me such problems, however as far as I am concerned, problem solved!

你与清晨阳光 2025-01-17 18:54:53

Not sure if this is an R problem: doesn't your code throw 300k HTTP requests, unthrottled(?), at the receiving end from multiple threads (in the parallelized version) but a single IP? If so, the servers might deny what they just barely accept from the slower (lapply) version.

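The throttling concern above can be addressed by processing the URLs in batches and pausing between batches, so the receiving servers see a bounded request rate. A hypothetical sketch (throttle_map is not a real library function, and the stub f stands in for the actual httr::HEAD call):

```r
# Apply f to xs in batches of batch_size, sleeping pause_sec between
# batches to keep the outbound request rate bounded.
throttle_map <- function(xs, f, batch_size = 100, pause_sec = 2) {
  batches <- split(xs, ceiling(seq_along(xs) / batch_size))
  out <- list()
  for (b in batches) {
    out <- c(out, lapply(b, f))
    Sys.sleep(pause_sec) # back off before the next batch
  }
  unlist(out)
}

# Demo with a stub instead of a real HTTP call:
res <- throttle_map(1:5, function(x) x * 10, batch_size = 2, pause_sec = 0)
res  # 10 20 30 40 50
```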
