Multithreading in R

Published 2025-01-10 18:54:53

For a project of mine I need to unshorten a long list of URLs (300k+).

I attach the code below:

# This is the function through which I extract the URL(s) from a tweet and
# check whether they redirect back to Twitter or to an external website.
exturl <- function(tweet) {
  text_tw <- tweet
  locunshort2 <- NA # starting value is NA
  words <- unlist(strsplit(text_tw, " "))
  indtext <- which(substr(words, 1, 4) == "http")
  if (length(indtext) > 0) {
    for (i in indtext) {
      url <- words[i]
      tryCatch({
        r <- httr::HEAD(url, httr::timeout(30)) # follow the link
        print(r$url)
        if (!stringr::str_detect(r$url, "twitter")) { # does the resolved URL leave Twitter?
          locunshort2 <- TRUE # the link does not point back to Twitter
        }
      }, error = function(e) {
        # <<- is needed here: a plain <- would only modify a handler-local copy
        locunshort2 <<- "Error" # mark tweets whose URL could not be opened
      })
    }
  }
  if (is.na(locunshort2)) { # neither TRUE nor "Error": all links redirected to Twitter
    locunshort2 <- FALSE
  }
  return(locunshort2)
}
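A note on the tryCatch() pattern used above: the main expression is evaluated in the calling frame, but handler functions get their own scope, so a plain `<-` inside an error handler silently modifies a handler-local copy that is discarded. Besides `<<-`, the more idiomatic alternative is to let the handler's value become tryCatch()'s return value. A minimal base-R sketch (safe_check and its inputs are illustrative, not from the original code):

```r
# tryCatch() returns the value of whichever branch ran, so the handler
# can simply return a sentinel instead of assigning to an outer variable.
safe_check <- function(x) {
  tryCatch({
    if (x < 0) stop("negative input") # stands in for a failed HTTP request
    TRUE                              # stands in for a successful check
  }, error = function(e) "Error")     # handler's value = tryCatch()'s value
}

res <- c(safe_check(1), safe_check(-1))
res  # "TRUE" "Error"  (c() coerces the logical to character)
```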

I tried to use the parLapply function as follows:

library(parallel)

numCores <- detectCores()
cl <- makeCluster(numCores)
df$url <- unlist(parLapply(cl, df$text, exturl))
stopCluster(cl)

The df$url column is just an initialized empty column, whereas df$text holds the tweets I am analyzing (raw, unprocessed strings of text).

I have two issues with this code:

  1. The parLapply function does not seem to work as intended: if I run it on my first 10 observations I only get FALSE values, whereas with a standard lapply I also get many TRUE values.

  2. Even if we fix the point above, parLapply is very slow. For a similar project in Python I managed to open multiple threads (I could process 1000 observations at a time through the ThreadPool function). Since I am forced to use R, I wonder whether a similar alternative exists (I looked for one without success).
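One likely explanation for issue 1 (a hypothesis, not confirmed in the question): parLapply() workers are fresh R sessions that see neither your global environment nor your attached packages, so bare calls like str_detect() or timeout() fail on the workers, the tryCatch() swallows the error, and every tweet falls through to FALSE. The fix is to load the needed packages on each worker with clusterEvalQ(). A minimal sketch using only the base packages 'parallel' and 'tools':

```r
library(parallel)
library(tools)  # attaches toTitleCase() in the master session only

cl <- makeCluster(2)
# Workers start as clean R sessions: without the next line, the bare
# toTitleCase() call below fails with "could not find function".
clusterEvalQ(cl, library(tools))
res <- unlist(parLapply(cl, c("hello world", "good day"),
                        function(s) toTitleCase(s)))
stopCluster(cl)
res  # "Hello World" "Good Day"
```

For the original exturl, the equivalent would be `clusterEvalQ(cl, { library(httr); library(stringr) })` before the parLapply call (or fully namespace-qualifying every call, as in `httr::timeout`).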

Thank you very much for your support.

P.S.: I am working on Windows.

UPDATE

tweet is supposed to be a string of text that usually contains a URL.

Example:

"Ukraine-Russia talks end and delegations will return home for consultations, officials say, as explosions are heard near Kyiv. Follow live updates:
https://cnn.it/3ps1tBJ"

The function exturl finds the position of the URL and unshortens it. If there are multiple URLs, it unshortens them one at a time. The end goal is to return TRUE if the URL redirects to an external website or just points at another tweet. (I am fairly confident exturl itself works: as I said, it behaves correctly with lapply, but something breaks when using parLapply.)

The dataset itself has just one column containing these strings.

UPDATE 2

I tried running a much more simplified version of the exturl function that just extracts a list of unshortened URLs, and I even ran it on a data frame with a single observation (so technically parLapply isn't really needed). No success; I'm convinced something in parLapply is messing things up.

Below is the simplified exturl:

ext_url <- function(tweet) {
  text_tw <- tweet
  full_url <- list() # initialize an empty list to hold the URLs
  words <- unlist(strsplit(text_tw, " "))
  indtext <- which(substr(words, 1, 4) == "http")
  if (length(indtext) > 0) {
    for (i in indtext) {
      url <- words[i]
      tryCatch({
        r <- httr::HEAD(url, httr::timeout(5)) # follow the link
        full_url <- append(full_url, r$url)
      }, error = function(e) {
        # <<- is needed here: a plain <- would only modify a handler-local copy
        full_url <<- append(full_url, "Error")
      })
    }
  }
  return(full_url)
}



峩卟喜欢 2025-01-17 18:54:53

I sort of solved the problem by switching to the foreach package:

df$url <- foreach(d = iter(df$text, by = 'row'), .combine = rbind,
                  .packages = c("dplyr", "stringr", "httr")) %dopar% {
  exturl(d)
}

It is not as fast as what I could achieve with Python, but it is much faster than a plain lapply. Why parLapply gave me such problems remains to be understood, but as far as I am concerned, problem solved!
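For completeness, %dopar% only runs in parallel once a backend has been registered; otherwise foreach falls back to sequential execution with a warning. A minimal sketch of the setup, assuming the doParallel package is installed (the answer above does not show this step):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)  # without this, %dopar% runs sequentially with a warning
res <- foreach(x = 1:4, .combine = c) %dopar% x^2
stopCluster(cl)
res  # 1 4 9 16
```

Note that the `.packages` argument in the answer's foreach call plays the same role as clusterEvalQ() for parLapply: it loads the listed packages on each worker.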

I sort of solved the problem by switching to the foreach package

df$url <- foreach(d = iter(df$text, by = 'row'), .combine = rbind, .packages = c("dplyr","stringr","httr")) %dopar% {
      exturl(d)
    }

It is not as fast as what I could achieve with Python, yet it is much faster than using the simple lapply. It remains to be understood why parLapply gave me such problems, however as far as I am concerned, problem solved!

你与清晨阳光 2025-01-17 18:54:53

Not sure if this is an R problem: doesn't your code throw 300k HTTP requests, unthrottled(?), at the receiving end from multiple threads (in the parallelized version) but a single IP? If so, the servers might deny what they just barely accept from the slower (lapply) version.

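The throttling concern above can be addressed by processing the URLs in batches and pausing between batches, so the receiving servers see a bounded request rate. A hypothetical sketch (throttle_map is not a real library function, and the stub f stands in for the actual httr::HEAD call):

```r
# Apply f to xs in batches of batch_size, sleeping pause_sec between
# batches to keep the outbound request rate bounded.
throttle_map <- function(xs, f, batch_size = 100, pause_sec = 2) {
  batches <- split(xs, ceiling(seq_along(xs) / batch_size))
  out <- list()
  for (b in batches) {
    out <- c(out, lapply(b, f))
    Sys.sleep(pause_sec) # back off before the next batch
  }
  unlist(out)
}

# Demo with a stub instead of a real HTTP call:
res <- throttle_map(1:5, function(x) x * 10, batch_size = 2, pause_sec = 0)
res  # 10 20 30 40 50
```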
