如何重写我的 R 代码以进行多核处理？

发布于 2024-11-29 16:39:45 字数 2191 浏览 9 评论 0原文

我有需要进入“并行化”阶段的 R 代码。我是新手，所以如果我使用了错误的术语，请原谅我。我有一个流程，只需一次一个人地处理一个人，然后最后对每个人进行平均。对于每个人来说，这个过程都是完全相同的（它是布朗桥），我只需为超过 300 个人执行此操作。所以，我希望这里有人知道如何更改我的代码以便生成它？或并行化？或者不管怎么说，确保我现在可以使用的 48 个 CPU 可以帮助减少用我的小笔记本电脑计算这个问题所需的 58 天。在我的脑海里，我只会将 1 个个体发送到一个处理器。让它运行脚本，然后发送另一个脚本......如果有意义的话。

下面是我的代码。我试图在其中发表评论，并指出我认为代码需要更改的地方。

for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL 

#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.  
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.  
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS 
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.

    IndivData = MovData[MovData$ID==IDNames[n],]
    IndivData = IndivData[1:(nrow(IndivData)-1),]
    if (UseTimeWindow==T){
      IndivDates = dates[dates$ID==IDNames[n],]
      IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
    }
    IndivData$TimeDif[nrow(IndivData)]=NA

    ########################
#THIS IS THE PROCESS WHERE I THINK I NEED THAT HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT

    BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
    time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
    area.grid = Grid, time.step = 0.1)

  #############################
  # BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.  
  #I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK 
  #WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.  

    if(n==1){   #creating a data fram with the x, y, and probabilities for the first individual
      BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
      BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
      colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
    } else {                #For every other individual just add the new information to the dataframe
      BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
      colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
    }# end if  


    } #end loop through individuals

原文

I have R code that I need to get to A "parallelization" stage. Im new at this so please forgive me if I use the wrong terms. I have a process that just has to chug through individual by individual one at a time and then average across individuals in the end. The process is the exact same for each individual (its a Brownian Bridge), I just have to do this for >300 individuals. So, I was hoping someone here might know how to change my code so that it can be spawned? or parallelized? or whatever the word is to make sure that the 48 CPU's I now have access to can help reduce the 58 days it will take to compute this with my little laptop. In my head I would just send out 1 individual to one processor. Have it run through the script and then send another one....if that makes sense.

Below is my code. I have tried to comment in it and have indicated where I think the code needs to be changed.

for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL 

#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.  
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.  
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS 
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.

    IndivData = MovData[MovData$ID==IDNames[n],]
    IndivData = IndivData[1:(nrow(IndivData)-1),]
    if (UseTimeWindow==T){
      IndivDates = dates[dates$ID==IDNames[n],]
      IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
    }
    IndivData$TimeDif[nrow(IndivData)]=NA

    ########################
#THIS IS THE PROCESS WHERE I THINK I NEED THAT HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT

    BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
    time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
    area.grid = Grid, time.step = 0.1)

  #############################
  # BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.  
  #I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK 
  #WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.  

    if(n==1){   #creating a data fram with the x, y, and probabilities for the first individual
      BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
      BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
      colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
    } else {                #For every other individual just add the new information to the dataframe
      BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
      colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
    }# end if  


    } #end loop through individuals

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

江湖彼岸 2024-12-06 16:39:45

也不知道为什么这被否决了。我认为foreach 包就是您想要的。前几个 pdf 文件包含非常清晰的有用信息。基本上将你想要为每个人做的事情写成一个函数。然后使用 foreach 将一个人的数据发送到一个节点来运行该函数（同时将另一个人发送到另一个节点等），然后使用 rbind 之类的东西编译所有结果。我已经使用过几次，效果很好。

编辑：我不想重新编写你的代码，因为我认为鉴于你已经做到了这一点，你将很容易掌握将其包装到函数中然后使用单行 foreach 的技能。

编辑2：评论太长，无法回复您。

我想既然你已经掌握了代码，你就可以将它放入一个函数中:)如果你仍在研究这个，那么考虑编写一个 for 循环来循环你的主题并做可能会有所帮助该主题所需的计算。然后，for 循环就是您在函数中想要的。我认为在您的代码中，一切都取决于“area.grid”。然后您可以摆脱大部分 [n]，因为每次迭代数据只是一次子集。

也许：

pernode <- function(MovData) {
    IndivData = MovData[MovData$ID==IDNames[i],]
    IndivData = IndivData[1:(nrow(IndivData)-1),]
    if (UseTimeWindow==T){
                         IndivDates = dates[dates$ID==IDNames,]
                         IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]
                         &IndivData$DateTime<IndivDates$End[1],]
                         }
    IndivData$TimeDif[nrow(IndivData)]=NA

    BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
    time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
    area.grid = Grid, time.step = 0.1)

return(BBMM)
}

然后是这样的：

library(doMC)
library(foreach)
registerDoMC(cores=48) # or perhaps a few less than all you have

system.time(
  output <- foreach(i = 1:length(IDNames)), .combine = "rbind", .multicombine=T,
 .inorder = FALSE) %dopar% {pernode(i)}
)

如果没有一些测试数据，很难说是否是这样，让我知道你进展如何。

Not sure why this has been voted down either. I think the foreach package is what you're after. Those first few pdfs have very clear useful information in them. Basically write what you want done for each person as a function. Then use foreach to send the data for one person out to a node to run the function (while sending another persons to another node etc) and then it compiles all the results using something like rbind. I've used this a few times with great results.

Edit: I didn't look to rework your code as I figure given you've got that far you'll easily have the skills to wrap it into a function and then use the one liner foreach.

Edit 2: This was too long for a comment to reply to you.

I thought since you had got that far with the code that you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop to loop over your subjects and do the calculations required for that subject. Then, that for loop is what you want in your function. I think in your code that is everything down to 'area.grid'. Then you can get rid of most of your [n]'s since the data is only subset once per iteration.

Perhaps:

pernode <- function(MovData) {
    IndivData = MovData[MovData$ID==IDNames[i],]
    IndivData = IndivData[1:(nrow(IndivData)-1),]
    if (UseTimeWindow==T){
                         IndivDates = dates[dates$ID==IDNames,]
                         IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]
                         &IndivData$DateTime<IndivDates$End[1],]
                         }
    IndivData$TimeDif[nrow(IndivData)]=NA

    BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
    time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
    area.grid = Grid, time.step = 0.1)

return(BBMM)
}

Then something like:

library(doMC)
library(foreach)
registerDoMC(cores=48) # or perhaps a few less than all you have

system.time(
  output <- foreach(i = 1:length(IDNames)), .combine = "rbind", .multicombine=T,
 .inorder = FALSE) %dopar% {pernode(i)}
)

Hard to say whether that is it without some test data, let me know how you get on.

回复收藏 0 原文

彼岸花ソ最美的依靠 2024-12-06 16:39:45

这是一个一般性的例子，因为我没有耐心阅读你的所有代码。将其传播到多个处理器的最快方法之一是使用 multicore 库和 mclapply（lapply 的并行版本）通过函数推送列表（列表中的各个项目将是您案例中 300 多个人中每个人的数据框）。

例子：

library(multicore)
result=mclapply(data_list, your_function,mc.preschedule=FALSE, mc.set.seed=FALSE)

This is a general example since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the multicore library and the mclapply (a parallelized version of lapply) to push a list (individual items on the list would be dataframes for each of the 300+ individuals in your case) through a function.

Example:

library(multicore)
result=mclapply(data_list, your_function,mc.preschedule=FALSE, mc.set.seed=FALSE)

回复收藏 0 原文

那一片橙海， 2024-12-06 16:39:45

据我了解您的描述，您可以访问分布式计算机集群。所以多核包将无法工作。你必须使用 Rmpi、snow 或 foreach。根据您现有的循环结构，我建议使用 foreach 和 doSnow 包。但你的代码看起来就像你有很多数据。您可能必须检查以减少将发送到节点的数据（仅需要的数据）。

回复收藏 0 原文

~没有更多了~

关于作者

半夏半凉

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

如何重写我的 R 代码以进行多核处理？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

櫻之舞

弥枳

m2429

寻找一个思念的角度

野却迷人

我怀念的。

友情链接

如何重写我的 R 代码以进行多核处理？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

关于作者

相关话题

热门标签

推荐作者

櫻之舞

弥枳

m2429

寻找一个思念的角度

野却迷人

我怀念的。

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。