如何重写我的 R 代码以进行多核处理?
我有需要进入“并行化”阶段的 R 代码。我是新手,所以如果我使用了错误的术语,请原谅我。我有一个流程,只需一次一个人地处理一个人,然后最后对每个人进行平均。对于每个人来说,这个过程都是完全相同的(它是布朗桥),我只需为超过 300 个人执行此操作。所以,我希望这里有人知道如何更改我的代码以便生成它?或并行化?或者不管怎么说,确保我现在可以使用的 48 个 CPU 可以帮助减少用我的小笔记本电脑计算这个问题所需的 58 天。在我的脑海里,我只会将 1 个个体发送到一个处理器。让它运行脚本,然后发送另一个脚本......如果有意义的话。
下面是我的代码。我试图在其中发表评论,并指出我认为代码需要更改的地方。
for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL
#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
IndivData = MovData[MovData$ID==IDNames[n],]
IndivData = IndivData[1:(nrow(IndivData)-1),]
if (UseTimeWindow==T){
IndivDates = dates[dates$ID==IDNames[n],]
IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
}
IndivData$TimeDif[nrow(IndivData)]=NA
########################
#THIS IS THE PROCESS WHERE I THINK I NEED THAT HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT
BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
area.grid = Grid, time.step = 0.1)
#############################
# BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
#I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK
#WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.
if(n==1){ #creating a data fram with the x, y, and probabilities for the first individual
BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
} else { #For every other individual just add the new information to the dataframe
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
}# end if
} #end loop through individuals
I have R code that I need to get to A "parallelization" stage. Im new at this so please forgive me if I use the wrong terms. I have a process that just has to chug through individual by individual one at a time and then average across individuals in the end. The process is the exact same for each individual (its a Brownian Bridge), I just have to do this for >300 individuals. So, I was hoping someone here might know how to change my code so that it can be spawned? or parallelized? or whatever the word is to make sure that the 48 CPU's I now have access to can help reduce the 58 days it will take to compute this with my little laptop. In my head I would just send out 1 individual to one processor. Have it run through the script and then send another one....if that makes sense.
Below is my code. I have tried to comment in it and have indicated where I think the code needs to be changed.
for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL
#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
IndivData = MovData[MovData$ID==IDNames[n],]
IndivData = IndivData[1:(nrow(IndivData)-1),]
if (UseTimeWindow==T){
IndivDates = dates[dates$ID==IDNames[n],]
IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
}
IndivData$TimeDif[nrow(IndivData)]=NA
########################
#THIS IS THE PROCESS WHERE I THINK I NEED THAT HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT
BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
area.grid = Grid, time.step = 0.1)
#############################
# BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
#I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK
#WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.
if(n==1){ #creating a data fram with the x, y, and probabilities for the first individual
BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
} else { #For every other individual just add the new information to the dataframe
BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep ="")
}# end if
} #end loop through individuals
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
也不知道为什么这被否决了。我认为foreach 包就是您想要的。前几个 pdf 文件包含非常清晰的有用信息。基本上将你想要为每个人做的事情写成一个函数。然后使用 foreach 将一个人的数据发送到一个节点来运行该函数(同时将另一个人发送到另一个节点等),然后使用 rbind 之类的东西编译所有结果。我已经使用过几次,效果很好。
编辑:我不想重新编写你的代码,因为我认为鉴于你已经做到了这一点,你将很容易掌握将其包装到函数中然后使用单行 foreach 的技能。
编辑2:评论太长,无法回复您。
我想既然你已经掌握了代码,你就可以将它放入一个函数中:)如果你仍在研究这个,那么考虑编写一个 for 循环来循环你的主题并做可能会有所帮助该主题所需的计算。然后,for 循环就是您在函数中想要的。我认为在您的代码中,一切都取决于“area.grid”。然后您可以摆脱大部分 [n],因为每次迭代数据只是一次子集。
也许:
然后是这样的:
如果没有一些测试数据,很难说是否是这样,让我知道你进展如何。
Not sure why this has been voted down either. I think the foreach package is what you're after. Those first few pdfs have very clear useful information in them. Basically write what you want done for each person as a function. Then use foreach to send the data for one person out to a node to run the function (while sending another persons to another node etc) and then it compiles all the results using something like rbind. I've used this a few times with great results.
Edit: I didn't look to rework your code as I figure given you've got that far you'll easily have the skills to wrap it into a function and then use the one liner foreach.
Edit 2: This was too long for a comment to reply to you.
I thought since you had got that far with the code that you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop to loop over your subjects and do the calculations required for that subject. Then, that for loop is what you want in your function. I think in your code that is everything down to 'area.grid'. Then you can get rid of most of your [n]'s since the data is only subset once per iteration.
Perhaps:
Then something like:
Hard to say whether that is it without some test data, let me know how you get on.
这是一个一般性的例子,因为我没有耐心阅读你的所有代码。将其传播到多个处理器的最快方法之一是使用
multicore
库和mclapply
(lapply
的并行版本)通过函数推送列表(列表中的各个项目将是您案例中 300 多个人中每个人的数据框)。例子:
This is a general example since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the
multicore
library and themclapply
(a parallelized version oflapply
) to push a list (individual items on the list would be dataframes for each of the 300+ individuals in your case) through a function.Example:
据我了解您的描述,您可以访问分布式计算机集群。所以多核包将无法工作。你必须使用 Rmpi、snow 或 foreach。根据您现有的循环结构,我建议使用 foreach 和 doSnow 包。但你的代码看起来就像你有很多数据。您可能必须检查以减少将发送到节点的数据(仅需要的数据)。
As I understand you description you have access to a distributed computer cluster. So the package multicore will be not working. You have to use Rmpi, snow or foreach. Based on your existing loop structure I would advice to use the foreach and doSnow package. But your codes looks like as you have a lot of data. You probably have to check to reduce the data (only the required ones) which will be send to the nodes.