如何在Julia中声明共享数据框以进行并行计算

发布于 2025-02-13 03:09:55 字数 904 浏览 1 评论 0原文

我在数据框架df上进行了大型仿真，我试图并行化并将模拟结果保存在名为simulation_Results的dataframe中。

并行循环工作正常。问题是，如果我要将结果存储在数组中，则在循环之前将其声明为sharedArray。我不知道如何将simulation_results声明为“共享数据框架”，所有处理器无处可获得并可以修改。

代码段如下：

addprocs(length(Sys.cpu_info()))

@everywhere begin
  using <required packages>

  df = CSV.read("/path/data.csv", DataFrame)

  simulation_results = similar(df, 0) #I need to declare this as shared and modifiable by all processors 
  
  nsims = 100000

end


@sync @distributed for sim in 1:nsims
    nsim_result = similar(df, 0)
    <the code which for one simulation stores the results in nsim_result >
    append!(simulation_results, nsim_result)
end

问题是，由于simulation_Results未声明为处理器共享和修改，因此在循环运行后，它基本上产生一个空数据框，如@中所编码。到处simulation_Results =相似（DF，0）。

真的很感谢您的任何帮助！谢谢！

原文

I have a large simulation on a DataFrame df which I am trying to parallelize and save the results of the simulations in a DataFrame called simulation_results.

The parallelization loop is working just fine. The problem is that if I were to store the results in an array I would declare it as a SharedArray before the loop. I don't know how to declare simulation_results as a "shared DataFrame" which is available everywhere to all processors and can be modified.

A code snippet is as follows:

addprocs(length(Sys.cpu_info()))

@everywhere begin
  using <required packages>

  df = CSV.read("/path/data.csv", DataFrame)

  simulation_results = similar(df, 0) #I need to declare this as shared and modifiable by all processors 
  
  nsims = 100000

end


@sync @distributed for sim in 1:nsims
    nsim_result = similar(df, 0)
    <the code which for one simulation stores the results in nsim_result >
    append!(simulation_results, nsim_result)
end

The problem is that since simulation_results is not declared to be shared and modifiable by processors, after the loop runs, it produces basically an empty DataFrame as was coded in @everywhere simulation_results = similar(df, 0).

Would really appreciate any help on this! Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

辞旧 2025-02-20 03:09:55

朱莉娅（Julia）中分布式计算的模式比您想做的要简单得多。

您的代码应该或多或少地像这样：

df = CSV.read("/path/data.csv", DataFrame)

@everywhere using <required packages>


simulation_results = @distributed (append!) for sim in 1:nsims
    <the code which for one simulation stores the results in nsim_result >
    nsim_result
end

注意，您不需要在朱莉娅集群中的每个过程中加载df，因为@distributed将确保它可读。您不需要@sync都不需要，因为在我的代码中，您将使用聚合函数（append！）。

一个最小的工作示例（使用AddProcs（4））：

@everywhere using Distributed, DataFrames
df = DataFrame(a=1:5,b=rand())

现在结果：

julia> @distributed (append!) for i in 2:5
           DataFrame(bsum=sum(df.b[1:myid()]),c=myid())
       end
4×2 DataFrame
 Row │ bsum      c
     │ Float64   Int64
─────┼─────────────────
   1 │ 0.518127      2
   2 │ 0.777191      3
   3 │ 1.03625       4
   4 │ 1.29532       5

The pattern for distributed computing in Julia is much simpler than what you are trying to do.

Your code should look more or less like this:

df = CSV.read("/path/data.csv", DataFrame)

@everywhere using <required packages>


simulation_results = @distributed (append!) for sim in 1:nsims
    <the code which for one simulation stores the results in nsim_result >
    nsim_result
end

Note you do not need to load df at every process within the Julia cluster since @distributed will make sure it is readable. You do not need to @sync neither because in my code you would use the aggregator function (append!).

A minimal working example (run with addprocs(4)):

@everywhere using Distributed, DataFrames
df = DataFrame(a=1:5,b=rand())

and now the result:

julia> @distributed (append!) for i in 2:5
           DataFrame(bsum=sum(df.b[1:myid()]),c=myid())
       end
4×2 DataFrame
 Row │ bsum      c
     │ Float64   Int64
─────┼─────────────────
   1 │ 0.518127      2
   2 │ 0.777191      3
   3 │ 1.03625       4
   4 │ 1.29532       5

回复收藏 0 原文

可爱咩 2025-02-20 03:09:55

只要您的dataframe df在您处理的条目中是数字，就可以作为矩阵来回传递：

mynames = names(df)
matrix = Matrix(df)

然后将矩阵转换为共享arrix和Compute。然后返回矩阵。

dfprocessed = DataFrame(matrix, mynames)

请注意，如果您的数据帧的数据并不是所有统一类型，则此方法可能会失败。如果全部都是整数或所有浮点，它将最有效。您可能必须先删除非数字列或将其设置为数字级别。

As long as your dataframe df is numeric in the entries you process, you can pass it back and forth as a matrix:

mynames = names(df)
matrix = Matrix(df)

Then convert matrix to to SharedArray and compute. Then back to matrix.

dfprocessed = DataFrame(matrix, mynames)

Note this method may fail if your dataframe's data are not all of uniform type. It would work best if all are integer or all floating point. You might have to first drop non-numeric columns or set those to numeric levels.

回复收藏 0 原文

~没有更多了~