Fast 2D binning algorithm in Mathematica

Posted on 2024-12-16 02:36:34

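I'm having some trouble developing a suitably fast binning algorithm in Mathematica. I have a large (~100k elements) data set of the form T={{x1,y1,z1},{x2,y2,z2},....} and I want to bin it into a 2D array of around 100x100 bins, with each bin value given by the sum of the z values that fall into that bin.

Currently I iterate through each element of the table, use Select to pick out which bin it belongs in based on lists of bin boundaries, and add the z value to a list of values occupying that bin. At the end I map Total onto the list of bins to sum their contents (I do it this way because I sometimes want to do other things, such as take the maximum).

I have tried using Gather and other such functions to do this, but the method above was far faster, though perhaps I am using Gather poorly. In any case, the binning still takes a few minutes with my method, and I feel that Mathematica can do better. Does anyone have an efficient algorithm handy?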


波浪屿的海角声 2024-12-23 02:36:35

I intend to do a rewrite of the code below because of Szabolcs' readability concerns. Until then, know that if your bins are regular, and you can use Round, Floor, or Ceiling (with a second argument) in place of Nearest, the code below will be much faster. On my system, it tests faster than the GatherBy solution also posted.

Assuming I understand your requirements, I propose:

data = RandomReal[100, {75, 3}];

bins = {0, 20, 40, 60, 80, 100};

Reap[
  Sow[{#3, #2}, bins ~Nearest~ #] & @@@ data,
  bins,
  Reap[Sow[#, bins ~Nearest~ #2] & @@@ #2, bins, Tr@#2 &][[2]] &
][[2]] ~Flatten~ 1 ~Total~ {3} // MatrixForm

Refactored:

f[bins_] := Reap[Sow[{##2}, bins ~Nearest~ #]& @@@ #, bins, #2][[2]] &

bin2D[data_, X_, Y_] := f[X][data, f[Y][#2, #2~Total~2 &] &] ~Flatten~ 1 ~Total~ {3}

Use:

bin2D[data, xbins, ybins]
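
As a small illustration of the remark about regular bins: the sketch below assumes bins of width 20 starting at 0 (matching bins = {0, 20, ..., 100}) and uses made-up sample values xs; it is not part of any timing comparison. For such bins, Round with a second argument should pick the same bin as Nearest without searching the bin list, while Floor gives the lower edge of the containing interval instead.

```mathematica
(* Assumption: regular bins of width 20; xs are made-up sample values *)
bins = Range[0, 100, 20];
xs = {3.2, 19.9, 57.4, 99.9};

viaNearest = First[Nearest[bins, #]] & /@ xs;  (* search the bin list *)
viaRound = Round[xs, 20];                      (* nearest multiple of 20 *)
viaFloor = Floor[xs, 20];                      (* lower edge of the containing interval *)

viaNearest == viaRound   (* True for these values *)
```

Values that lie exactly halfway between two bins are deliberately left out, since Nearest and Round may treat such ties differently.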


江挽川 2024-12-23 02:36:35

Here's my approach:

data = RandomReal[5, {500000, 3}]; (* 500k values *)

zvalues = data[[All, 3]];

epsilon = 1*^-10; (* prevent 101 index *)

(* rescale and round (x,y) coordinates to index pairs in the 1..100 range *)    
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];

(* approach 1: create bin-matrix first, then fill up elements by adding  zvalues *)
res1 = Module[
    {result = ConstantArray[0, {100, 100}]},
    Do[
      AddTo[result[[##]], zvalues[[i]]] & @@ indexes[[i]], 
      {i, Length[indexes]}
    ];
    result
    ]; // Timing

(* approach 2: gather zvalues by indexes, add them up, convert them to a matrix *)
res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]},
    SparseArray[gb[[All, 1, 1]] -> (Total /@ gb[[All, All, 2]])]
    ]; // Timing

res1 == res2

These two approaches (res1 & res2) can handle 100k and 200k elements per second, respectively, on this machine. Is this sufficiently fast, or do you need to run this whole program in a loop?
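
A minimal sketch of the index computation with explicit per-axis ranges, in case the x and y ranges differ or the bin edges need to be fixed in advance. As far as I know, Rescale applied to the whole two-column matrix uses a single overall minimum and maximum for both columns. binIndex is a hypothetical helper, not part of the code above, and the sample rows and ranges are made up; Clip takes the place of the epsilon trick for values on the upper edge.

```mathematica
(* Hypothetical helper: map values in [lo, hi] to bin indices 1..n *)
binIndex[v_, {lo_, hi_}, n_] := Clip[1 + Floor[n Rescale[v, {lo, hi}]], {1, n}]

(* Made-up sample rows {x, y, z}; x assumed in [0, 5], y assumed in [0, 50] *)
sample = {{0., 0., 1.}, {2.49, 25., 2.}, {2.5, 49.9, 3.}, {5., 50., 4.}};

ix = binIndex[sample[[All, 1]], {0, 5}, 100];
iy = binIndex[sample[[All, 2]], {0, 50}, 100];
indexPairs = Transpose[{ix, iy}]
```

The last row sits on the upper edge of both ranges and is clipped into bin 100 rather than producing index 101.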


小瓶盖 2024-12-23 02:36:35

Here's my approach, using the function SelectEquivalents defined in "What is in your Mathematica tool bag?", which is well suited to a problem like this one.

data = RandomReal[100, {75, 3}];
bins = Range[0, 100, 20];
binMiddles = (Most@bins + Rest@bins)/2;
nearest = Nearest[binMiddles];

SelectEquivalents[
   data
   ,
   TagElement -> ({First@nearest[#[[1]]], First@nearest[#[[2]]]} &)
   ,
   TransformElement -> (#[[3]] &)
   ,
   TransformResults -> (Total[#2] &)
   ,
   TagPattern -> Flatten[Outer[List, binMiddles, binMiddles], 1]
   , 
   FinalFunction -> (Partition[Flatten[# /. {} -> 0], Length[binMiddles]] &)
]

If you want to group by more than two dimensions, you can use the following function inside FinalFunction to reshape the flat result list to the desired dimensions (I don't remember where I found it).

InverseFlatten[l_,dimensions_]:= Fold[Partition[#, #2] &, l, Most[Reverse[dimensions]]];
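
A short usage sketch for InverseFlatten, with a made-up flat list standing in for the result of the grouping step. The definition is repeated so the snippet runs on its own. For a full rectangular result, the built-in ArrayReshape (version 9 or later) should give the same nesting.

```mathematica
(* Definition repeated from above so the snippet runs on its own *)
InverseFlatten[l_, dimensions_] := Fold[Partition[#, #2] &, l, Most[Reverse[dimensions]]];

flat = Range[24];                         (* made-up flat result list *)
cube = InverseFlatten[flat, {2, 3, 4}];

Dimensions[cube]                          (* {2, 3, 4} *)
cube === ArrayReshape[flat, {2, 3, 4}]    (* True *)
```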


偷得浮生 2024-12-23 02:36:34

Here is a method based on Szabolcs's post that is about an order of magnitude faster.

data = RandomReal[5, {500000, 3}];
(*500k values*)
zvalues = data[[All, 3]];

epsilon = 1*^-10;(*prevent 101 index*)
(*rescale and round (x,y) coordinates to index pairs in the 1..100 range*)
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];

res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]}, 
    SparseArray[
     gb[[All, 1, 1]] -> 
      Total[gb[[All, All, 2]], {2}]]]; // AbsoluteTiming

Gives about {2.012217, Null}

AbsoluteTiming[
 System`SetSystemOptions[ 
  "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
 res3 = SparseArray[indexes -> zvalues];
 System`SetSystemOptions[ 
  "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 0}];
 ]

Gives about {0.195228, Null}

res3 == res2
True

"TreatRepeatedEntries" -> 1 makes SparseArray add up the values given for repeated positions.
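
To make "adding up repeated positions" concrete without touching system options, here is a small sketch on made-up data. It uses GroupBy (version 10 or later), which is a documented alternative and not the method timed above; it is meant only as a cross-check of what the summed matrix should look like, not as a faster approach.

```mathematica
(* Made-up index pairs and values; position {1, 1} occurs twice *)
idx = {{1, 1}, {2, 2}, {1, 1}};
vals = {10, 20, 5};

(* Sum values that share an index pair, then build the array *)
summed = GroupBy[Thread[idx -> vals], First -> Last, Total];
Normal[SparseArray[Normal[summed], {2, 2}]]   (* {{15, 0}, {0, 20}} *)
```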
