Fast 2D binning algorithm in Mathematica
I am having some trouble developing a suitably fast binning algorithm in Mathematica. I have a large (~100k elements) data set of the form
T={{x1,y1,z1},{x2,y2,z2},....}
and I want to bin it into a 2D array of around 100x100 bins, with the bin value being given by the sum of the Z values that fall into each bin.
Currently I am iterating through each element of the table, using Select to pick out which bin it is supposed to be in based on lists of bin boundaries, and adding the z value to a list of values occupying that bin. At the end I map Total onto the list of bins, summing their contents (I do this because I sometimes want to do other things, like maximize).
I have tried using Gather and other such functions to do this, but the above method was ridiculously faster, though perhaps I am using Gather poorly. Anyway, it still takes a few minutes to do the sorting by my method, and I feel like Mathematica can do better. Does anyone have a nice efficient algorithm handy?
Comments (4)
I intend to rewrite the code below because of Szabolcs' readability concerns. Until then, know that if your bins are regular, and you can use Round, Floor, or Ceiling (with a second argument) in place of Nearest, the code below will be much faster. On my system, it tests faster than the GatherBy solution also posted.
Assuming I understand your requirements, I propose:
Refactored:
Use:
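As a rough illustration of the regular-bin shortcut mentioned above (this is not the answer's own code), the sketch below labels each point by the lower-left corner of its bin using Floor with a second argument and then sums z per bin. The sample points, the bin width, and the names pts, width, groups, and sums are all assumptions made for the example.

```mathematica
(* Hypothetical sample points {x, y, z}; the bin width is also an assumption *)
pts = {{0.1, 0.2, 1.}, {0.3, 0.4, 2.}, {0.7, 0.1, 4.}, {0.6, 0.9, 8.}};
width = 0.5;

(* Floor[v, width] rounds each coordinate down to a multiple of width,
   so points in the same regular bin share the same {x, y} label *)
groups = GatherBy[pts, Floor[#[[{1, 2}]], width] &];

(* For each group: {bin corner, sum of the z values in that bin} *)
sums = {Floor[#[[1, {1, 2}]], width], Total[#[[All, 3]]]} & /@ groups;
```

Replacing Total with Max in the last line would give the per-bin maximum instead of the sum, which matches the other reduction mentioned in the question.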
Here's my approach:
These two approaches (res1 & res2) can handle 100k and 200k elements per second, respectively, on this machine. Is this sufficiently fast, or do you need to run this whole program in a loop?
Here's my approach using the function SelectEquivalents defined in "What is in your Mathematica tool bag?", which is perfect for a problem like this one.
If you want to group according to more than two dimensions, you could use this function in FinalFunction to give the resulting list the desired dimensions (I don't remember where I found it).
Here is a method based on Szabolcs's post that is about an order of magnitude faster.
Gives about {2.012217, Null}
Gives about {0.195228, Null}
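"TreatRepeatedEntries" is one of the SparseArrayOptions system options, which suggests the faster variant accumulates z values through SparseArray. A minimal sketch of that idea follows. It is an assumption about the general technique, not a reproduction of the answer's code, and the random test data, the unit-square range, and the names data, nbins, idx, old, and binned are hypothetical.

```mathematica
(* Hypothetical test data: x and y in the unit square, z arbitrary *)
SeedRandom[1];
data = RandomReal[1, {100000, 3}];
nbins = 100;

(* Integer bin indices 1..nbins for x and y; Clip guards the upper edge *)
idx = Clip[Floor[nbins data[[All, {1, 2}]]] + 1, {1, nbins}];

(* Save the current settings, then make SparseArray sum repeated positions *)
old = SystemOptions["SparseArrayOptions"];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];

(* Each {i, j} position receives the sum of all z values that map to it *)
binned = Normal@SparseArray[idx -> data[[All, 3]], {nbins, nbins}];

(* Restore the previous settings so other code is unaffected *)
SetSystemOptions[old];
```

Because the system option is global, saving and restoring it around the call keeps the change from leaking into unrelated SparseArray code. This form only produces sums; other reductions such as Max would need a grouping approach instead.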
"TreatRepeatedEntries" -> 1 adds duplicate positions up.