Python: Create a distribution from a list based on the number of items in specific ranges
I tagged this question with poisson, as I am not sure if it will be helpful in this case.
I need to create a distribution (probably formatted as an image in the end) from a list of data.
For example:
data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59]
The data should be usable to create a visual distribution. In this case, for example, I might say that the ranges have a width of 10 and that a range needs at least 3 items to be a valid point.
With this example data, I would expect the result to be analogous to
distribution = [1, 2, 4, 6]
since I have at least 3 items in the ranges 0-9, 10-19, 30-39 and 50-59 (numbering the ranges from 1). Using that result I could generate an image that has the sections segmented out (darker color) that exist in my final distribution. An example of the type of image I am trying to create can be seen below and would have been generated with far more data. Ignore the blue line for now.
I know how to do this the brute force way of iterating over every item in the list and doing my calculation like that. But, my data set may have hundreds of thousands, or even millions of numbers. My range (10) and my required number of items (3) will likely be much larger in a real world example.
Thanks for any help.
Comments (3)
If `data` is always sorted, a compact approach is to pass it to `itertools.groupby` with a `key=` that divides each item by 10 (with truncation), and to keep the keys of the groups that hold at least 3 items.

If `data` isn't sorted, or if you don't know, use `sorted(data)` as the first argument to `itertools.groupby`, instead of just `data`.

If you prefer a less dense/compact approach, you can of course expand this into more explicit steps.

In either case, the mechanism is that `groupby` first applies the callable passed as `key=` to each item in the iterable passed as its first argument, to obtain each item's "key"; for each consecutive group of items which have the same "key", `groupby` yields a tuple with two items: the value of the key, and an iterable over all items in said group.

Here, the key is obtained by dividing an item by 10 (with truncation); `len(list(g))` is the number of consecutive items with that "key". Since the items must be consecutive, you need the data to be sorted (and it's simpler to just sort it than to sort it "by value divided by 10 with truncation" ;-).
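The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of the `groupby` mechanism described above, not the answerer's exact code. The function names `dense_ranges` and `dense_ranges_expanded` and the parameters `width` and `min_count` are assumptions made for illustration.

```python
import itertools

def dense_ranges(data, width=10, min_count=3):
    """Compact form: keys (item // width) of ranges with at least min_count items."""
    # groupby only merges *consecutive* items, so sort first to be safe.
    grouped = itertools.groupby(sorted(data), key=lambda x: x // width)
    return [k for k, g in grouped if len(list(g)) >= min_count]

def dense_ranges_expanded(data, width=10, min_count=3):
    """Same logic, written out step by step."""
    result = []
    for key, group in itertools.groupby(sorted(data), key=lambda x: x // width):
        count = sum(1 for _ in group)  # count items without building a list
        if count >= min_count:
            result.append(key)
    return result

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10,
        22, 30, 30, 35, 46, 58, 59, 59]
print(dense_ranges(data))           # [0, 1, 3, 5]
print(dense_ranges_expanded(data))  # [0, 1, 3, 5]
```

The keys are zero-based (0 stands for the range 0-9), so adding 1 to each gives the `[1, 2, 4, 6]` numbering used in the question.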
Since `data` might be very lengthy, you may want to look into using numpy. It provides many useful functions for numerical work, it requires less memory to store `data` in a numpy array than in a Python list[*], and, since many of the numpy functions call C functions under the hood, you may be able to obtain some speed gains.

[*] Note: if `data` is first built as a Python list and then converted to a numpy array, the maximum memory requirement is actually greater than if you had just used a Python list. The memory should get freed, however, if there are no other references to the Python list. Alternatively, if the data is stored on disk, `numpy.loadtxt` can be used to read it directly into a numpy array.
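The numpy code from this answer is not preserved here either. As a hedged sketch of one way to do the counting with numpy, the following uses `np.bincount` on the bucket index of each item. It assumes the data consists of non-negative integers, and the function name `dense_ranges_np` and its parameters are hypothetical.

```python
import numpy as np

def dense_ranges_np(data, width=10, min_count=3):
    """Zero-based indices of width-sized ranges holding at least min_count items.

    Assumes non-negative integers, which np.bincount requires.
    """
    arr = np.asarray(data)
    keys = arr // width          # bucket index for every item
    counts = np.bincount(keys)   # counts[i] = number of items in bucket i
    return np.flatnonzero(counts >= min_count)

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10,
        22, 30, 30, 35, 46, 58, 59, 59]
print(dense_ranges_np(data).tolist())  # [0, 1, 3, 5]
```

No sorting is needed in this version, because `np.bincount` counts every bucket in a single pass over the array.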
This sounds like a job for some form of histogram. Presorting should not be necessary to accomplish this. I discuss using a variant of bucket sort to group nearby elements here, though you'll need to adjust that algorithm to suit your purposes. Note that you do not need to store the numbers themselves in the buckets in order to form a histogram.
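As a rough illustration of the histogram idea (plain counting with `collections.Counter`, not the bucket-sort variant the answer links to), the sketch below makes one pass over unsorted data and stores only a count per bucket. The name `dense_ranges_hist` and its parameters are assumptions made for this example.

```python
from collections import Counter

def dense_ranges_hist(data, width=10, min_count=3):
    """Zero-based bucket indices with at least min_count items; no presorting."""
    # Only the per-bucket counts are kept, never the numbers themselves.
    counts = Counter(x // width for x in data)
    return sorted(k for k, n in counts.items() if n >= min_count)

data = [59, 1, 30, 2, 2, 10, 2, 58, 2, 3, 35, 3, 3, 4,
        4, 5, 10, 10, 22, 30, 46, 59]  # same values as the question, shuffled
print(dense_ranges_hist(data))  # [0, 1, 3, 5]
```

Memory use grows with the number of distinct buckets rather than with the length of `data`, which may matter for lists with millions of items.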