Python: Creating a distribution from a list based on the number of items in specific ranges

Posted on 2024-09-15 19:38:11

I tagged this question with poisson as I am not sure if it will be helpful in this case.

I need to create a distribution (probably formatted as an image in the end) from a list of data.

For example:

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59]

such that the data can be used to create a visual distribution. I might, for example in this case, say that each range has a width of 10 and there need to be at least 3 items in a range for it to be a valid point.

With this example data, I would expect the result to be analogous to

distribution = [1, 2, 4, 6]

since I have at least 3 items in ranges 0-9, 10-19, 30-39 and 50-59. Using that result I could generate an image in which the sections that exist in my final distribution are segmented out (darker color). An example of the type of image I am trying to create can be seen below; it would have been generated with far more data. Ignore the blue line for now.

I know how to do this the brute force way of iterating over every item in the list and doing my calculation like that. But, my data set may have hundreds of thousands, or even millions of numbers. My range (10) and my required number of items (3) will likely be much larger in a real world example.

[Image: distribution image]

Thanks for any help.


棒棒糖 2024-09-22 19:38:11

If data is always sorted, a compact approach might be:

import itertools as it

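# Group consecutive items by their tens bucket (x // 10) and keep the
# 1-based bucket number of every group holding at least 3 items.
d = [k + 1 for k, L in
         ((k, len(list(g))) for k, g in it.groupby(data, key=lambda x: x // 10))
     if L >= 3]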

If data isn't sorted, or if you don't know, use sorted(data) as the first argument to itertools.groupby, instead of just data.
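As a minimal sketch of that variant, the snippet below feeds sorted(data) to groupby. The shuffled list is a hypothetical reordering of the question's values, used only for illustration.

```python
import itertools as it

# Hypothetical unsorted input: the question's values in shuffled order.
data = [59, 2, 10, 30, 3, 2, 58, 10, 35, 2, 4,
        1, 22, 46, 3, 59, 10, 2, 30, 5, 3, 4]

# sorted() returns a new list, so the original order of data is untouched.
d = [k + 1
     for k, g in it.groupby(sorted(data), key=lambda x: x // 10)
     if len(list(g)) >= 3]

print(d)  # [1, 2, 4, 6]
```

Sorting costs O(n log n), which is usually acceptable even for a few million numbers.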

If you prefer a less dense/compact approach, you can of course expand this, e.g. to:

def divby10(x):
    return x // 10

distribution = []
for k, g in it.groupby(data, key=divby10):
    count = len(list(g))  # number of consecutive items sharing key k
    if count >= 3:
        distribution.append(k + 1)

In either case, the mechanism is that groupby first applies the callable passed as key= to each item in the iterable passed as its first argument, to obtain each item's "key"; for each consecutive group of items which have the same "key", groupby yields a tuple with two items: the value of the key, and an iterable over all items in said group.
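To make the mechanism concrete, this small sketch materializes every (key, group) pair that groupby yields for the question's sorted data. The variable name groups is just an illustrative choice.

```python
import itertools as it

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10,
        22, 30, 30, 35, 46, 58, 59, 59]

# Each group is a lazy iterator, so copy it into a list before moving on.
groups = [(k, list(g)) for k, g in it.groupby(data, key=lambda x: x // 10)]

for k, items in groups:
    print(k, items)
# 0 [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]
# 1 [10, 10, 10]
# 2 [22]
# 3 [30, 30, 35]
# 4 [46]
# 5 [58, 59, 59]
```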

Here, the key is obtained by floor-dividing an item by 10 (x // 10, which is the same as truncation for non-negative values); len(list(g)) is the number of consecutive items with that "key". Since the items must be consecutive, you need the data to be sorted (and it's simpler to just sort it than to sort it "by value divided by 10 with truncation" ;-).
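The following sketch shows why consecutiveness matters: on a small, made-up unsorted list, groupby splits each bucket into several runs. It then shows a different technique, collections.Counter, which is not part of the approach above but counts per bucket without needing sorted input.

```python
import itertools as it
from collections import Counter

# Made-up unsorted sample, used only for illustration.
unsorted_data = [1, 12, 3, 15, 2, 11]

# groupby only merges neighbours, so keys 0 and 1 each appear three times.
runs = [(k, len(list(g)))
        for k, g in it.groupby(unsorted_data, key=lambda x: x // 10)]
print(runs)  # [(0, 1), (1, 1), (0, 1), (1, 1), (0, 1), (1, 1)]

# Alternative technique: Counter tallies keys regardless of input order.
counts = Counter(x // 10 for x in unsorted_data)
distribution = sorted(k + 1 for k, n in counts.items() if n >= 3)
print(distribution)  # [1, 2]
```

Counter makes a single pass over the data, so it may be preferable when the input is large and arrives unsorted.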


无远思近则忧 2024-09-22 19:38:11

Since data might be very lengthy, you may want to look into using numpy. It provides many useful functions for numerical work, it requires less memory to store data in a numpy array than a Python list[*], and, since many of the numpy functions call C functions under the hood, you may be able to obtain some speed gains:

import numpy as np

data = np.array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59])

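# Seven evenly spaced edges 0, 10, ..., 60 define six bins of width 10.
hist, bins = np.histogram(data, bins=np.linspace(0, 60, 7))
print(hist)
# [11  3  1  3  1  3]

# Indices of bins holding at least 3 items, shifted to be 1-based.
distribution = np.where(hist >= 3)[0] + 1
print(distribution)
# [1 2 4 6]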
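Since the question mentions that the range width and the required item count will be larger in practice, here is one possible way to parameterize the same idea. The function name valid_ranges and its parameters width and min_items are hypothetical, and the sketch assumes the data contains only non-negative numbers.

```python
import numpy as np

def valid_ranges(data, width=10, min_items=3):
    """Return 1-based indices of width-sized ranges holding >= min_items values.

    Assumes non-negative data; names and defaults are illustrative only.
    """
    data = np.asarray(data)
    # Enough bins to cover the largest value: edges 0, width, 2*width, ...
    n_bins = int(data.max()) // width + 1
    edges = np.arange(n_bins + 1) * width
    hist, _ = np.histogram(data, bins=edges)
    return np.flatnonzero(hist >= min_items) + 1

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10,
        22, 30, 30, 35, 46, 58, 59, 59]

print(valid_ranges(data))                          # [1 2 4 6]
print(valid_ranges(data, width=20, min_items=4))   # [1 2 3]
```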

[*] -- Note: In the code above, a Python list was formed in the process of defining data. So the maximum memory requirement here is actually greater than if you had just used a Python list. The memory should get freed however, if there are no other references to the Python list. Alternatively, if the data is stored on disk, numpy.loadtxt can be used to read it directly into a numpy array.
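A small sketch of the numpy.loadtxt route follows. An in-memory io.StringIO stands in for a file on disk, and the one-value-per-line layout is an assumption about how the data might be stored.

```python
import io
import numpy as np

# Stand-in for a file on disk: one value per line (assumed layout).
fake_file = io.StringIO("1\n2\n2\n10\n10\n10\n35\n")

# loadtxt accepts a filename or a file-like object; dtype=int keeps integers.
data = np.loadtxt(fake_file, dtype=int)

# Same histogram step as above, with edges 0, 10, 20, 30, 40.
hist, _ = np.histogram(data, bins=np.arange(0, 50, 10))
distribution = np.flatnonzero(hist >= 3) + 1
print(distribution)  # [1 2]
```

With a real file, the path would be passed in place of fake_file.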
