Fastest way to zero out low values in an array?

Posted on 2024-08-08 20:38:01

So, let's say I have 100,000 float arrays with 100 elements each. I need the highest X number of values, BUT only if they are greater than Y. Any element not matching this should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already set to 0.

sample variables:

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

expected result:

array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]

风透绣罗衣 2024-08-15 20:38:01

This is a typical job for NumPy, which is very fast for these kinds of operations:

import numpy

array_np = numpy.asarray(array)
low_values_flags = array_np < lowValY  # Where values are low
array_np[low_values_flags] = 0  # All low values set to 0

Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0 and sorting them) and only sort the list of large elements:

array_np = numpy.asarray(array)
print(numpy.sort(array_np[array_np >= lowValY])[-highCountX:])

Of course, sorting the whole array if you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.
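Since the question involves 100,000 arrays of 100 elements, the same idea can be applied to all rows at once. The following is only a minimal sketch, not a benchmarked solution: the helper name `keep_top_rows` and the 2-D layout (one array per row) are assumptions made for illustration, and it uses `numpy.partition` to find each row's cutoff instead of a full sort.

```python
import numpy as np

def keep_top_rows(data, high_count, low_val):
    """Zero every element that is not above low_val or not among
    the high_count largest values of its row.

    Hypothetical helper; assumes 1 <= high_count <= number of columns.
    """
    data = np.asarray(data, dtype=float)
    # Index that the high_count-th largest value occupies in sorted order
    kth = data.shape[1] - high_count
    # Per-row cutoff, kept as a column so it broadcasts against each row
    cutoff = np.partition(data, kth, axis=1)[:, kth:kth + 1]
    # Keep elements that reach the row cutoff and exceed the threshold
    return np.where((data >= cutoff) & (data > low_val), data, 0.0)

rows = np.array([
    [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0],
    [.2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])
result = keep_top_rows(rows, 3, 0.1)
```

Order is preserved because nothing is moved, only overwritten. If several elements tie at the cutoff value, a row can end up with more than `high_count` non-zero entries.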

英雄似剑 2024-08-15 20:38:01

from scipy.stats import threshold
thresholded = threshold(array, 0.5)
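Note that `scipy.stats.threshold` may not be importable in current SciPy releases (as far as I know it was deprecated and later removed). A rough NumPy equivalent of the call above, assuming the second argument is the lower limit below which values are replaced with 0, could look like this:

```python
import numpy as np

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
values = np.asarray(array)

# Replace everything below the lower limit (0.5, as in the call above) with 0
thresholded = np.where(values < 0.5, 0, values)
```

This only covers the "greater than Y" half of the question; picking the highest X values still needs a separate step.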

心舞飞扬 2024-08-15 20:38:01

There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any precondition. This represents your need better than assigning zeroes: NumPy operations will ignore masked values when appropriate (for example, when computing the mean).

>>> from numpy import ma
>>> x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
>>> x1 = ma.masked_inside(x, 0, 0.1) # mask everything in 0..0.1 range
>>> x1
masked_array(data = [-- 0.25 -- 0.15 0.5 -- -- -- -- --],
         mask = [ True False True False False True True True True True],
   fill_value = 1e+20)
>>> print(x1.filled(0)) # Fill with zeroes
[ 0 0.25 0 0.15 0.5 0 0 0 0 0 ]

As an added benefit, masked arrays are well supported in the matplotlib visualisation library if you need this.

Docs on masked arrays in numpy
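To illustrate the point about operations ignoring masked values, here is a small sketch using the sample data from the question. It uses `ma.masked_less_equal` as one possible way to mask everything at or below the threshold; the variable names are only illustrative.

```python
import numpy.ma as ma

x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])

# Mask every element that is not greater than the threshold 0.1
x1 = ma.masked_less_equal(x, 0.1)

kept = x1.count()          # number of unmasked elements
mean_of_kept = x1.mean()   # averages only the unmasked values .25, .15 and .5
zeroed = x1.filled(0)      # plain ndarray with masked entries set to 0
```

The mean is taken over the three surviving values only, whereas the mean of the zero-filled array would be pulled down by the seven zeroes.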

善良天后 2024-08-15 20:38:01

Using numpy:

# assign zero to all elements less than or equal to `lowValY`
a[a<=lowValY] = 0 
# find n-th largest element in the array (where n=highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# assign zero to everything below the n-th largest element
a[a<x] = 0 #NOTE: it might leave more than highCountX non-zero elements
           #      if there are duplicates

Where partial_sort could be:

def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n]
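The `partial_sort` above still sorts the whole array. As a sketch of a genuinely partial selection, `numpy.partition` can find the n-th largest value without a full sort. The helper name `nth_largest` is made up for this example, and it assumes `1 <= n <= a.size`.

```python
import numpy as np

def nth_largest(a, n):
    """Return the n-th largest value of a 1-D array (hypothetical helper)."""
    # After partitioning, index k holds the value it would have in sorted order
    k = a.size - n
    return np.partition(a, k)[k]

a = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
highCountX = 3
lowValY = .1

a[a <= lowValY] = 0             # zero the low values first
x = nth_largest(a, highCountX)  # cutoff for the highCountX largest values
a[a < x] = 0                    # duplicates of x, if any, are all kept
```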

The expression a[a<value] = 0 can be written without numpy as follows:

for i, x in enumerate(a):
    if x < value:
        a[i] = 0

他不在意 2024-08-15 20:38:01

The simplest way would be:

topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])

In pieces, this selects all the elements greater than lowValY:

[x for x in array if x > lowValY]

This list contains only the elements greater than the threshold. It is then sorted so the largest values are at the start:

sorted(..., reverse=True)

Then a list index takes the threshold for the top highCountX elements:

sorted(...)[highCountX-1]

Finally, the original array is filled out using another list comprehension:

[x if x >= topX else 0 for x in array]

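There is a boundary condition where two or more equal elements tie for the highCountX-th highest value (the 3rd highest in your example). The resulting array will then contain that value more than once, so it ends up with more than highCountX non-zero elements.

There are other boundary conditions as well, such as len(array) < highCountX, or fewer than highCountX elements being greater than lowValY (the list index then raises an IndexError). Handling such conditions is left to the implementor.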
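One possible way to handle the short-list boundary condition is sketched below. The function name `keep_top` and its parameter names are invented for this example; the choice to keep every qualifying element when fewer than `high_count` exceed the threshold is an assumption about the desired behaviour.

```python
def keep_top(values, high_count, low_val):
    """Zero all elements except the high_count largest ones above low_val."""
    # Candidates above the threshold, largest first
    candidates = sorted((x for x in values if x > low_val), reverse=True)
    if not candidates or high_count <= 0:
        # Nothing qualifies, so everything is zeroed
        return [0 for _ in values]
    # Clamp the index so a short candidate list does not raise IndexError
    cutoff = candidates[min(high_count, len(candidates)) - 1]
    return [x if x >= cutoff else 0 for x in values]

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
result = keep_top(array, 3, .1)
short = keep_top([.2, 0, .05], 3, .1)
```

Ties at the cutoff are still all kept, as described above.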

柒七 2024-08-15 20:38:01

Setting elements below some threshold to zero is easy:

array = [ x if x > threshold else 0.0 for x in array ]

(plus the occasional abs() if needed.)

The requirement of the N highest numbers is a bit vague, however. What if there are e.g. N+1 equal numbers above the threshold? Which one to truncate?

You could sort the array first, then set the threshold to the value of the Nth element:

threshold = sorted(array, reverse=True)[N-1]
array = [ x if x >= threshold else 0.0 for x in array ]

Note: this solution is optimized for readability not performance.
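If exactly N elements must survive even when values tie, one option is to select by position rather than by value. The sketch below arbitrarily resolves ties in favour of the earlier element; that rule, the function name `keep_exactly_n` and its parameters are assumptions for illustration, not something the question specifies.

```python
import heapq

def keep_exactly_n(values, n, low):
    """Keep at most n elements greater than low; zero everything else."""
    # Pair each candidate with its negated index so that, among equal
    # values, the earlier position ranks higher
    candidates = [(x, -i) for i, x in enumerate(values) if x > low]
    keep = {-neg_i for _, neg_i in heapq.nlargest(n, candidates)}
    return [x if i in keep else 0.0 for i, x in enumerate(values)]

tied = keep_exactly_n([.3, .3, .3, .3, .05], 2, .1)
sample = keep_exactly_n([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0], 3, .1)
```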

掩耳倾听 2024-08-15 20:38:01

You can use map and lambda; it should be fast enough. In Python 3, map returns an iterator, so wrap it in list():

new_array = list(map(lambda x: x if x > lowValY else 0, array))

橘虞初梦 2024-08-15 20:38:01

Use a heap.

This works in time O(n*lg(HighCountX)).

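import heapq

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# Seed a min-heap with highCountX copies of lowValY, so heap[0]
# never drops below the threshold
heap = [lowValY] * highCountX

for x in array:
    # Replace the smallest candidate whenever a larger value turns up
    if x > heap[0]:
        heapq.heapreplace(heap, x)

# highCountX-th largest value, or lowValY if fewer values qualify
cutoff = heap[0]

array = [x if x >= cutoff and x > lowValY else 0 for x in array]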

For a heap of k elements, delete-min takes O(lg(k)) and insertion takes O(lg(k)) or O(1), depending on which heap type you use.

忆梦 2024-08-15 20:38:01

Using a heap is a good idea, as egon says. But you can use the heapq.nlargest function to cut down on some effort:

import heapq 

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

threshold = max(heapq.nlargest(highCountX, array)[-1], lowValY)
array = [x if x >= threshold else 0 for x in array]
