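Fastest way to zero out low values in an array?
So, let's say I have 100,000 float arrays with 100 elements each. I need the X largest values, but only if they are greater than Y. Any element that does not match should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already set to 0.
Example variables: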
array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1
Expected result:
array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]
This is a typical job for NumPy, which is very fast for these kinds of operations:
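A minimal sketch of that thresholding step (not necessarily the answerer's original code), assuming the data is held in a NumPy float array. The names `array` and `lowValY` are taken from the question:

```python
import numpy as np

array = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
lowValY = .1

# Boolean-mask assignment: zero every element below the threshold, in place.
array[array < lowValY] = 0

print(array.tolist())
```

For the 100,000 arrays in the question, the same assignment should also work on a single 2-D array of shape (100000, 100), which avoids a Python-level loop.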
Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0 and sorting them) and only sort the list of large elements:
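One way this could look, as a sketch under the same assumptions (NumPy float array, variable names from the question). Note that it only yields the largest values themselves, not their positions in the original array:

```python
import numpy as np

array = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
highCountX = 3
lowValY = .1

# Keep only the elements above the threshold, then sort just those.
large = array[array > lowValY]
large.sort()  # ascending, in place (boolean indexing returned a copy)

# The highCountX largest values sit at the end of the sorted copy.
top_values = large[-highCountX:]
print(top_values.tolist())
```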
Of course, sorting the whole array if you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.
:)
There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any precondition. This represents your need better than assigning zeroes: NumPy operations will ignore masked values when appropriate (for example, when finding the mean value).
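A short sketch of the masking idea, assuming the threshold rule from the question (elements not greater than `lowValY` are hidden). It uses `numpy.ma.masked_less_equal`, one of several ways to build a masked array:

```python
import numpy as np
import numpy.ma as ma

array = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
lowValY = .1

# Mask (hide) every element that is not above the threshold.
masked = ma.masked_less_equal(array, lowValY)

# Reductions skip masked entries: this is the mean of .25, .15 and .5 only.
print(masked.mean())

# filled(0) converts back to a plain array with zeros in the masked slots.
zeroed = masked.filled(0)
print(zeroed.tolist())
```

Masking alone covers the "greater than Y" condition; picking the X largest values would still need a separate step.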
As an added benefit, masked arrays are well supported in the matplotlib visualisation library if you need this.
Docs on masked arrays in numpy
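Using numpy: zero every element that is not above lowValY, find the highCountX-th largest value with a partial_sort, then zero every element below that value.
Here partial_sort could be any helper that returns the first n elements of the sorted array.
The expression a[a<value] = 0 can be written without numpy as a loop over the elements that assigns 0 to each one smaller than value.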
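The pieces named in this answer can be sketched as follows. The body of `partial_sort` is an assumption (a full sort followed by a slice), since only its name appears in the answer, and the variable names `a`, `b`, and `x` are illustrative:

```python
import numpy as np

def partial_sort(a, n, reverse=False):
    """Return the first n elements of a in sorted order (assumed helper)."""
    # Simple stand-in: a full sort followed by a slice.
    return sorted(a, reverse=reverse)[:n]

a = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
highCountX = 3
lowValY = .1

# Step 1: zero everything at or below the threshold.
a[a <= lowValY] = 0

# Step 2: find the highCountX-th largest value.
x = partial_sort(a, highCountX, reverse=True)[-1]

# Step 3: zero everything smaller than it.
# Ties at x can leave more than highCountX non-zero elements.
a[a < x] = 0

# The masking expression a[a < value] = 0 without NumPy, on a plain list:
b = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
value = .1
for i, v in enumerate(b):
    if v < value:
        b[i] = 0
```

If the full sort turns out to be a bottleneck, `numpy.partition` is a possible replacement for the helper, though that is a different technique from the one the answer names.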
The simplest way would be:
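A compact sketch of this approach in Python 3, using the variable names from the question. The name `topX` is illustrative:

```python
array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# Value of the highCountX-th largest element above lowValY.
# Raises IndexError if fewer than highCountX elements exceed lowValY.
topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX - 1]

# Rebuild the list in its original order, zeroing everything below topX.
result = [x if x >= topX else 0 for x in array]
print(result)
```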
In pieces, this selects all the elements greater than lowValY. The resulting list contains only the elements above the threshold. It is then sorted so the largest values are at the start, and a list index takes the threshold for the top highCountX elements. Finally, the original array is filled out using another list comprehension.
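The same pipeline, split into named intermediate values so each step can be inspected. The names `above`, `ranked`, and `topX` are illustrative:

```python
array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# 1. Select the elements greater than lowValY.
above = [x for x in array if x > lowValY]

# 2. Sort so the largest values come first.
ranked = sorted(above, reverse=True)

# 3. Index to get the cut-off for the top highCountX elements.
topX = ranked[highCountX - 1]

# 4. Fill out the original array, preserving order.
result = [x if x >= topX else 0 for x in array]
```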
There is a boundary condition where two or more equal elements are tied for (in your example) the 3rd highest value. The resulting array will then contain that value more than once.
There are other boundary conditions as well, such as when len(array) < highCountX. Handling such conditions is left to the implementor.
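One possible way to handle these boundary conditions is to select positions rather than a cut-off value. This is a sketch, not part of the original answer: `zero_low_values` is a hypothetical helper, and breaking ties in favour of the earlier position is an arbitrary choice:

```python
def zero_low_values(array, highCountX, lowValY):
    """Keep at most highCountX of the largest values above lowValY; zero the rest."""
    if highCountX <= 0:
        return [0] * len(array)
    # Indices of candidates, largest value first. list.sort() is stable,
    # even with reverse=True, so among equal values the earlier index wins.
    candidates = [i for i, x in enumerate(array) if x > lowValY]
    candidates.sort(key=lambda i: array[i], reverse=True)
    # Slicing is safe even when fewer than highCountX candidates exist.
    keep = set(candidates[:highCountX])
    return [x if i in keep else 0 for i, x in enumerate(array)]
```

For instance, `zero_low_values([.3, .2, .2, .2], 2, .1)` keeps the `.3` and only the first `.2`.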
Setting elements below some threshold to zero is easy:
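For example, with a list comprehension (a sketch; `threshold` plays the role of `lowValY` from the question):

```python
array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
threshold = .1

# Keep values above the threshold, replace the rest with 0.0.
array = [x if x > threshold else 0.0 for x in array]
```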
(plus the occasional abs() if needed.)
The requirement of the N highest numbers is a bit vague, however. What if there are e.g. N+1 equal numbers above the threshold? Which one to truncate?
You could sort the array first, then set the threshold to the value of the Nth element:
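A sketch of that idea, assuming the array has at least N elements. `N` stands for `highCountX` from the question and `nth_value` is an illustrative name. The extra `x > lowValY` test keeps the original lower bound in force when fewer than N elements exceed it:

```python
array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
N = 3
lowValY = .1

# Value of the N-th largest element (index N - 1 after a descending sort).
nth_value = sorted(array, reverse=True)[N - 1]

# Keep an element only if it is among the top N values and above lowValY.
array = [x if x >= nth_value and x > lowValY else 0.0 for x in array]
```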
Note: this solution is optimized for readability, not performance.
You can use map and lambda; it should be fast enough.
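A minimal sketch for Python 3, covering only the threshold part of the question. The name `new_array` is illustrative:

```python
array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
lowValY = .1

# map() is lazy in Python 3, so wrap it in list() to materialize the result.
new_array = list(map(lambda x: x if x > lowValY else 0, array))
```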
Use a heap.
This works in O(n*lg(highCountX)) time. deletemin on the heap works in O(lg(k)), and insertion in O(lg(k)) or O(1), depending on which heap type you use.
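A sketch of the heap approach with the standard heapq module, assuming highCountX is at least 1. The heap never grows beyond highCountX entries, which is where the lg(highCountX) factor comes from. The names `heap`, `cutoff`, and `result` are illustrative:

```python
import heapq

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

heap = []  # min-heap holding the largest candidates seen so far
for x in array:
    if x <= lowValY:
        continue  # not above the threshold: never a candidate
    if len(heap) < highCountX:
        heapq.heappush(heap, x)
    elif x > heap[0]:
        # Pop the smallest kept value and push x in one step.
        heapq.heapreplace(heap, x)

if heap:
    cutoff = heap[0]  # smallest of the kept values
    result = [x if x >= cutoff else 0 for x in array]
else:
    result = [0] * len(array)  # nothing exceeded lowValY
```

As with the sort-based answers, values tied with `cutoff` can leave more than highCountX non-zero elements.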
Using a heap is a good idea, as egon says. But you can use the heapq.nlargest function to cut down on some effort:
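A sketch of how that might look. Filtering by lowValY before calling `heapq.nlargest` is a choice made here so that the threshold and the top-X rule are applied together:

```python
import heapq

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# nlargest returns up to highCountX candidates, largest first.
top = heapq.nlargest(highCountX, (x for x in array if x > lowValY))

if top:
    cutoff = top[-1]  # smallest of the selected values
    result = [x if x >= cutoff else 0 for x in array]
else:
    result = [0] * len(array)  # nothing exceeded lowValY
```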