Applying a function over the rows/columns of a numpy matrix

Posted on 2024-12-14 13:20:20

I am using NumPy to store data in matrices. Coming from an R background, I am used to having an extremely simple way to apply a function over the rows, the columns, or both of a matrix.

Is there something similar for the Python/NumPy combination? Writing my own little implementation is not a problem, but it seems to me that most of the versions I come up with will be significantly less efficient and more memory-intensive than any existing implementation.

I would like to avoid copying from the numpy matrix to a local variable etc.; is that possible?

The functions I am trying to implement are mainly simple comparisons (e.g. how many elements of a certain column are smaller than a number x, or how many of them have an absolute value larger than y).

Comments (4)

花间憩 2024-12-21 13:20:20

Almost all numpy functions operate on whole arrays, and/or can be told to operate on a particular axis (row or column).

As long as you can define your function in terms of numpy functions acting on numpy arrays or array slices, your function will automatically operate on whole arrays, rows or columns.
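
As a rough sketch of the point above, the two comparisons from the question can be written with the axis argument. The matrix and the thresholds x and y here are made-up illustration values, not data from the question.

```python
import numpy as np

# Hypothetical example data: 4 rows, 3 columns.
A = np.array([[ 1, -5,  3],
              [ 4,  2, -6],
              [-7,  8,  0],
              [ 2, -1,  9]])

x = 3  # illustrative threshold
y = 4  # illustrative threshold

# axis=0 collapses the rows, so there is one count per column.
below_x_per_col = (A < x).sum(axis=0)

# axis=1 collapses the columns, so there is one count per row.
big_abs_per_row = (np.abs(A) > y).sum(axis=1)

print(below_x_per_col)  # [3 3 2]
print(big_abs_per_row)  # [1 1 2 1]
```

Each comparison builds a temporary boolean array of the same shape as A, but no Python-level loop is involved.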

It may be more helpful to ask about how to implement a particular function to get more concrete advice.


Numpy provides np.vectorize and np.frompyfunc to turn Python functions which operate on numbers into functions that operate on numpy arrays.

For example,

import numpy as np

def myfunc(a, b):
    # return the larger of two scalars
    if a > b:
        return a
    else:
        return b

vecfunc = np.vectorize(myfunc)
result = vecfunc([[1, 2, 3], [5, 6, 9]], [7, 4, 5])
print(result)
# [[7 4 5]
#  [7 6 9]]

(The elements of the first array get replaced by the corresponding element of the second array when the second is bigger.)

But don't get too excited; np.vectorize and np.frompyfunc are just syntactic sugar. They don't actually make your code any faster. If your underlying Python function is operating on one value at a time, then np.vectorize will feed it one item at a time, and the whole operation is going to be pretty slow (compared to using a numpy function which calls some underlying C or Fortran implementation).
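
For this particular myfunc, a built-in alternative already exists: np.maximum computes the element-wise maximum and broadcasts the shorter array across the rows. The sketch below only checks that both approaches agree on the example data; it does not measure speed.

```python
import numpy as np

a = np.array([[1, 2, 3], [5, 6, 9]])
b = np.array([7, 4, 5])

def myfunc(p, q):
    # scalar version: return the larger of two numbers
    return p if p > q else q

slow = np.vectorize(myfunc)(a, b)  # calls myfunc once per element
fast = np.maximum(a, b)            # single call into compiled code

print(fast)
# [[7 4 5]
#  [7 6 9]]
print(np.array_equal(slow, fast))  # True
```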


To count how many elements of column x are smaller than a number y, you could use an expression such as:

(array['x']<y).sum()

For example:

import numpy as np
# np.int was removed in NumPy 1.24; use an explicit integer type instead
array = np.arange(6, dtype=np.int64).view([('x', np.int64), ('y', np.int64)])
print(array)
# [(0, 1) (2, 3) (4, 5)]

print(array['x'])
# [0 2 4]

print(array['x']<3)
# [ True  True False]

print((array['x']<3).sum())
# 2
等待我真够勒 2024-12-21 13:20:20

Selecting elements from a NumPy array based on one or more conditions is straightforward using NumPy's beautifully dense syntax:

>>> import numpy as NP
>>> # generate a matrix to demo the code
>>> A = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> A
  array([[6, 7, 6, 4, 8],
         [7, 3, 7, 9, 9],
         [4, 2, 5, 9, 8],
         [3, 8, 2, 6, 3],
         [2, 1, 8, 0, 0],
         [8, 3, 9, 4, 8],
         [3, 3, 9, 8, 4],
         [5, 4, 8, 3, 0]])

How many elements in column 2 (zero-based, i.e. the third column) are greater than 6?

>>> ndx = A[:,2] > 6
>>> ndx
      array([False,  True, False, False,  True,  True,  True,  True])
>>> NP.sum(ndx)
      5

How many elements in the last column of A have an absolute value larger than 3?

>>> A = NP.random.randint(-4, 4, 40).reshape(8, 5)
>>> A
  array([[-4, -1,  2,  0,  3],
         [-4, -1, -1, -1,  1],
         [-1, -2,  2, -2,  3],
         [ 1, -4, -1,  0,  0],
         [-4,  3, -3,  3, -1],
         [ 3,  0, -4, -1, -3],
         [ 3, -4,  0, -3, -2],
         [ 3, -4, -4, -4,  1]])

>>> ndx = NP.abs(A[:,-1]) > 3
>>> NP.sum(ndx)
      0

How many elements in the first two rows of A are greater than or equal to 2?

>>> ndx = A[:2,:] >= 2
>>> NP.sum(ndx.ravel())    # 'ravel' just flattens ndx, which is originally 2D (2x5)
      2
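
The examples above each use a single condition. Several conditions can be combined with the element-wise operators & (and) and | (or); the sketch below reuses the second matrix from this answer, with thresholds picked purely for illustration. The parentheses are needed because & and | bind more tightly than the comparison operators.

```python
import numpy as np

A = np.array([[-4, -1,  2,  0,  3],
              [-4, -1, -1, -1,  1],
              [-1, -2,  2, -2,  3],
              [ 1, -4, -1,  0,  0],
              [-4,  3, -3,  3, -1],
              [ 3,  0, -4, -1, -3],
              [ 3, -4,  0, -3, -2],
              [ 3, -4, -4, -4,  1]])

col = A[:, 0]  # first column: [-4 -4 -1  1 -4  3  3  3]

# elements strictly between -2 and 3
both = (col > -2) & (col < 3)

# elements below -3 or above 2
either = (col < -3) | (col > 2)

print(np.count_nonzero(both))    # 2
print(np.count_nonzero(either))  # 6
```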

NumPy's indexing syntax is pretty close to R's; given your fluency in R, here are the key differences between R and NumPy in this context:

NumPy indices are zero-based; in R, indexing begins at 1.

NumPy (like Python) allows you to index from right to left using negative indices--e.g.,

# to get the last column in A
A[:, -1]

# to get the penultimate column in A
A[:, -2]

# this is a big deal, because in R, the equivalent expression is:
A[, dim(A)[2] - 1]

NumPy uses colon ":" notation to denote "unsliced". For example, in R, to
get the first three rows of A, you would use A[1:3, ]. In NumPy, you
would use A[0:3, :], because NumPy slices exclude the end index (the "0"
is not necessary; in fact, A[:3, :] is preferable).
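
On the question's concern about copying: basic slices like the ones above are views onto the original data, while selecting with a boolean mask produces a new array. A small sketch with made-up data, using np.shares_memory to check:

```python
import numpy as np

A = np.arange(20).reshape(4, 5)  # hypothetical 4x5 example matrix

first_three_rows = A[:3, :]  # basic slice: a view, no data copied
last_col = A[:, -1]          # also a view

mask = last_col > 10         # the comparison allocates a small boolean array
picked = A[mask]             # boolean-mask indexing returns a copy

print(np.shares_memory(first_three_rows, A))  # True
print(np.shares_memory(last_col, A))          # True
print(np.shares_memory(picked, A))            # False
```

So counting with (A[:, -1] > 10).sum() only costs one temporary boolean array for the column, not a copy of the matrix.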

以为你会在 2024-12-21 13:20:20


I also come from more of an R background, and ran into the lack of a more versatile apply that can take short custom functions. I've seen forum answers suggesting the basic numpy functions, because many of them handle arrays. However, I kept getting confused by the way "native" numpy functions handle arrays (sometimes 0 seems to be row-wise and 1 column-wise, sometimes the opposite).

My personal solution for more flexible functions was to combine apply_along_axis with the anonymous lambda functions available in Python. Lambda functions should be very easy to understand for R-minded people who use a more functional programming style, as with the R functions apply, sapply, lapply, etc.

So, for example, I wanted to standardise the variables in a matrix. Typically in R there's a function for this (scale), but you can also build it easily with apply:

(R code)

apply(Mat,2,function(x) (x-mean(x))/sd(x) ) 

You see how the body of the function inside apply, (x-mean(x))/sd(x), is the bit we can't type directly into Python's apply_along_axis. With a lambda this is easy to implement for one set of values:

(Python)

import numpy as np
vec=np.random.randint(1,10,10)  # some random data vector of integers

(lambda x: (x-np.mean(x))/np.std(x)  )(vec)

Then, all we need is to plug this inside the python apply and pass the array of interest through apply_along_axis

Mat=np.random.randint(1,10,3*4).reshape((3,4))  # some random data matrix
np.apply_along_axis(lambda x: (x-np.mean(x))/np.std(x),0,Mat )

Obviously, the lambda function could be implemented as a separate function, but I guess the whole point is to use rather small functions contained within the line where apply originated.
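
For this particular standardisation, the same result can also be obtained without apply_along_axis by passing axis to mean and std directly. This is a sketch on a fixed, made-up matrix so the columns are guaranteed to be non-constant. One caveat when comparing with R: np.std divides by n by default, whereas R's sd divides by n-1, so ddof=1 is needed to reproduce the R numbers.

```python
import numpy as np

# Hypothetical 3x4 data matrix (fixed values, no constant columns).
Mat = np.array([[1., 2., 3., 4.],
                [5., 7., 6., 8.],
                [9., 3., 2., 1.]])

# The approach from the answer: the lambda is called once per column.
via_apply = np.apply_along_axis(lambda x: (x - np.mean(x)) / np.std(x), 0, Mat)

# Same computation using axis=0 and broadcasting, no Python-level calls per column.
via_axis = (Mat - Mat.mean(axis=0)) / Mat.std(axis=0)

# Variant matching R's sd(), which uses the n-1 denominator.
r_like = (Mat - Mat.mean(axis=0)) / Mat.std(axis=0, ddof=1)

print(np.allclose(via_apply, via_axis))  # True
```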

I hope you find it useful !

柠檬心 2024-12-21 13:20:20

Pandas is very useful for this. For instance, DataFrame.apply() and groupby's apply() should help you.
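
A minimal sketch of what that could look like, assuming pandas is installed; the column names and values are invented for illustration. DataFrame.apply calls the function once per column by default (axis=0) and once per row with axis=1.

```python
import pandas as pd

# Hypothetical data with two named columns.
df = pd.DataFrame({"x": [0, 2, 4], "y": [1, 3, 5]})

# Count, per column, how many values are below 3.
below_3 = df.apply(lambda col: (col < 3).sum())

# Apply a function to each row instead.
row_max = df.apply(lambda row: row.max(), axis=1)

# For simple comparisons, the apply call is not needed at all.
direct = (df < 3).sum()

print(below_3["x"], below_3["y"])  # 2 1
print(row_max.tolist())            # [1, 3, 5]
```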
