Is there a way to apply a function to all rows with the same values in a NumPy array?

Posted on 2025-01-11 16:56:39

Let's say we have a matrix, A, that has the following values:

In [2]: A
Out[2]: 
array([[1, 1, 3],
       [1, 1, 5],
       [1, 1, 7],
       [1, 2, 3],
       [1, 2, 9],
       [2, 1, 5],
       [2, 2, 1],
       [2, 2, 8],
       [2, 2, 3]])

Is there a way to apply a function, e.g., np.mean, row-wise to the values of the third column where the first and second columns are equal, i.e., to get matrix B:

In [4]: B
Out[4]: 
array([[1, 1, 5],
       [1, 2, 6],
       [2, 1, 5],
       [2, 2, 4]])

My actual use case is much more complex. I have a large matrix with ~1M rows and 4 columns. The first three columns correspond to the (x, y, z) coordinates of a point in a point cloud, and the fourth column is the value of some function f, where f = f(x, y, z). I have to perform integration along the x-axis (the first column in the matrix) for all (y, z) pairs that are equal. I have to end up with a matrix whose number of rows corresponds to the number of unique (y, z) pairs, with three columns: y-axis, z-axis, and the value obtained from the integration. I have a few ideas, but all of them involve multiple for-loops and potential memory issues.

Is there any way to perform this in a vectorized fashion?

逆流佳人身旁 2025-01-18 16:56:39

You can use pandas if you have a lot of data:

import pandas as pd
df = pd.DataFrame(A, columns = ['id1','id2' ,'value'])
B = df.groupby(['id1','id2'])['value'].mean().reset_index().to_numpy()

Output:

[[1. 1. 5.]
 [1. 2. 6.]
 [2. 1. 5.]
 [2. 2. 4.]]

I assume this is the fastest way.
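
For the integration use case described in the question, the same groupby pattern extends naturally. A hedged sketch: the sample array P, its column names, and the trapezoid helper below are my own illustration, assuming a 4-column (x, y, z, f) layout:

```python
import numpy as np
import pandas as pd

# Hypothetical point-cloud sample: each row is (x, y, z, f(x, y, z)).
P = np.array([
    [0.0, 1, 1, 2.0],
    [1.0, 1, 1, 4.0],
    [2.0, 1, 1, 6.0],
    [0.0, 1, 2, 1.0],
    [1.0, 1, 2, 3.0],
], dtype=float)

def trapezoid(y, x):
    """Trapezoidal rule: integral of y over x (assumes x is sorted)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

df = pd.DataFrame(P, columns=["x", "y", "z", "f"])

# Sort by x so each group is integrated in the right order, then
# integrate f along x within every (y, z) group.
B = (
    df.sort_values("x")
      .groupby(["y", "z"])[["x", "f"]]
      .apply(lambda g: trapezoid(g["f"], g["x"]))
      .reset_index(name="integral")
      .to_numpy()
)
print(B)
```

One row per unique (y, z) pair comes out, with the integral in the third column, which matches the shape the question asks for.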

宁愿没拥抱 2025-01-18 16:56:39

A possible solution:

import numpy as np

A = np.array([[1, 1, 3],
              [1, 1, 5],
              [1, 1, 7],
              [1, 2, 3],
              [1, 2, 9],
              [2, 1, 5],
              [2, 2, 1],
              [2, 2, 8],
              [2, 2, 3]])

uniquePairs = np.unique(A[:,:2], axis=0)
output = np.empty((uniquePairs.shape[0], A.shape[1]))
for iPair, pair in enumerate(uniquePairs):
    output[iPair,:2] = pair
    output[iPair,2] = np.mean( A[np.logical_and(A[:,0]==pair[0], A[:,1]==pair[1]),2] )
    
print(output)

The output is

[[1. 1. 5.]
 [1. 2. 6.]
 [2. 1. 5.]
 [2. 2. 4.]]

There is also a more compact variation, but perhaps with less readability:

uniquePairs = np.unique(A[:,:2], axis=0)
output = np.array([[*pair, np.mean(A[np.logical_and(A[:,0]==pair[0], A[:,1]==pair[1]), 2])]
                   for pair in uniquePairs])
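
The per-pair loop can also be eliminated entirely with np.unique(..., return_inverse=True) plus np.bincount; a minimal sketch of that alternative (my addition, not part of the original answer):

```python
import numpy as np

A = np.array([[1, 1, 3], [1, 1, 5], [1, 1, 7],
              [1, 2, 3], [1, 2, 9], [2, 1, 5],
              [2, 2, 1], [2, 2, 8], [2, 2, 3]])

# Label each row by its (col0, col1) pair: inv[i] is the group index of row i.
uniquePairs, inv = np.unique(A[:, :2], axis=0, return_inverse=True)
inv = inv.ravel()  # guards against NumPy versions where inverse has an extra axis

# Grouped sums and counts in one pass each, then divide for the means.
sums = np.bincount(inv, weights=A[:, 2])
counts = np.bincount(inv)
B = np.column_stack([uniquePairs, sums / counts])
print(B)
```

This does two passes over the data regardless of how many groups there are, so it avoids the O(groups × rows) masking cost of the loop above.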

风启觞 2025-01-18 16:56:39

import numpy as np

A = np.array(
    [
        [1, 1, 3],
        [1, 1, 5],
        [1, 1, 7],
        [1, 2, 3],
        [1, 2, 9],
        [2, 1, 5],
        [2, 2, 1],
        [2, 2, 8],
        [2, 2, 3],
    ]
)

result = np.mean(A[:, 2], where=A[:, 0] == A[:, 1])

This might be what you're looking for. You can use A[:, n] to access a column.
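
Note that where= masks the reduction rather than grouping it: the call averages the third column over all rows where the first two columns are equal, producing a single scalar instead of one mean per pair. A quick check on the sample data:

```python
import numpy as np

A = np.array([[1, 1, 3], [1, 1, 5], [1, 1, 7],
              [1, 2, 3], [1, 2, 9], [2, 1, 5],
              [2, 2, 1], [2, 2, 8], [2, 2, 3]])

# Rows where col0 == col1 contribute values 3, 5, 7, 1, 8, 3,
# and np.mean collapses them into one scalar.
result = np.mean(A[:, 2], where=A[:, 0] == A[:, 1])
print(result)  # -> 4.5
```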

疯狂的代价 2025-01-18 16:56:39

numpy does not have built-in grouping tools. And since the groups differ in length, they require separate mean calls, so some level of iteration is needed.

A defaultdict is a handy way of grouping values:

In [64]: from collections import defaultdict
In [65]: dd = defaultdict(list)
In [66]: for row in A:
    ...:     dd[tuple(row[:2])].append(row[-1])
In [67]: dd
Out[67]: 
defaultdict(list,
            {(1, 1): [3, 5, 7],
             (1, 2): [3, 9],
             (2, 1): [5],
             (2, 2): [1, 8, 3]})
In [68]: {k: np.mean(v) for k, v in dd.items()}
Out[68]: {(1, 1): 5.0, (1, 2): 6.0, (2, 1): 5.0, (2, 2): 4.0}

We can create an array of the means with:

In [72]: np.array([k + (np.mean(v),) for k, v in dd.items()])
Out[72]: 
array([[1., 1., 5.],
       [1., 2., 6.],
       [2., 1., 5.],
       [2., 2., 4.]])

Some comparative times - with the usual caveat about scaling to larger arrays.

In [99]: %%timeit
    ...: dd = defaultdict(list)
    ...: for row in A:
    ...:     dd[tuple(row[:2])].append(row[-1])
    ...: np.array([k + (np.mean(v),) for k, v in dd.items()])
132 µs ± 92.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [97]: %%timeit
    ...: df = pd.DataFrame(A, columns=["id1", "id2", "value"])
    ...: B = df.groupby(["id1", "id2"])["value"].mean().reset_index().to_numpy()
2.27 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [102]: %%timeit
     ...: uniquePairs = np.unique(A[:, :2], axis=0)
     ...: output = np.ndarray((uniquePairs.shape[0], A.shape[1]))
     ...: for iPair, pair in enumerate(uniquePairs):
     ...:     output[iPair, :2] = pair
     ...:     output[iPair, 2] = np.mean(
     ...:         A[np.logical_and(A[:, 0] == pair[0], A[:, 1] == pair[1]), 2]
     ...:     )
279 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Another idea that needs more work to be fully functional, but has promise when scaling up, is:

In [106]: %%timeit
     ...: out = np.zeros((2, 2))
     ...: np.add.at(out, (A[:, 0] - 1, A[:, 1] - 1), A[:, -1])
     ...: cnt = np.zeros((2, 2))
     ...: np.add.at(cnt, (A[:, 0] - 1, A[:, 1] - 1), 1)
     ...: res = out / cnt
38.9 µs ± 62.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
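
One way to make this scatter-add idea fully functional for keys that are not small dense integers is to map each (col0, col1) pair to a group index with np.unique first; a possible sketch (my addition, not part of the original answer):

```python
import numpy as np

A = np.array([[1, 1, 3], [1, 1, 5], [1, 1, 7],
              [1, 2, 3], [1, 2, 9], [2, 1, 5],
              [2, 2, 1], [2, 2, 8], [2, 2, 3]])

# Map each (col0, col1) pair to a dense group index, so the accumulator
# arrays are sized by the number of unique pairs, not by the key range.
pairs, inv = np.unique(A[:, :2], axis=0, return_inverse=True)
inv = inv.ravel()

# Scatter-add values and counts per group, then divide for the means.
sums = np.zeros(len(pairs))
np.add.at(sums, inv, A[:, 2])
cnt = np.zeros(len(pairs))
np.add.at(cnt, inv, 1)
res = np.column_stack([pairs, sums / cnt])
print(res)
```

This keeps the unbuffered-accumulation trick from the timing above while dropping the assumption that the keys are 1-based indices into a 2×2 grid.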