
Posted 2025-02-10 04:25:14 · 2272 words · 1 view · 0 comments · Original


I have three 2D arrays with the same number of rows but different numbers of columns:

a = np.array([[1,2,3,4,5,6,7],[0,2,3,4,5,np.nan,np.nan]])
b = np.array([[1,2,3],[2,3,np.nan]])
c = np.array([[4,np.nan],[0,3]])

I want to find the union between b and c and then find the difference between this union and a. I want to keep the structure of my data, so the union and the difference should output a 2 dimensional array like this:

U = union(b,c)
U -> [[1,2,3,4,np.nan],[0,2,3,np.nan,np.nan]] # the result I want
# U[0] is equal to union(b[0],c[0]) 
# U[1] is equal to union(b[1],c[1])
D = Diff(a,U) 
D -> [[5,6,7],[4,5,np.nan]] # the result I want
# D[0] is equal to Diff(a[0],U[0])
# D[1] is equal to Diff(a[1],U[1])

So the union and difference must be performed row by row, between the corresponding subarrays. However, when I use np.union1d(b,c) or np.setdiff1d(a,U) I get a flattened array.

The code below works with this example, but it's too slow and I wonder if it could be improved. It also doesn't keep a rectangular shape, which is problematic for vectorized operations (and maybe there are other problems that I didn't see):

C = np.concatenate((b,c),axis=1)
U = np.unique(C,axis=1)
D = np.array([np.setdiff1d(a[i],U[i]) for i in range(len(a))])

I have tried to use view() so that np.setdiff1d and np.union1d treat each row as a single item and the 2d shape is kept, like this:

nrows, ncols = a.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [a.dtype]}
A = a.copy().view(dtype)

nrows, ncols = b.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [b.dtype]}
B = b.copy().view(dtype)

# then I try np.union1d() with these newly created arrays and I get an error
>>> np.union1d(A,B)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 180, in union1d
  File "/home/abbesses/grand/miniconda3/envs/Decomp/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 777, in union1d
    return unique(np.concatenate((ar1, ar2), axis=None))
  File "<__array_function__ internals>", line 180, in concatenate
TypeError: invalid type promotion with structured datatype(s).

Question: what can I do to perform these set operations with numpy on multidimensional arrays?

Note: the np.nan values in the various arrays are padding that keeps them rectangular. For example, if the longest row of an array has 10 entries, every other row must also have length 10 to allow vectorized calculation later, so shorter rows are filled with np.nan.


I have found a way to make my output rectangular by doing this:

C = np.concatenate((b,c),axis=1)
U = np.unique(C,axis=1)
D = np.array([np.setdiff1d(a[i],U[i]) for i in range(len(a))])
maxNei = max(map(len, D)) # the maximum length of D array
D = [np.concatenate((k,[np.nan]*(maxNei - len(k)))) for k in D]


花开雨落又逢春i 2025-02-17 04:25:14

np.union1d works on a pair of structured arrays, provided they have the same dtype:

In [105]: dt = np.dtype('f,f,f')

In [106]: np.union1d(np.ones(3,dt), np.array([(1,2,3),(4,5,6)],dt))
array([(1., 1., 1.), (1., 2., 3.), (4., 5., 6.)],
      dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])

Note that it treats each record as an 'entity'; here it collapsed the repeated (1,1,1) records into one. It is still a 1d union.

This is equivalent to using a Python set on lists of tuples:

In [118]: (np.ones(3,dt), np.array([(1,2,3),(4,5,6)],dt))
(array([(1., 1., 1.), (1., 1., 1.), (1., 1., 1.)],
       dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')]),
 array([(1., 2., 3.), (4., 5., 6.)],
       dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')]))

In [119]: (np.ones(3,dt).tolist(), np.array([(1,2,3),(4,5,6)],dt).tolist())
([(1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)],
 [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])

In [120]: set(Out[119][0]).union(set(Out[119][1]))
Out[120]: {(1.0, 1.0, 1.0), (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)}


Some general thoughts on numpy and set behavior.

Those functions that you tried to use like np.union1d are marked 1d for a reason.

np.unique can work on 2d (with an axis argument), but it uses a little trick: the array is transformed into a 1d structured array, with one record per row. It then runs the ordinary 1d routine (_unique1d internally), where each record, an np.void object, behaves like a multibyte "number". As such it can be sorted and tested for uniqueness like any other 1d array.

union1d just does: unique(np.concatenate((ar1, ar2), axis=None))

Make two 2d int arrays:

In [213]: x, y = np.ones((2,3),int), np.zeros((4,3),int)

In [217]: np.unique(x, axis=0)
Out[217]: array([[1, 1, 1]])

Make structured views of x and y; note that each row collapses to a single record, so the shapes become (2,1) and (4,1):

In [218]: dt = np.dtype('i,i,i')  
In [219]: x.view(dt)
array([[(1, 1, 1)],
       [(1, 1, 1)]], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

In [220]: y.view(dt)
array([[(0, 0, 0)],
       [(0, 0, 0)],
       [(0, 0, 0)],
       [(0, 0, 0)]], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

Now union1d works (because np.concatenate((x.view(dt),y.view(dt))) works):

In [221]: np.union1d(x.view(dt),y.view(dt))
array([(0, 0, 0), (1, 1, 1)],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
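The `'i,i,i'` dtype above lines up with the rows only when the integer arrays are 32-bit, as the `<i4` fields in the output show. A minimal sketch of the same idea that builds the record dtype from the array's own dtype is given below; `rows_as_records` is a hypothetical helper name, not a numpy function, and the explicit `np.int64` is an assumption chosen to make the example platform independent.

```python
import numpy as np

def rows_as_records(arr):
    """View each row of a 2d array as one structured record (hypothetical helper)."""
    arr = np.ascontiguousarray(arr)
    # One field per column, each with the array's own dtype, so the record
    # size always equals the row size whatever the platform's default int is.
    dt = np.dtype([(f"f{i}", arr.dtype) for i in range(arr.shape[1])])
    return arr.view(dt).ravel()

x = np.ones((2, 3), dtype=np.int64)
y = np.zeros((4, 3), dtype=np.int64)

# union1d now sees two 1d arrays of records with identical dtype
records = np.union1d(rows_as_records(x), rows_as_records(y))

# Reinterpret the records as plain integers and restore the column count
union_rows = records.view(np.int64).reshape(-1, x.shape[1])
print(union_rows)
```

This still treats whole rows as the set elements, so it gives a union of rows, not the per-row union of values asked for in the question.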

I haven't fully tried to understand your goals, but a couple of things stand out.

Your arrays are float dtype (because of the np.nan), with shapes (2,7), (2,3), and (2,2). Testing for equality and order with np.nan is tricky, even impossible, because nan does not compare equal to anything, including itself. Testing floats for equality is also unreliable in general, though it may be ok with these values.

Let's test unique on b:

In [236]: b
array([[ 1.,  2.,  3.],
       [ 2.,  3., nan]])

In [237]: np.repeat(b,3,0)
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 2.,  3., nan],
       [ 2.,  3., nan],
       [ 2.,  3., nan]])

In [238]: np.unique(np.repeat(b,3,0),axis=0)
array([[ 1.,  2.,  3.],
       [ 2.,  3., nan],
       [ 2.,  3., nan],
       [ 2.,  3., nan]])

It didn't remove the duplicates involving nan.
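One possible workaround, sketched here under the assumption that some value never occurs in the real data, is to swap nan for a sentinel before calling np.unique and swap it back afterwards. The choice of `np.inf` as `SENTINEL` is purely illustrative.

```python
import numpy as np

b = np.array([[1, 2, 3], [2, 3, np.nan]])
rep = np.repeat(b, 3, axis=0)

SENTINEL = np.inf  # assumption: inf never appears as a genuine value

# Replace nan by a value that compares equal to itself
filled = np.where(np.isnan(rep), SENTINEL, rep)
uniq = np.unique(filled, axis=0)

# Restore nan in the deduplicated rows
uniq[uniq == SENTINEL] = np.nan
print(uniq)
```

With the sentinel in place, rows that differ only by nan position are compared like any other rows, so the repeated `[2, 3, nan]` rows collapse into one.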

Since you want to do the union row by row, this row iteration makes most sense:

In [239]: [np.union1d(i,j) for i,j in zip(b,c)]
Out[239]: [array([ 1.,  2.,  3.,  4., nan]), array([ 0.,  2.,  3., nan])]

Yes, Python iteration is BAD, but what if the alternative is nothing? Remember that union1d is 1d. The difference can be done the same way:

In [241]: [np.setdiff1d(i,j) for i,j in zip(a, Out[239])]
Out[241]: [array([5., 6., 7.]), array([ 4.,  5., nan])]
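Since the question asks for rectangular output, the row-wise results can be padded back with nan. The following is a minimal sketch, assuming (as the question's note says) that nan is only padding and can be dropped before the set operations; `drop_nan` and `pad_rows` are hypothetical helper names.

```python
import numpy as np

def drop_nan(row):
    """Return the row without its nan padding (hypothetical helper)."""
    row = np.asarray(row, dtype=float)
    return row[~np.isnan(row)]

def pad_rows(rows, fill=np.nan):
    """Stack 1d arrays of different lengths into one 2d array, padding with `fill`."""
    width = max((len(r) for r in rows), default=0)
    out = np.full((len(rows), width), fill, dtype=float)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

a = np.array([[1, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, np.nan, np.nan]])
b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

# Row-by-row set operations on the unpadded values
U_rows = [np.union1d(drop_nan(i), drop_nan(j)) for i, j in zip(b, c)]
D_rows = [np.setdiff1d(drop_nan(i), u) for i, u in zip(a, U_rows)]

U = pad_rows(U_rows)
D = pad_rows(D_rows)
print(U)
print(D)
```

Dropping nan first means the result does not depend on how a given numpy version treats nan inside np.unique; the padding is only as wide as the longest result row.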

A variant on the union1d is to join b,c horizontally, and do a row by row unique:

In [266]: [np.unique(i) for i in (np.hstack((b,c)))]
Out[266]: [array([ 1.,  2.,  3.,  4., nan]), array([ 0.,  2.,  3., nan])]

Since the rows have different numbers of elements, a "vectorized" version is impossible.
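The ragged result is what rules out a plain 2d output. If keeping the values in their original positions is acceptable, the membership test itself can be written with broadcasting and the removed entries overwritten with nan. This is only a sketch under that assumption, and it builds an intermediate boolean array of shape (rows, cols of a, cols of b plus c), so it may not suit large inputs.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, np.nan, np.nan]])
b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

bc = np.hstack((b, c))  # per-row candidates for the union, shape (2, 5)

# in_bc[i, j] is True when a[i, j] equals some element of bc[i]
in_bc = (a[:, :, None] == bc[:, None, :]).any(axis=2)

# nan never compares equal, so the padding in `a` is excluded explicitly
keep = ~in_bc & ~np.isnan(a)

# Same shape as `a`; everything that is not in the difference becomes nan
D_masked = np.where(keep, a, np.nan)
print(D_masked)
```

The output has the shape of `a` with the surviving values left in place, so it is not left-aligned like the `D` shown in the question.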

setdiff1d makes use of in1d, which in turn makes the 2 arrays unique, concatenates them, and then does an argsort. Lots more sorting :)


Python set uses hashing, as used with dict. In this small case it's actually faster:

In [247]: timeit [set(i).union(j) for i,j in zip(b,c)]
10.2 µs ± 77.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [248]: timeit [np.union1d(i,j) for i,j in zip(b,c)]
101 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The numpy set code, being based on sorting, is not particularly fast.

Continuing with the sets:

In [258]: [set(j).difference(i) for i,j in zip(Out[246],a)]
Out[258]: [{5.0, 6.0, 7.0}, {nan, nan, 4.0, 5.0}]
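The second set above keeps two nan entries because each nan taken from the array is a separate object that compares unequal to every other nan. A minimal sketch that avoids this, assuming nan is only padding and can be discarded, filters it out before building the sets; `row_set` is a hypothetical helper name.

```python
import numpy as np

def row_set(row):
    """Values of one row as a Python set, with nan padding dropped (hypothetical helper)."""
    row = np.asarray(row, dtype=float)
    return set(row[~np.isnan(row)].tolist())

a = np.array([[1, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, np.nan, np.nan]])
b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

# Per-row union of b and c, then per-row difference against a
unions = [row_set(i) | row_set(j) for i, j in zip(b, c)]
diffs = [row_set(i) - u for i, u in zip(a, unions)]
print(unions)
print(diffs)
```

Sets are unordered, so the values would need sorting (and padding with nan) before being put back into a rectangular array.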