How to use np.setdiff1d and np.union1d on 2D arrays while keeping the array shape

Posted 2025-02-10 04:25:14

I have three 2D arrays with the same number of rows but different numbers of columns:

a = np.array([[1,2,3,4,5,6,7],[0,2,3,4,5,np.nan,np.nan]])
b = np.array([[1,2,3],[2,3,np.nan]])
c = np.array([[4,np.nan],[0,3]])

I want to find the union between b and c and then find the difference between this union and a. I want to keep the structure of my data, so the union and the difference should output a 2 dimensional array like this:

U = union(b,c)
U -> [[1,2,3,4,np.nan],[0,2,3,np.nan,np.nan]] # the result I want
# U[0] is equal to union(b[0],c[0]) 
# U[1] is equal to union(b[1],c[1])
...
D = Diff(a,U) 
D -> [[5,6,7],[4,5,np.nan]] # the result I want
# D[0] is equal to Diff(a[0],U[0])
# D[1] is equal to Diff(a[1],U[1])
...

So the union and difference must be performed between subarrays. However, when I use np.union1d(b,c) or np.setdiff1d(a,U) I get a flattened array.

The code below works with this example, but it is too slow and I wonder if it could be improved. It also doesn't keep a rectangular shape, which is problematic for vectorized operations (and maybe there are other problems that I didn't see):

C = np.concatenate((b,c),axis=1)
U = np.unique(C,axis=1)
D = np.array([np.setdiff1d(a[i],U[i]) for i in range(len(a))])

I have tried to use ndarray.view() so that np.setdiff1d and np.union1d read each row as if it were an individual element, to keep the 2d shape, like this:

nrows, ncols = a.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [a.dtype]}
A = a.copy().view(dtype)

nrows, ncols = b.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [b.dtype]}
B = b.copy().view(dtype)

# then I try np.union1d() with these newly created arrays and I get an error
>>> np.union1d(A,B)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 180, in union1d
  File "/home/abbesses/grand/miniconda3/envs/Decomp/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 777, in union1d
    return unique(np.concatenate((ar1, ar2), axis=None))
  File "<__array_function__ internals>", line 180, in concatenate
TypeError: invalid type promotion with structured datatype(s).

Question: what can I do to perform these set operations with numpy on multidimensional arrays?

Note: the np.nan values in the various arrays are there to keep them rectangular. For example, if the longest row of an array has 10 elements, every other row must also have length 10 to allow vectorized calculation later, so the shorter rows are padded with np.nan.

EDIT1:

I have found a way to make my output rectangular by doing this:

C = np.concatenate((b,c),axis=1)
U = np.unique(C,axis=1)
D = np.array([np.setdiff1d(a[i],U[i]) for i in range(len(a))])
maxNei = max(map(len, D)) # the maximum length of D array
D = [np.concatenate((k,[np.nan]*(maxNei - len(k)))) for k in D]

Comments (1)

花开雨落又逢春i 2025-02-17 04:25:14

union1d works on a pair of structured arrays - if they have the same dtype:

In [105]: dt = np.dtype('f,f,f')

In [106]: np.union1d(np.ones(3,dt), np.array([(1,2,3),(4,5,6)],dt))
Out[106]: 
array([(1., 1., 1.), (1., 2., 3.), (4., 5., 6.)],
      dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')])

Note that it treats each record as an 'entity'; here it removed the repeated (1,1,1). It is still a 1d union.

This is equivalent to using a Python set on lists of tuples:

In [118]: (np.ones(3,dt), np.array([(1,2,3),(4,5,6)],dt))
Out[118]: 
(array([(1., 1., 1.), (1., 1., 1.), (1., 1., 1.)],
       dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')]),
 array([(1., 2., 3.), (4., 5., 6.)],
       dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<f4')]))

In [119]: (np.ones(3,dt).tolist(), np.array([(1,2,3),(4,5,6)],dt).tolist())
Out[119]: 
([(1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)],
 [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])

In [120]: set(Out[119][0]).union(set(Out[119][1]))
Out[120]: {(1.0, 1.0, 1.0), (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)}

edit

Some general thoughts on numpy and set behavior.

Those functions that you tried to use like np.union1d are marked 1d for a reason.

np.unique can work on 2d, but it uses a little trick. The array is transformed into a 1d structured array. It then uses its 1d routine (_unique1d), where each array record, or np.void object, is treated as a multibyte "number". As such it can be sorted and tested for uniqueness like any other 1d array.

union1d just does: unique(np.concatenate((ar1, ar2), axis=None))
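A minimal sketch of why the 2d shape is lost, using two small integer arrays made up for illustration (x and y here are not the arrays from the question): because of axis=None, both inputs are flattened before they are joined.

```python
import numpy as np

# Illustrative integer arrays (hypothetical, chosen to avoid NaN issues).
x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[4, 5], [0, 3]])

# axis=None flattens both inputs before joining, so the row structure is gone.
flat = np.concatenate((x, y), axis=None)
manual = np.unique(flat)

# union1d gives the same flat, sorted, de-duplicated result.
builtin = np.union1d(x, y)
print(builtin.ndim, builtin)
```

The result is always 1d, regardless of the shapes passed in, which is why a per-row loop is needed to get one union per row.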

Make two 2d int arrays:

In [213]: x, y = np.ones((2,3),int), np.zeros((4,3),int)

In [217]: np.unique(x, axis=0)
Out[217]: array([[1, 1, 1]])

Make views of x and y - note that it adds a dimension:

In [218]: dt = np.dtype('i,i,i')  
In [219]: x.view(dt)
Out[219]: 
array([[(1, 1, 1)],
       [(1, 1, 1)]], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

In [220]: y.view(dt)
Out[220]: 
array([[(0, 0, 0)],
       [(0, 0, 0)],
       [(0, 0, 0)],
       [(0, 0, 0)]], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

Now union1d works (because np.concatenate((x.view(dt),y.view(dt))) works):

In [221]: np.union1d(x.view(dt),y.view(dt))
Out[221]: 
array([(0, 0, 0), (1, 1, 1)],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

I haven't fully tried to understand your goals, but a couple of things stand out.

Your arrays are float dtype (because of the np.nan), with shapes (2,7), (2,3), and (2,2). Testing for equality and order with np.nan is tricky, because nan != nan. Exact equality tests on floats are also unreliable in general, though they may be fine for these whole-number floats.

Let's test unique on b:

In [236]: b
Out[236]: 
array([[ 1.,  2.,  3.],
       [ 2.,  3., nan]])

In [237]: np.repeat(b,3,0)
Out[237]: 
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 2.,  3., nan],
       [ 2.,  3., nan],
       [ 2.,  3., nan]])

In [238]: np.unique(np.repeat(b,3,0),axis=0)
Out[238]: 
array([[ 1.,  2.,  3.],
       [ 2.,  3., nan],
       [ 2.,  3., nan],
       [ 2.,  3., nan]])

It didn't remove the duplicates involving nan.

Since you want to do the union row by row, this row iteration makes most sense:

In [239]: [np.union1d(i,j) for i,j in zip(b,c)]
Out[239]: [array([ 1.,  2.,  3.,  4., nan]), array([ 0.,  2.,  3., nan])]

Yes, Python iteration is BAD, but what if there is no alternative? Remember that union1d is 1d.

In [241]: [np.setdiff1d(i,j) for i,j in zip(a, Out[239])]
Out[241]: [array([5., 6., 7.]), array([ 4.,  5., nan])]

A variant on the union1d is to join b,c horizontally, and do a row by row unique:

In [266]: [np.unique(i) for i in (np.hstack((b,c)))]
Out[266]: [array([ 1.,  2.,  3.,  4., nan]), array([ 0.,  2.,  3., nan])]

Since the rows have different numbers of elements, a "vectorized" version is impossible.
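The row loop can still be wrapped so that the result is rectangular again. The following is a sketch under one assumption taken from the question's note: NaN is only padding, so it is dropped before each set operation and added back at the end. The helper names (pad_rows, rowwise_union, rowwise_setdiff) are hypothetical.

```python
import numpy as np

def pad_rows(rows, fill=np.nan):
    """Stack 1d arrays of unequal length into a 2d array, padding on the right."""
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill, dtype=float)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

def rowwise_union(b, c):
    rows = []
    for rb, rc in zip(b, c):
        vals = np.concatenate((rb, rc))
        # Assumption: NaN is padding, so drop it before de-duplicating.
        rows.append(np.unique(vals[~np.isnan(vals)]))
    return pad_rows(rows)

def rowwise_setdiff(a, u):
    rows = []
    for ra, ru in zip(a, u):
        rows.append(np.setdiff1d(ra[~np.isnan(ra)], ru[~np.isnan(ru)]))
    return pad_rows(rows)

a = np.array([[1, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, np.nan, np.nan]])
b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

U = rowwise_union(b, c)    # rows [1, 2, 3, 4] and [0, 2, 3, nan]
D = rowwise_setdiff(a, U)  # rows [5, 6, 7] and [4, 5, nan]
print(U)
print(D)
```

Because NaN is treated as padding rather than as a set member, U has one column fewer than the U shown in the question, while D matches the requested result. The loop is still plain Python; only the padding step restores the rectangular shape.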

setdiff1d makes use of in1d, which in turn makes the 2 arrays unique, concatenates them, and then does an argsort. Lots more sorting :)

set

Python set uses hashing, as used with dict. In this small case it's actually faster:

In [247]: timeit [set(i).union(j) for i,j in zip(b,c)]
10.2 µs ± 77.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [248]: timeit [np.union1d(i,j) for i,j in zip(b,c)]
101 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

numpy set code, based on sorting, is not particularly fast.

Continuing with the sets:

In [258]: [set(j).difference(i) for i,j in zip(Out[246],a)]
Out[258]: [{5.0, 6.0, 7.0}, {nan, nan, 4.0, 5.0}]
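The repeated nan in the second set appears because nan is not equal to itself, so a hash-based set cannot reliably merge or remove it. A possible workaround, sketched here on the assumption that NaN is only padding in this data, is to filter it out before building each set (row_set is a hypothetical helper name):

```python
import math
import numpy as np

def row_set(row):
    """Build a Python set from one row, skipping NaN padding (hypothetical helper)."""
    return {float(v) for v in row if not math.isnan(v)}

a = np.array([[1, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, np.nan, np.nan]])
b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

# Per row: elements of a that are in neither b nor c.
diff_sets = [row_set(ra) - (row_set(rb) | row_set(rc))
             for ra, rb, rc in zip(a, b, c)]

# Sets are unordered, so sort before converting back to lists or arrays.
diff_rows = [sorted(s) for s in diff_sets]
print(diff_rows)
```

The rows can then be padded with np.nan to a common length if a rectangular array is needed afterwards.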