How to use np.setdiff1d and np.union1d on 2D arrays while keeping the array shape
I have three 2D arrays with the same number of rows but different numbers of columns:
a = np.array([[1,2,3,4,5,6,7],[0,2,3,4,5,np.nan,np.nan]])
b = np.array([[1,2,3],[2,3,np.nan]])
c = np.array([[4,np.nan],[0,3]])
I want to find the union between b and c and then find the difference between this union and a. I want to keep the structure of my data, so the union and the difference should output a 2 dimensional array like this:
U = union(b,c)
U -> [[1,2,3,4,np.nan],[0,2,3,np.nan,np.nan]] # the result I want
# U[0] is equal to union(b[0],c[0])
# U[1] is equal to union(b[1],c[1])
...
D = Diff(a,U)
D -> [[5,6,7],[4,5,np.nan]] # the result I want
# D[0] is equal to Diff(a[0],U[0])
# D[1] is equal to Diff(a[1],U[1])
...
So the union and difference must be performed between subarrays. However, when I use np.union1d(b,c) or np.setdiff1d(a,U) I get a flattened array.
The code below works with this example but it's too slow and I wonder if it could be improved. It also doesn't keep a rectangular shape, which is problematic for the vectorized operations (and maybe there are other problems that I didn't see):
C = np.concatenate((b,c),axis=1)
U = np.unique(C,axis=1)
D = np.array([np.setdiff1d(a[i],U[i]) for i in range(len(a))])
I have tried to use view() with a structured dtype so that np.setdiff1d and np.union1d read each row as if it were an individual element, to keep the 2D shape, like this:
nrows, ncols = a.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [a.dtype]}
A = a.copy().view(dtype)
nrows, ncols = b.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [b.dtype]}
B = b.copy().view(dtype)
# then I try np.union1d() with these newly created arrays and I get an error
>>> np.union1d(A,B)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 180, in union1d
File "/home/abbesses/grand/miniconda3/envs/Decomp/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 777, in union1d
return unique(np.concatenate((ar1, ar2), axis=None))
File "<__array_function__ internals>", line 180, in concatenate
TypeError: invalid type promotion with structured datatype(s).
Question: what can I do to perform these set operations with numpy on multidimensional arrays?
Note: the np.nan values in the various arrays are there to keep them rectangular. For example, if the longest row of an array has 10 entries, every other row must also have length 10 to allow vectorized calculation later, so the shorter rows are filled with np.nan.
EDIT1:
I have found a way to make my output rectangular by doing this:
C = np.concatenate((b,c),axis=1)
U = np.unique(C,axis=1)
D = np.array([np.setdiff1d(a[i],U[i]) for i in range(len(a))])
maxNei = max(map(len, D)) # the maximum length of D array
D = [np.concatenate((k,[np.nan]*(maxNei - len(k)))) for k in D]
Comments (1)
np.union1d works on a pair of structured arrays, provided they have the same dtype. Note that it treats each record as an 'entity', so multiple (1,1,1) records are reduced to a single one. It is still a 1d union. This is equivalent to using a Python set on lists of tuples.

Edit

Some general thoughts on numpy and set behavior.

Those functions that you tried to use, like np.union1d, are marked 1d for a reason. np.unique can work on 2d, but it uses a little trick: the array is transformed into a 1d structured array. It then uses unique1d, where each array record, or np.void object, is a multibyte "number". As such it can be sorted and tested for uniqueness like any other 1d array.

union1d just does: unique(np.concatenate((ar1, ar2), axis=None))
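As a minimal sketch of the structured-array behaviour described above: the arrays x and y, the dtype dt, and its field names f0..f2 are made up for illustration and are not the arrays from the original answer. Viewing each row as one record lets np.union1d treat rows as entities, much like a Python set of tuples would.

```python
import numpy as np

# Two small 2d int arrays (illustrative values, not from the original answer).
x = np.ones((3, 3), dtype=int)
y = np.zeros((2, 3), dtype=int)

# One record per row: three int fields with hypothetical names f0..f2.
dt = np.dtype([("f0", int), ("f1", int), ("f2", int)])

xv = x.view(dt)  # shape (3, 1): the view keeps a trailing dimension
yv = y.view(dt)  # shape (2, 1)

# union1d flattens both views, concatenates them, and keeps unique records.
u = np.union1d(xv, yv)

# Back to a plain 2d int array, one row per unique record.
u2d = u.view(int).reshape(-1, 3)

# The same idea with plain Python: a set of row tuples.
as_set = set(map(tuple, x.tolist())) | set(map(tuple, y.tolist()))
```

Both arrays must share exactly the same structured dtype; views built with different numbers of fields (as in the question) cannot be concatenated, which is what the TypeError reports.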
Make two 2d int arrays, x and y, and make views of them with a structured dtype dt; note that the view adds a dimension. Now union1d works, because np.concatenate((x.view(dt), y.view(dt))) works.

I haven't fully tried to understand your goals, but a couple of things stand out.

Your arrays are float dtype (because of the np.nan), with shapes (2,7), (2,3), and (2,2). Testing for equality and order with np.nan is tricky, even impossible. Testing floats is also unreliable, though it may be ok with these floats.

When I tested unique on b, it didn't remove the duplicates involving nan.

Since you want to do the union row by row, iterating over the rows makes the most sense.
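A minimal sketch of such a row iteration, using b and c from the question. It assumes that np.nan is only padding, so it is dropped before the union and added back afterwards; the helper names drop_nan and pad_rows are hypothetical and not part of numpy.

```python
import numpy as np

b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

def drop_nan(row):
    """Return the row without its NaN padding."""
    return row[~np.isnan(row)]

def pad_rows(rows, fill=np.nan):
    """Stack 1d arrays of different lengths into one 2d array, padded with fill."""
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill, dtype=float)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

# One 1d union per row; NaN never reaches np.union1d.
union_rows = [np.union1d(drop_nan(rb), drop_nan(rc)) for rb, rc in zip(b, c)]
U = pad_rows(union_rows)
```

Removing the NaN values first sidesteps the question of how np.unique compares them, which can differ between numpy versions.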
Yes, python iteration is BAD, but what if the alternative is nothing? Remember union1d is 1d.

A variant on the union1d is to join b,c horizontally, and do a row by row unique.

Since the rows have different numbers of elements, a "vectorized" version is impossible.
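The horizontal-join variant mentioned above could look like the sketch below. It again assumes that NaN is only padding, and it writes each row's unique values into a NaN-filled array with the same shape as the joined array, which happens to reproduce the U the question asks for.

```python
import numpy as np

b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

# Join b and c side by side: shape (2, 5).
bc = np.concatenate((b, c), axis=1)

# Start from an all-NaN array and fill in each row's sorted unique values.
U = np.full(bc.shape, np.nan)
for i, row in enumerate(bc):
    vals = np.unique(row[~np.isnan(row)])  # 1d unique, NaN padding removed first
    U[i, :len(vals)] = vals
```

The loop is still over rows; only the padding step is done with a preallocated array.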
setdiff1d makes use of in1d, which in turn makes the 2 arrays unique, concatenates them, and then does an argsort. Lots more sorting :)

set

Python set uses hashing, as used with dict. In this small case it's actually faster; the numpy set code, based on sorting, is not particularly fast.

The rest can be done by continuing with set.
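A minimal sketch of the set-based approach for the whole task, using the arrays from the question. The helper row_set is hypothetical, and the code assumes NaN is only padding. No timings are claimed here; whether this beats the numpy version depends on the array sizes.

```python
import math
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, np.nan, np.nan]])
b = np.array([[1, 2, 3], [2, 3, np.nan]])
c = np.array([[4, np.nan], [0, 3]])

def row_set(row):
    """Python set of the non-NaN values in one row."""
    return {v for v in row.tolist() if not math.isnan(v)}

diff_rows = []
for ra, rb, rc in zip(a, b, c):
    union = row_set(rb) | row_set(rc)               # row-wise union of b and c
    diff_rows.append(sorted(row_set(ra) - union))   # values of a not in the union

# Pad the ragged result back to a rectangular array with NaN.
width = max(len(r) for r in diff_rows)
D = np.full((len(diff_rows), width), np.nan)
for i, r in enumerate(diff_rows):
    D[i, :len(r)] = r
```

For the example arrays this gives the D requested in the question, with rows [5, 6, 7] and [4, 5, nan].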