Get intersecting rows across two 2D numpy arrays
I want to get the intersecting (common) rows across two 2D numpy arrays. E.g., if the following arrays are passed as inputs:
array([[1, 4],
       [2, 5],
       [3, 6]])
array([[1, 4],
       [3, 6],
       [7, 8]])
the output should be:
array([[1, 4],
       [3, 6]])
I know how to do this with loops. I'm looking for a Pythonic/Numpy way to do this.
For short arrays, using sets is probably the clearest and most readable way to do it.
Another way is to use numpy.intersect1d. You'll have to trick it into treating each row as a single value, though, which makes things a bit less readable. For large arrays, this should be considerably faster than using sets.
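The code from this answer did not survive on this page, so the following is only a minimal sketch of the numpy.intersect1d trick described above, not the author's original code. It assumes both arrays share the same dtype and number of columns; the names row_dtype, A_rows and B_rows are made up for illustration.

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

ncols = A.shape[1]
# One field per column, so a whole row becomes a single structured element.
row_dtype = np.dtype({
    "names": [f"f{i}" for i in range(ncols)],
    "formats": ncols * [A.dtype],
})

# view() reinterprets the same memory, so the data must be C-contiguous
# and A and B must have the same dtype (assumed here).
A_rows = np.ascontiguousarray(A).view(row_dtype)
B_rows = np.ascontiguousarray(B).view(row_dtype)

# intersect1d now compares whole rows instead of individual numbers.
common = np.intersect1d(A_rows, B_rows)

# Turn the structured result back into an ordinary 2D array.
result = common.view(A.dtype).reshape(-1, ncols)
print(result)
```

The result comes back sorted and without duplicate rows, because numpy.intersect1d sorts its inputs internally.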
You could use Python's sets:
As Rob Cowie points out, this can be done more concisely as
There's probably a way to do this without all the going back and forth from arrays to tuples, but it's not coming to me right now.
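The snippets from this answer are missing on this page. As a rough sketch of the set-based idea (an illustration, not the original code), each row can be converted to a hashable tuple so that Python's set intersection works on whole rows:

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

# Rows are not hashable as arrays, so convert each one to a tuple first.
common = set(map(tuple, A)) & set(map(tuple, B))

# Sets are unordered; sorting gives a deterministic row order.
result = np.array(sorted(common))
print(result)
```

Note that when there are no common rows, np.array(sorted(common)) has shape (0,) rather than (0, 2), so a caller may want to handle that case separately.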
I could not understand why no pure numpy way to do this had been suggested, so I found one that uses numpy broadcasting. The basic idea is to transform one of the arrays to 3D by swapping axes. Let's construct two arrays:
In my run this gave:
The steps are (the arrays can be interchanged):
As a two-line function, to reduce memory use (correct me if I'm wrong):
which gave this result for my example:
This is faster than the set solutions, as it uses only simple numpy operations while constantly reducing dimensions, and it is ideal for two big matrices. I may have made mistakes in my comments, as I got to the answer by experimentation and instinct. The equivalent for column intersection can be found either by transposing the arrays or by changing the steps a little. Also, if duplicates are wanted, the steps inside "//" have to be skipped. The function can be edited to return only the boolean array of the indices, which came in handy for me while trying to index different arrays with the same vector. Benchmark of the voted answer against mine (the number of elements in each dimension plays a role in which one to choose):
Code:
with results:
The verdict is that if you have to compare two big 2D arrays of 2D points, use the voted answer. If you have big matrices in all dimensions, the voted answer is the best one by all means. So it depends on your case each time.
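The function and benchmark from this answer are not preserved here. The sketch below only illustrates the broadcasting idea it describes and is not the author's code: intersect_rows and its unique flag are hypothetical names, and np.newaxis indexing stands in for the axis swapping mentioned above.

```python
import numpy as np

def intersect_rows(a, b, unique=True):
    """Return the rows of `a` that also occur in `b` (hypothetical helper)."""
    # (n, 1, k) compared with (1, m, k) broadcasts to an (n, m, k) boolean array.
    eq = a[:, np.newaxis, :] == b[np.newaxis, :, :]
    # A row of `a` matches a row of `b` when all k columns agree;
    # it is kept if it matches at least one row of `b`.
    mask = eq.all(axis=2).any(axis=1)
    rows = a[mask]
    # Dropping repeated rows is optional, as the answer notes for duplicates.
    return np.unique(rows, axis=0) if unique else rows

A = np.array([[1, 4], [2, 5], [3, 6], [1, 4]])
B = np.array([[1, 4], [3, 6], [7, 8]])

deduplicated = intersect_rows(A, B)
with_duplicates = intersect_rows(A, B, unique=False)
```

The intermediate comparison holds n * m * k booleans, so memory use grows with the product of the two row counts; that is worth checking before applying this to very large inputs.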
Numpy broadcasting
We can create a boolean mask using broadcasting, which can then be used to filter the rows in array A that are also present in array B.
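The original snippet is missing, so here is a minimal sketch of such a mask under the assumption that A and B have the same number of columns. The variable names matches, mask and pairs are illustrative only.

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

# matches[i, j] is True when row i of A equals row j of B.
matches = (A[:, None, :] == B[None, :, :]).all(axis=-1)

# A row of A is kept if it equals at least one row of B.
mask = matches.any(axis=1)
common = A[mask]

# The same comparison also gives the (row in A, row in B) index pairs.
pairs = np.argwhere(matches)
```

Unlike the sorting-based approaches, this keeps the rows in the order they appear in A and does not remove duplicates.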
Another way to achieve this, using a structured array:
Just for clarity, the structured view looks like this:
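Since the code is missing here, the following sketch shows what a structured view of the rows can look like and how it might be combined with np.intersect1d. It is an assumption-based illustration rather than the answer's code; it relies on A and B sharing a dtype and being C-contiguous, and uses return_indices to also report where each common row sits in the inputs.

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

# Empty field names are filled in by numpy as 'f0', 'f1', ...
row_dtype = np.dtype([("", A.dtype)] * A.shape[1])

# Each row is now one structured element, so the view has a single column.
A_view = A.view(row_dtype)
B_view = B.view(row_dtype)
print(A_view.dtype.names, A_view.shape)

# ia and ib index the (flattened) views, i.e. the row numbers in A and B.
common, ia, ib = np.intersect1d(A_view, B_view, return_indices=True)
result = common.view(A.dtype).reshape(-1, A.shape[1])
```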
This could also work
This of course assumes the rows are all the same length.
Without Index
Visit https://gist.github.com/RashidLadj/971c7235ce796836853fcf55b4876f3c
With Index
Visit https://gist.github.com/RashidLadj/bac71f3d3380064de2f9abe0ae43c19e
All the methods mentioned are slow when it comes to very large arrays. I found a way to do it using numpy. Since numpy only provides np.in1d for 1D arrays, we can encode the 2D array as a 1D array using Cantor pairing along axis 1. numpy's function can then be used.
For very large arrays, this is the fastest of the methods mentioned.
One important thing to remember is that Cantor pairing is only defined for non-negative integers. So if you have negative integers in your arrays, make them all non-negative first (for example by subtracting the overall minimum value from both arrays).
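The answer's code is not preserved, so this is a sketch of the encoding it describes, assuming non-negative integer data. cantor_pair and encode_rows are hypothetical helper names, and np.isin is used as the current equivalent of np.in1d. The pairing function is pi(a, b) = (a + b)(a + b + 1) / 2 + b, folded across the columns from left to right.

```python
import numpy as np

def cantor_pair(a, b):
    # Cantor pairing: maps each pair of non-negative integers to a unique integer.
    s = a + b
    return s * (s + 1) // 2 + b

def encode_rows(arr):
    # Fold the pairing across the columns so every row becomes one integer.
    arr = np.asarray(arr, dtype=np.int64)
    code = arr[:, 0]
    for col in range(1, arr.shape[1]):
        code = cantor_pair(code, arr[:, col])
    return code

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

codes_A = encode_rows(A)
codes_B = encode_rows(B)

# Membership test on the 1D codes, then select the matching rows of A.
mask = np.isin(codes_A, codes_B)
common = A[mask]
```

The codes grow roughly quadratically with each pairing step, so large values or many columns can overflow int64; the value range should be checked before relying on this.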
You can achieve this using numpy.intersect1d to find the common rows across two 2D numpy arrays. NumPy has no numpy.intersect function, and intersect1d takes no axis parameter, so each row has to be presented to it as a single element. Here's an example:
Output:
In this example, np.intersect1d is used to find the common elements between array1 and array2. assume_unique=True is a performance optimization that is only valid when each input array is known to contain no duplicates.
Then, the result is reshaped back into a 2D array to get the final output.
This approach provides a more Pythonic/Numpy way to find the intersecting rows across two 2D numpy arrays without using explicit loops.
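The example and output for this answer are missing. The sketch below is not that code; it first shows why calling np.intersect1d directly on the 2D arrays is not enough (the inputs are flattened, so individual numbers are compared), and then one possible row-wise variant that uses assume_unique=True. It assumes array1 and array2 share a dtype and contain no repeated rows; row_dtype is an illustrative name.

```python
import numpy as np

array1 = np.array([[1, 4], [2, 5], [3, 6]])
array2 = np.array([[1, 4], [3, 6], [7, 8]])

# Pitfall: 2D inputs are flattened, so this finds common numbers, not rows.
flat_common = np.intersect1d(array1, array2)
# Reshaping those numbers pairs them up arbitrarily and does not recover rows.
wrong = flat_common.reshape(-1, 2)

# Row-wise variant: view each row as one structured element first.
row_dtype = np.dtype([("", array1.dtype)] * array1.shape[1])
rows = np.intersect1d(
    array1.view(row_dtype),
    array2.view(row_dtype),
    assume_unique=True,  # only safe because neither input repeats a row
)
result = rows.view(array1.dtype).reshape(-1, array1.shape[1])
```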