Create an intersection of two or more 2D numpy arrays based on common values in one column

Posted on 2024-12-28 18:09:01

I have 3 numpy recarrays with the following structure.
The first column is some position (Integer) and the second column is a score (Float).

Input:

import numpy as np

a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)],
             dtype=[('position', '<i4'), ('score', '<f4')])

b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)],
             dtype=[('position', '<i4'), ('score', '<f4')])

c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)],
             dtype=[('position', '<i4'), ('score', '<f4')])

All 3 arrays contain the same number of elements.
I am looking for an efficient way to combine these three 2d arrays into one array based on the position column.

The output array for the example above should look like this:

Output:

output = np.array([(3, 12.32, 8.41, 7.41)],
                  dtype=[('position', '<i4'), ('score1', '<f4'), ('score2', '<f4'), ('score3', '<f4')])

Only the row with position 3 is in the output array because this position appears in all 3 input arrays.

Update: My naive approach would be the following steps:

Create a vector from the first column of each of my 3 input arrays.
Use np.intersect1d to get the intersection of these 3 vectors.
Somehow retrieve the indices of the common values in each of the 3 input arrays.
Create a new array from the filtered rows of the 3 input arrays.
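The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming each position occurs at most once within a single input array; the helper name scores_at and the variable out_dt are made up for illustration and are not part of the original post.

```python
import numpy as np
from functools import reduce

dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)

# Steps 1 and 2: intersect the position columns pairwise.
common = reduce(np.intersect1d, (a['position'], b['position'], c['position']))

def scores_at(arr, positions):
    """Hypothetical helper: scores of `arr` at the given positions.

    Assumes every requested position is present in `arr` exactly once.
    """
    order = np.argsort(arr['position'])
    idx = np.searchsorted(arr['position'], positions, sorter=order)
    return arr['score'][order[idx]]

# Steps 3 and 4: look up the rows and fill a structured output array.
out_dt = [('position', '<i4'), ('score1', '<f4'),
          ('score2', '<f4'), ('score3', '<f4')]
output = np.empty(len(common), dtype=out_dt)
output['position'] = common
output['score1'] = scores_at(a, common)
output['score2'] = scores_at(b, common)
output['score3'] = scores_at(c, common)
```

Passing sorter= to np.searchsorted avoids sorting the input arrays in place, so their original row order is left untouched.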

Update2:
Each position value can be in one, two or all three input arrays. In my output array I only want to include rows for position values which appear in all 3 input arrays.


緦唸λ蓇 2025-01-04 18:09:01

Here is one approach; I believe it should be reasonably fast. I think the first thing you want to do is count the number of occurrences of each position. This function will handle that:

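import numpy as np

def count_positions(positions):
    # Sort so that equal positions end up next to each other.
    positions = np.sort(positions)
    # diff[i] is True where positions[i] is the last element of a run of equal values.
    diff = np.ones(len(positions), 'bool')
    diff[:-1] = positions[1:] != positions[:-1]
    # Indices of the run ends; their consecutive differences are the run lengths.
    count = diff.nonzero()[0]
    count[1:] = count[1:] - count[:-1]
    count[0] += 1
    uniqPositions = positions[diff]
    return uniqPositions, count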
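For comparison, and assuming NumPy 1.9 or later, np.unique with return_counts=True should give the same unique values and counts as the function above. The sample positions here are the concatenated position columns from the question; the variable names are illustrative only.

```python
import numpy as np

# Position columns of a, b and c from the question, concatenated.
positions = np.array([1, 2, 3, 3, 6, 4, 3, 7, 1])

# Sorted unique values together with how often each one occurs.
uniq, counts = np.unique(positions, return_counts=True)

# Keep only the positions that occur in all three arrays.
common = uniq[counts == 3]
print(common)  # [3]
```

As with count_positions, a count of 3 only means "present in all three arrays" if no position is repeated within a single array.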

Now, using the function from above, you want to take only the positions that occur 3 times:

positions = np.concatenate((a['position'], b['position'], c['position']))
uinqPos, count = count_positions(positions)
uinqPos = uinqPos[count == 3]

We will be using searchsorted, so we sort a, b and c by position:

a.sort(order='position')
b.sort(order='position')
c.sort(order='position')

Now we can use searchsorted to find where in each array each of the positions in uinqPos is located:

new_array = np.empty((len(uinqPos), 4))
new_array[:, 0] = uinqPos
index = a['position'].searchsorted(uinqPos)
new_array[:, 1] = a['score'][index]
index = b['position'].searchsorted(uinqPos)
new_array[:, 2] = b['score'][index]
index = c['position'].searchsorted(uinqPos)
new_array[:, 3] = c['score'][index]

There might be a more elegant solution using dictionaries, but I thought of this one first so I'll leave that to someone else.
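One possible shape for such a dictionary-based variant is sketched below. It is not the answerer's code, it again assumes positions are unique within each array, and the names lookups, rows and out_dt are invented for the example. It is likely slower than the searchsorted version on large inputs because it loops in Python.

```python
import numpy as np

dt = [('position', '<i4'), ('score', '<f4')]
a = np.array([(1, 5.41), (2, 5.42), (3, 12.32)], dtype=dt)
b = np.array([(3, 8.41), (6, 7.42), (4, 6.32)], dtype=dt)
c = np.array([(3, 7.41), (7, 6.42), (1, 5.32)], dtype=dt)

# One position -> score dictionary per input array.
lookups = [dict(zip(arr['position'].tolist(), arr['score'].tolist()))
           for arr in (a, b, c)]

# Positions that appear as keys in every dictionary.
common = sorted(set(lookups[0]).intersection(*lookups[1:]))

# One row per common position: (position, score1, score2, score3).
rows = [(pos, *(lookup[pos] for lookup in lookups)) for pos in common]

out_dt = [('position', '<i4'), ('score1', '<f4'),
          ('score2', '<f4'), ('score3', '<f4')]
output = np.array(rows, dtype=out_dt)
```

Unlike new_array above, which is a plain 2D float array, this produces a structured array with the field names asked for in the question.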
