I have a batch of tensors of shape (bs, m, n) (i.e., each of the bs batches contains m vectors of length n). For each batch, I would like to calculate the Jaccard similarity of the first vector with each of the remaining (m-1) vectors.
Example:
a = [
[[3, 8, 6, 8, 7],
[9, 7, 4, 8, 1],
[7, 8, 8, 5, 7],
[3, 9, 9, 4, 4]],
[[7, 3, 8, 1, 7],
[3, 0, 3, 4, 2],
[9, 1, 6, 1, 6],
[2, 7, 0, 6, 6]]
]
Find the pairwise Jaccard similarity between a[:, 0, :] and a[:, 1:, :]
i.e.,
[3, 8, 6, 8, 7] with each of [[9, 7, 4, 8, 1], [7, 8, 8, 5, 7], [3, 9, 9, 4, 4]] (3 scores)
and
[7, 3, 8, 1, 7] with each of [[3, 0, 3, 4, 2], [9, 1, 6, 1, 6], [2, 7, 0, 6, 6]] (3 scores)
Here's the Jaccard function I have tried:
def js(la1, la2):
    combined = torch.cat((la1, la2))
    union, counts = combined.unique(return_counts=True)
    intersection = union[counts > 1]
    return torch.numel(intersection) / torch.numel(union)
While this works with unequal-sized tensors, the problem with this approach is that the number of uniques in each combination (pair of tensors) might be different and since PyTorch doesn't support jagged tensors, I'm unable to process batches of vectors at once.
If I'm not able to express the problem with the expected clarity, do let me know. Any help in this regard would be greatly appreciated.
EDIT: Here's the flow achieved by iterating over the 1st and 2nd dimensions. I wish to have a vectorised version of the below code for batch processing
import torch

bs = 2
m = 4
n = 5
a = torch.randint(0, 10, (bs, m, n))
print(f"Array is: \n{a}")

for bs_idx in range(bs):
    first = a[bs_idx, 0, :]
    for row in range(1, m):
        second = a[bs_idx, row, :]
        idx = js(first, second)
        print(f'comparing {first} and {second}: {idx}')
函数是原型功能,在通常的pytorch发行版中尚不可用,但是您可以使用Pytorch的夜间构建来将其获取。我没有测试过,但这可能起作用。I don't know how you could achieve this in pytorch, since AFAIK pytorch doesn't support set operations on tensors. In your
js()
implementation,union
calculation should work, butintersection = union[counts > 1]
doesn't give you the right result if one of the tensors contains duplicated values. Numpy on the other hand has built-on support withunion1d
andintersect1d
. You can use numpy vectorization to calculate pairwise jaccard indices without using for-loops:This is of course suboptimal because the code doesn't run on the GPU. If I run the
jaccard
function withvecs1
of shape(1, 10)
andvecs2
of shape(10_000, 10)
I get a mean loop time of200 ms ± 1.34 ms
on my machine, which should probably be fast enough for most use cases. And conversion between pytorch and numpy arrays is very cheap.To apply this function to your problem with array
a
:Use
torch.from_numpy()
on the result to get a pytorch tensor again if needed.Update:
If you need a pytorch version for calculating the Jaccard index, I partially implemented numpy's
intersect1d
in torch:To vectorize the
torch_jaccard1d
function, you might want to look intotorch.vmap
, which lets you vectorize a function over an arbitrary batch dimension (similar to numpy'svectorize
). Thevmap
function is a prototype feature and not yet available in the usual pytorch distributions, but you can get it using nightly builds of pytorch. I haven't tested it but this might work.您可以使用 numpy.unique , numpy.intersect1d , numpy.union1d 和变换推荐函数在这里用于计算python中的jaccard_simurility:
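The partial intersect1d port and the torch_jaccard1d built on it might look like this (a sketch; deduplicating each tensor first is an assumption, made so that repeated values inside one tensor don't inflate the intersection):

```python
import torch

def torch_intersect1d(t1: torch.Tensor, t2: torch.Tensor) -> torch.Tensor:
    # Deduplicate each side, then keep values that occur in both
    combined = torch.cat((t1.unique(), t2.unique()))
    values, counts = combined.unique(return_counts=True)
    return values[counts > 1]

def torch_jaccard1d(t1: torch.Tensor, t2: torch.Tensor) -> float:
    union = torch.cat((t1, t2)).unique()
    return torch_intersect1d(t1, t2).numel() / union.numel()
```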
You can use numpy.unique, numpy.intersect1d and numpy.union1d, and adapt the function recommended here, for computing jaccard_similarity in Python:
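A sketch along those lines, applied to the example array (the helper name jaccard_similarity and the looping scheme are assumed):

```python
import numpy as np

def jaccard_similarity(x, y):
    # |set intersection| / |set union| of the values in two 1-D arrays
    return np.intersect1d(x, y).size / np.union1d(x, y).size

a = np.array([[[3, 8, 6, 8, 7], [9, 7, 4, 8, 1], [7, 8, 8, 5, 7], [3, 9, 9, 4, 4]],
              [[7, 3, 8, 1, 7], [3, 0, 3, 4, 2], [9, 1, 6, 1, 6], [2, 7, 0, 6, 6]]])

# First row of each batch against the remaining rows -> (bs, m-1) scores
scores = np.array([[jaccard_similarity(batch[0], row) for row in batch[1:]]
                   for batch in a])
```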
Output:
You didn't tag numba, but if you want a fast approach for computing jaccard_similarity for data with shape (45_000, 110, 12), then I highly recommend you use numba with parallel=True. I get a run time of only 5 sec for random data with shape (45_000, 110, 12): (run time obtained on Colab)
I wrote the benchmark below for 50_000 batches with shape (4, 5) -> (50_000, 4, 5). We get 116 ms with numba, but the other approaches take 8 sec and 9 sec: (run time obtained on Colab)
Output: