How to perform operations on very large torch tensors without splitting them up


My Task:

I'm trying to calculate the pair-wise distance between every two samples in two big tensors (for k-Nearest-Neighbours). That is, given tensor test with shape (b1,c,h,w) and tensor train with shape (b2,c,h,w), I need ||test[i] - train[j]|| for every i,j (where both test[i] and train[j] have shape (c,h,w), as those are samples in the batch).

The Problem

Both train and test are very big, so I can't fit them into RAM.

My current solution

For a start, I did not construct these tensors in one go. As I build them, I split the data tensor and save the chunks separately to disk, so I end up with files {Test\test_1,...,Test\test_n} and {Train\train_1,...,Train\train_m}.
Then, in a nested for loop, I load every Test\test_i and Train\train_j, compute the current distance, and save it.

This semi-pseudo-code might explain it:

import torch

# forward slashes avoid the '\t' escape bug in f'Test\test_{i}'
test_files = [f'Test/test_{i}' for i in range(n)]
train_files = [f'Train/train_{j}' for j in range(m)]

# pair-wise L2 distances between two batches, flattened to (batch, c*h*w)
dist = lambda t1, t2: torch.cdist(t1.flatten(1), t2.flatten(1))

all_distances = []
for test_file in test_files:
    test_i = torch.load(test_file)        # shape (b1_i, c, h, w)
    dists_from_all_j = []
    for train_file in train_files:
        train_j = torch.load(train_file)  # shape (b2_j, c, h, w)
        dists_from_all_j.append(dist(test_i, train_j))
    # concatenate along the train dimension -> shape (b1_i, b2_total)
    all_distances.append(torch.cat(dists_from_all_j, dim=1))
# and now I can take the k-smallest from all_distances
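
To make that closing comment concrete, one way to take the k smallest per test sample (a sketch; the value of k and the (b1_i, b2_total) layout are assumptions following the loop above):

k = 5  # assumed value
distances = torch.cat(all_distances, dim=0)   # shape (b1_total, b2_total)
# k smallest distances per test sample, plus the indices of those train samples
values, indices = torch.topk(distances, k, dim=1, largest=False)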

What I thought might work

I came across the FAISS repository, in which they explain that this process can (maybe?) be sped up using their solutions, though I'm not quite sure how. Regardless, any approach would help!

2 Answers

德意的啸 2025-02-18 21:02:00


Did you check the FAISS documentation?

If what you need is the L2 norm (torch.cdist uses p=2 as the default), then it is quite straightforward. The code below is an adaptation of the FAISS docs to your example (note that the train set should be the indexed database and the test set the queries):

import faiss
import numpy as np

d = 64                           # dimension (c*h*w after flattening)
nb = 100000                      # database size (train samples)
nq = 10000                       # number of queries (test samples)
np.random.seed(1234)             # make reproducible

x_train = np.random.random((nb, d)).astype('float32')
x_train[:, 0] += np.arange(nb) / 1000.
x_test = np.random.random((nq, d)).astype('float32')
x_test[:, 0] += np.arange(nq) / 1000.

index = faiss.IndexFlatL2(d)     # build the exact L2 index
print(index.is_trained)
index.add(x_train)               # add the train vectors to the index
print(index.ntotal)

k = 100                          # take the 100 closest neighbors
D, I = index.search(x_test, k)   # D: distances, I: train indices, per test sample
print(I[:5])                     # neighbors of the 5 first queries
print(I[-5:])                    # neighbors of the 5 last queries
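
To connect this to the chunked files from the question: an IndexFlatL2 can be grown incrementally with repeated add calls, so the train set never has to sit in RAM all at once. A minimal sketch, reusing the hypothetical train_files/test_files from the question and an assumed sample shape:

import faiss
import torch

c, h, w = 3, 32, 32              # assumed sample shape; the index dimension is c*h*w
index = faiss.IndexFlatL2(c * h * w)

# add the train chunks one file at a time
for train_file in train_files:
    train_j = torch.load(train_file)                        # shape (b2_j, c, h, w)
    index.add(train_j.flatten(1).numpy().astype('float32'))

k = 100
all_indices = []
for test_file in test_files:
    test_i = torch.load(test_file)                          # shape (b1_i, c, h, w)
    D, I = index.search(test_i.flatten(1).numpy().astype('float32'), k)
    all_indices.append(I)                                   # k nearest train indices per test sample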
难理解 2025-02-18 21:02:00


In the end, I chose to implement a version of the Earth-Mover's Distance (EMD), as was suggested in the following ai.StackExchange post. Let me summarize the approach:

Given the task as described in "My Task" above, I defined

import torch

def cumsum_3d(test, train):
    # cumulative sums over the last three axes (c, h, w)
    for i in [-1, -2, -3]:
        test = torch.cumsum(test, i)
        train = torch.cumsum(train, i)
    return test, train

then, given the tensors test and train:

test,train = cumsum_3d(test,train)
dist = torch.cdist(test.flatten(1),train.flatten(1))
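
As a side note (my addition, not part of the original answer): the cumulative-sum trick rests on the fact that in 1-D the Earth-Mover's distance between two equal-mass histograms equals the L1 distance between their cumulative sums; cumsum_3d extends this heuristically to all three axes, and the cdist call then takes an L2 (rather than L1) norm over the cumulated maps. A quick 1-D sanity check:

import torch

# equal-mass 1-D histograms
p = torch.tensor([0.2, 0.5, 0.3])
q = torch.tensor([0.4, 0.1, 0.5])
# EMD in 1-D = L1 distance between the CDFs
emd = torch.sum(torch.abs(torch.cumsum(p, 0) - torch.cumsum(q, 0)))
print(emd)  # tensor(0.4000)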

For future viewers, bear in mind that:

  • I did not use FAISS because it does not currently support Windows, but most importantly because it does not support (as far as I know) this version of EMD, or any other distance over multidimensional tensors (shape (c,h,w), as in my example). To work around the RAM problem, I used Google Colab and sliced my data into more files.
  • This implementation was only relevant because I was dealing with shallow activation layers. If I were to use the last layer (avgpool) as my activations, it would have been fine not to use the EMD, as the output right after the avgpool has shape (512,).