More efficient way to calculate intersections?

Posted 2024-08-14 08:40:46

I have a list of 300000 lists (fiber tracks), where each track is a list of (x,y,z) tuples/coordinates:

tracks=
[[(1,2,3),(3,2,4),...]
 [(4,2,1),(5,7,3),...]
 ...
]

I also have a group of masks, where each mask is defined as a list of (x,y,z) tuples/coordinates:

mask_coords_list=
[[(1,2,3),(8,13,4),...]
 [(6,2,2),(5,7,3),...]
 ...
]

I am trying to find, for all possible pairs of masks:

  1. the number of tracks that intersect each mask-mask pair (to create a connectivity matrix)
  2. the subset of tracks that intersect each mask, in order to add 1 to each (x,y,z) coordinate for each track in the subset (to create a "density" image)

I'm currently doing part 1 like so:

from numpy import zeros
import itertools

def mask_connectivity_matrix(tracks, masks, masks_coords_list):
    connect_mat = zeros((len(masks), len(masks)))
    for track in tracks:
        cur = []
        for count, mask_coords in enumerate(masks_coords_list):
            if any(set(track) & set(mask_coords)):
                cur.append(count)
        for x, y in itertools.combinations(cur, 2):
            connect_mat[x, y] += 1

and part 2 like so:

def mask_tracks(tracks,masks,masks_coords_list):
    vox_tracks_img=zeros((xdim,ydim,zdim,len(masks)))
    for track in tracks:
        for count,mask in enumerate(masks_coords_list):
            if any(set(track) & set(mask)):
                for x,y,z in track:
                    vox_tracks_img[x,y,z,count] += 1

Using sets to find intersections has sped this process up significantly but both portions still take over an hour when I have a list of 70 or more masks. Is there a more efficient way to do this than iterating for each track?


薯片软お妹 2024-08-21 08:40:46

Linearize the voxel coordinates, and put them into two scipy.sparse CSC matrices.

Let v be the number of voxels, m the number of masks, and t the number of tracks.
Let M be the mask csc matrix, size (m x v), where a 1 at (i,j) means mask i overlaps voxel j.
Let T be the track csc matrix, size (t x v), where a 1 at (k,j) means track k overlaps voxel j.

Overlap = (M * T.transpose() > 0)  # track T overlaps mask M
Connected = (Overlap * Overlap.transpose() > 0)  # connected masks
Density[mask_idx] = numpy.take(T, nonzero(Overlap[mask_idx, :])[0], axis=0).sum(axis=0)

I might be wrong on the last one, and I'm not sure csc_matrices can be operated on by nonzero and take. You might need to pull out each row in a loop and convert it to a full matrix.
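The row-extraction fallback can be sketched with plain sparse row selection instead of take. This is a toy-sized illustration of the idea, not the answer's experiment: the sizes, random data, and variable names are stand-ins.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy sizes, purely illustrative
nvox, ntracks, nmask = 50, 6, 3
rng = np.random.default_rng(0)

# T: track k overlaps voxel j; M: mask i overlaps voxel j
T = csc_matrix((rng.random((ntracks, nvox)) < 0.2).astype(np.int8))
M = csc_matrix((rng.random((nmask, nvox)) < 0.3).astype(np.int8))

# (nmask, ntracks): True where a mask shares at least one voxel with a track
Overlap = (M @ T.T) > 0

# Density for one mask: pick the rows of T for tracks hitting it, sum over tracks
mask_idx = 0
track_ids = Overlap[mask_idx, :].nonzero()[1]
density = np.asarray(T[track_ids, :].sum(axis=0)).ravel()  # per-voxel counts
```

Looping `mask_idx` over all masks then fills the density image one mask at a time without ever densifying T.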


I ran some experiments trying to simulate what I thought was a reasonable amount of data. The code below takes about 2 minutes on a 2-year old MacBook. If you use csr_matrices, it takes about 4 minutes. There is probably a tradeoff depending on how long each track is.

from numpy import *
from scipy.sparse import csc_matrix

nvox = 1000000
ntracks = 300000
nmask = 100

# create about 100 entries per track
tcoords = random.uniform(0, ntracks, ntracks * 100).astype(int)
vcoords = random.uniform(0, nvox, ntracks * 100).astype(int)
d = ones(ntracks * 100)
T = csc_matrix((d,  vstack((tcoords, vcoords))), shape=(ntracks, nvox), dtype=bool)

# create around 10000 entries per mask
mcoords = random.uniform(0, nmask, nmask * 10000).astype(int)
vcoords = random.uniform(0, nvox, nmask * 10000).astype(int)
d = ones(nmask * 10000)
M = csc_matrix((d, vstack((mcoords, vcoords))), shape=(nmask, nvox), dtype=bool)

Overlap = (M * T.transpose()).astype(bool) # mask M overlaps track T
Connected = (Overlap * Overlap.transpose()).astype(bool) # mask M1 and M2 are connected
Density = Overlap * T.astype(float) # number of tracks overlapping mask M summed across voxels
始于初秋 2024-08-21 08:40:46

OK, I think I finally have something that will reduce the complexity. This code should really fly compared to what you've got.

It seems like first you need to know which tracks coincide with which masks, the incidence matrix.

import numpy
from collections import defaultdict

def by_point(sets):
    d = defaultdict(list)
    for i, s in enumerate(sets):
        for pt in s:
            d[pt].append(i)
    return d

def calc(xdim, ydim, zdim, mask_coords_list, tracks):
    masks_by_point = by_point(mask_coords_list)
    tracks_by_point = by_point(tracks)

    a = numpy.zeros((len(mask_coords_list), len(tracks)), dtype=int)
    for pt, maskids in masks_by_point.items():
        for trackid in tracks_by_point.get(pt, ()):
            a[maskids, trackid] = 1
    m = numpy.matrix(a)

The adjacency matrix you're looking for is m * m.T.

The code you have so far computes the upper triangle only. You can use triu to grab just that half.

    am = m * m.T  # calculate adjacency matrix
    am = numpy.triu(am, 1)  # keep only upper triangle
    am = am.A  # convert matrix back to array

The voxel calculation can use the incidence matrix too.

    vox_tracks_img = numpy.zeros((xdim, ydim, zdim, len(mask_coords_list)), dtype=int)
    for trackid, track in enumerate(tracks):
        for x, y, z in track:
            vox_tracks_img[x, y, z, :] += a[:,trackid]
    return am, vox_tracks_img

For me this runs in under a second for data sets having hundreds of masks and tracks.

If you have many points that appear in masks but are not on any tracks, it might be worthwhile to delete the entries for those points from masks_by_point before entering the loop.
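That pruning step could be sketched as a one-line dict comprehension over the answer's `by_point` output (the sample masks and tracks here are made-up inputs for illustration):

```python
from collections import defaultdict

def by_point(sets):
    # point -> list of indices of the sets containing it (as in the answer)
    d = defaultdict(list)
    for i, s in enumerate(sets):
        for pt in s:
            d[pt].append(i)
    return d

# Made-up sample data: (9, 9, 9) is in a mask but on no track
masks = [[(1, 2, 3), (9, 9, 9)], [(4, 2, 1)]]
tracks = [[(1, 2, 3), (3, 2, 4)], [(4, 2, 1), (5, 7, 3)]]

masks_by_point = by_point(masks)
tracks_by_point = by_point(tracks)

# Drop mask points that no track passes through before the incidence loop
masks_by_point = {pt: ids for pt, ids in masks_by_point.items()
                  if pt in tracks_by_point}
```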

最笨的告白 2024-08-21 08:40:46

If you stored each mask's set of points,
(1,2,3), (1,2,4), (1,3,1), as a dictionary like this: {1: [{2: set([3, 4])}, {3: set([1])}]}, you might end up being able to check for matches faster... but maybe not.

美人迟暮 2024-08-21 08:40:46

A minor optimization (same big-O, slightly smaller multiplier) can be had by removing redundant operations:

  1. don't call set so many times on each track and mask: call it once per track and once per mask, to set up auxiliary "parallel" lists of sets, then work on those
  2. if any(someset): is semantically the same as if someset: but a bit slower

Won't make a dramatic difference, but might minutely help.
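Both suggestions together might look like this sketch: build each set exactly once up front, then test the truthiness of the intersection directly (the sample data is made up for illustration):

```python
# Made-up sample data in the question's shape
tracks = [[(1, 2, 3), (3, 2, 4)], [(4, 2, 1), (5, 7, 3)]]
masks_coords_list = [[(1, 2, 3), (8, 13, 4)], [(6, 2, 2), (5, 7, 3)]]

track_sets = [set(t) for t in tracks]            # one set() call per track
mask_sets = [set(m) for m in masks_coords_list]  # one set() call per mask

# A non-empty set is truthy, so no any() wrapper is needed
overlapping = [(ti, mi)
               for ti, ts in enumerate(track_sets)
               for mi, ms in enumerate(mask_sets)
               if ts & ms]
```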

陌伤浅笑 2024-08-21 08:40:46

You can probably start by combining the two functions to create both results at once. Also there's no need to make a list of the combinations before looping, as it is already a generator, and that might save you some time.

def mask_connectivity_matrix_and_tracks(tracks, masks, masks_coords_list):
    connect_mat = zeros((len(masks), len(masks)))
    vox_tracks_img = zeros((xdim, ydim, zdim, len(masks)))
    for track in tracks:
        cur = []
        for count, mask_coords in enumerate(masks_coords_list):
            if any(set(track) & set(mask_coords)):
                cur.append(count)
                for x, y, z in track:
                    vox_tracks_img[x, y, z, count] += 1
        for x, y in itertools.combinations(cur, 2):
            connect_mat[x, y] += 1

Also, this will probably never be "fast" as in "finished before we die", so the best way is to eventually compile it with Cython as a c module for python.

情绪失控 2024-08-21 08:40:46

Lame to suggest yet another incremental improvement that might be made, I know, but:

Sets of small integers can be modeled as bit-vectors using Python's long ints. Suppose you replace each tuple with a small integer id, then convert each track and each set of mask-coords into a set of those small ids. You could represent those sets as long ints, making the intersection operation a bit faster (but not asymptotically faster).
