时间复杂度优于 O(n**2) 的成对比较算法

发布于 2025-01-11 11:57:02 字数 567 浏览 0 评论 0原文

我有大约 500,000 个 10 个单词的数组,即 500,000 个单词 10-grams。对于每个 10 克,我需要知道其余 499,999 个 10 克在哪些位置(如果有)具有相同的元素:

a = ['A', 'B', 'C', ' D', 'E', 'F', 'G', 'H', 'I', 'J']

b = ['A', 'M', 'C', 'M', 'E', 'M', 'G', 'M', 'I', 'M']

...

z = ['R', ' R', 'R', 'R', 'R', 'F', 'G', 'H', 'I', 'J']

如果我们对两个数组包含相同单词的位置使用 1和一个0 对于包含不同单词的位置,ab 的交集将表示为 [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]; az 的交集将表示为 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] 等。

我们可以比简单的 O(n**2) 算法(即一个 for 循环在另一个 for 循环中)做得更好?

I have around 500,000 arrays of 10 words i.e. 500,000 word 10-grams. For every 10-gram, I need to know in which positions, if any, the remaining 499,999 10-grams have identical elements:

a = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

b = ['A', 'M', 'C', 'M', 'E', 'M', 'G', 'M', 'I', 'M']

...

z = ['R', 'R', 'R', 'R', 'R', 'F', 'G', 'H', 'I', 'J']

If we use a 1 for positions where the two arrays contain the same word and a 0 for positions where they contain different words, the intersection of a with b would be represented as [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]; the intersection of a with z would be represented as [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], etc.

Can we do better than a naive O(n**2) algorithm, i.e. one for loop within another for loop?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

巴黎夜雨 2025-01-18 11:57:02

有趣的问题:)

所以我有一个想法,我认为它是O(n*log(n) + n),其中+n渐近无关。

所以我建议如下:

tuple_len = 10
min_value = 1
max_value = 10
number_of_entries = 100
l = [[j] + [randint(min_value,max_value) for i in range(tuple_len)] for j in range(number_of_entries)]

基本集:

[[0, 9, 10, 3, 6, 3, 10, 9, 7, 8, 4],
 [1, 2, 3, 6, 7, 9, 2, 5, 10, 6, 10],
 [2, 5, 4, 10, 8, 5, 9, 2, 7, 4, 3],
 [3, 5, 9, 4, 5, 5, 3, 10, 1, 4, 4],
 [4, 9, 10, 9, 10, 9, 10, 6, 1, 6, 2],
 [5, 5, 6, 3, 6, 9, 5, 8, 3, 1, 1],
 [6, 9, 7, 5, 5, 5, 2, 1, 2, 3, 6],
 [7, 2, 6, 9, 10, 5, 6, 7, 3, 7, 5],
 [8, 6, 8, 9, 3, 7, 1, 2, 9, 8, 10],
 [9, 7, 5, 7, 2, 1, 3, 7, 1, 2, 9],
 [10, 1, 4, 4, 3, 6, 9, 6, 3, 3, 8],
 [11, 8, 3, 10, 10, 5, 9, 7, 3, 4, 5],
...]

所以为了方便起见,我只是使用数字并将列表中的位置添加为第一个值。

我建议对数据集的每一列数据依次进行排序,其中排序是O(n*log(n)),然后将所有具有相同值的条目的位置值添加到集,其时间复杂度为O(n)。结果类似于:

[{6, 18, 24, 26},
 {22, 34},
 {1, 6, 19, 31, 57, 58},
 {1, 9, 18},
...}

这可以解释为条目 6、18、24 和 26 在位置 1 具有相同的值。
检查两个条目是否对应的方法是 Ò(1) with

true if (a in match_set) and (b in match_set) else false

代码示例如下:

match_sets = [set() for i in range(tuple_len)]


for position in range(tuple_len):
    l = sorted(l, key= lambda x: x[position+1])
    last_value = l[0][position+1]
    for entry in range(number_of_entries):
        if l[entry][position + 1] == last_value:
            match_sets[position].add(l[entry][0])
            last_value = l[entry][position + 1]
        

Fun question :)

So I have an idea and I think it is O(n*log(n) + n) where +n asymptotically irrelevant.

So I would propose something as follows:

tuple_len = 10
min_value = 1
max_value = 10
number_of_entries = 100
l = [[j] + [randint(min_value,max_value) for i in range(tuple_len)] for j in range(number_of_entries)]

Base set:

[[0, 9, 10, 3, 6, 3, 10, 9, 7, 8, 4],
 [1, 2, 3, 6, 7, 9, 2, 5, 10, 6, 10],
 [2, 5, 4, 10, 8, 5, 9, 2, 7, 4, 3],
 [3, 5, 9, 4, 5, 5, 3, 10, 1, 4, 4],
 [4, 9, 10, 9, 10, 9, 10, 6, 1, 6, 2],
 [5, 5, 6, 3, 6, 9, 5, 8, 3, 1, 1],
 [6, 9, 7, 5, 5, 5, 2, 1, 2, 3, 6],
 [7, 2, 6, 9, 10, 5, 6, 7, 3, 7, 5],
 [8, 6, 8, 9, 3, 7, 1, 2, 9, 8, 10],
 [9, 7, 5, 7, 2, 1, 3, 7, 1, 2, 9],
 [10, 1, 4, 4, 3, 6, 9, 6, 3, 3, 8],
 [11, 8, 3, 10, 10, 5, 9, 7, 3, 4, 5],
...]

So I just used numbers for convenience and added the position in list as first value.

I propose to sort the dataset for each column of the data in turn, where sorting is O(n*log(n)), and then add the positional value of all entries with the same value to a set, which is O(n). The result looks something like:

[{6, 18, 24, 26},
 {22, 34},
 {1, 6, 19, 31, 57, 58},
 {1, 9, 18},
...}

This can be interpreted as Entry 6, 18, 24 and 26 have the same value in position 1.
Checking if two entries correspond is Ò(1) with

true if (a in match_set) and (b in match_set) else false

Code example below:

match_sets = [set() for i in range(tuple_len)]


for position in range(tuple_len):
    l = sorted(l, key= lambda x: x[position+1])
    last_value = l[0][position+1]
    for entry in range(number_of_entries):
        if l[entry][position + 1] == last_value:
            match_sets[position].add(l[entry][0])
            last_value = l[entry][position + 1]
        
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文