计算并集和交集的编程方法的正式名称

发布于 2024-08-17 02:31:23 字数 639 浏览 4 评论 0原文

当我想同时计算两个集合（存储为列表）的并集、交集和差异时，我[当然是重新]发明了这个[轮子]。初始代码（不是最严格的）：

dct = {}
for a in lst1:
  dct[a] = 1
for b in lst2:
  if b in dct:
    dct[b] -= 1
  else:
    dct[b] = -1

union = [k for k in dct]
inter = [k for k in dct if dct[k] == 0]
oneminustwo = [k for k in dct if dct[k] ==  1]
twominusone = [k for k in dct if dct[k] ==  -1]

然后我意识到我应该使用 00, 01, 10 和 11 而不是 -1, 1, 0, ... 因此，位置 n 处的位表示集合 n 中的成员资格。

使用 32 位 int 可以将其推广到最多 32 个集合，或者使用位数组或字符串可以推广到任意数量的集合。因此，您预先计算该字典一次，然后使用非常快的 O(n) 查询来提取感兴趣的元素。例如，全 1 表示所有集合的交集。全 0 是一种特殊的 - 不会发生。

无论如何，这并不是我自吹自擂。这肯定是以前发明的并且有一个名字。它叫什么？这种方法是否在数据库中使用过？

原文

I [surely re] invented this [wheel] when I wanted to compute the union and the intersection and diff of two sets (stored as lists) at the same time. Initial code (not the tightest):

dct = {}
for a in lst1:
  dct[a] = 1
for b in lst2:
  if b in dct:
    dct[b] -= 1
  else:
    dct[b] = -1

union = [k for k in dct]
inter = [k for k in dct if dct[k] == 0]
oneminustwo = [k for k in dct if dct[k] ==  1]
twominusone = [k for k in dct if dct[k] ==  -1]

Then I realized that I should use 00, 01, 10, and 11 instead of -1, 1, 0, ...
So, a bit at position n indicates membership in set n.

This can be generalized to up to 32 sets using an 32-bit int, or to any number of sets using a bitarray, or a string. So, you pre-compute this dictionary once, and then use very fast O(n) queries to extract elements of interest. For instance, all 1s means intersection of all sets. All 0s is a special one - will not occur.

Anyhow, this is not to toot my own horn. This surely was invented before and has a name. What is it called? Is this approach used in databases somewhere?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

守护在此方 2024-08-24 02:31:23

使用 N 位整数来表示 N 个布尔值是称为完美哈希表的数据结构的一种特殊情况。请注意，您在提示您考虑位集的想法中明确使用了字典（它们是通用哈希表）。它是一个哈希表，因为您使用哈希来查找值，并且它是完美的，因为您永远不会发生冲突。特殊情况是由表的打包和存储方式决定的。

制定哈希函数，这显示了它与数组的不同之处：

int bitset_hash(int n) {
  // domain of this function is only non-negative ints
  return 1 << n;
}

注意 bitset_hash(3) 是 0b1000，它对应于使用 C int 和按位运算时的第 4 项（偏移量/索引 3）。（由于存储实现细节，按位运算也用于操作哈希中的特定项。）

扩展该方法以使用按位与/-或/-xor 进行集合运算是常见，并且不需要除“设置操作”之外的任何特殊名称，或者，如果您需要流行语，“集合论”。

最后，这是在 prime sieve 中使用它的另一个示例（我已在欧拉项目解决方案）：

class Sieve(object):
  def __init__(self, stop):
    self.stop = stop
    self.data = [0] * (stop // 32 // 2 + 1)
    self.len = 1 if stop >= 2 else 0
    for n in xrange(3, stop, 2):
      if self[n]:
        self.len += 1
        for n2 in xrange(n * 3, stop, n * 2):
          self[n2] = False

  def __getitem__(self, idx):
    assert idx >= 2
    if idx % 2 == 0:
      return idx == 2
    int_n, bit_n = divmod(idx // 2, 32)
    return not bool(self.data[int_n] & (1 << bit_n))

  def __setitem__(self, idx, value):
    assert idx >= 2 and idx % 2 != 0
    assert value is False
    int_n, bit_n = divmod(idx // 2, 32)
    self.data[int_n] |= (1 << bit_n)

  def __len__(self):
    return self.len

  def __iter__(self):
    yield 2
    for n in xrange(3, self.stop, 2):
      if self[n]:
        yield n

Using an N bit integer to represent N booleans is a special case of the data structure known as a perfect hash table. Notice you're explicitly using dicts (which are general hash tables) in the idea that prompted you to think about bitsets. It's a hash table because you use hashes to find a value, and it's perfect because you never have collisions. The special case is because of how the table is packed and stored.

Formulate the hash function, which shows how it's different from an array:

int bitset_hash(int n) {
  // domain of this function is only non-negative ints
  return 1 << n;
}

Notice bitset_hash(3) is 0b1000, which corresponds to the 4th item (offset/index 3) when using a C int and bitwise operations. (Because of the storage implementation detail, bitwise operations are also used to manipulate a specific item from the hash.)

Extending the approach to use bitwise-and/-or/-xor for set operations is common, and doesn't require any special name other than "set operations" or, if you need a buzzword, "set theory".

Finally, here's another example use of it in a prime sieve (I've used this code on Project Euler solutions):

class Sieve(object):
  def __init__(self, stop):
    self.stop = stop
    self.data = [0] * (stop // 32 // 2 + 1)
    self.len = 1 if stop >= 2 else 0
    for n in xrange(3, stop, 2):
      if self[n]:
        self.len += 1
        for n2 in xrange(n * 3, stop, n * 2):
          self[n2] = False

  def __getitem__(self, idx):
    assert idx >= 2
    if idx % 2 == 0:
      return idx == 2
    int_n, bit_n = divmod(idx // 2, 32)
    return not bool(self.data[int_n] & (1 << bit_n))

  def __setitem__(self, idx, value):
    assert idx >= 2 and idx % 2 != 0
    assert value is False
    int_n, bit_n = divmod(idx // 2, 32)
    self.data[int_n] |= (1 << bit_n)

  def __len__(self):
    return self.len

  def __iter__(self):
    yield 2
    for n in xrange(3, self.stop, 2):
      if self[n]:
        yield n

回复收藏 0 原文

少年亿悲伤 2024-08-24 02:31:23

是的，它有时用在数据库中，例如 PostgreSQL。正如维基百科提到的：

一些数据库系统没有
提供持久位图索引使用
内部位图以加快查询速度
加工。例如，PostgreSQL
版本 8.1 及更高版本实现了
“位图索引扫描”优化为
加速任意复杂的逻辑
可用索引之间的操作
在一张桌子上。

（来自 http://en.wikipedia.org/wiki/Bitmap_index）

回复收藏 0 原文

孤独患者 2024-08-24 02:31:23

使用整数来表示一组小整数是很常见的；它通常称为位集或位向量。这里您使用一个整数来表示“包含该值的输入序列集”。

你所做的操作让我想起了反转多重地图。

在您的情况下，输入是列表的列表：

[["a", "b"], ["a", "c", "d"]]

但是您可以将其视为一袋有序对，如下所示：

0, "a"
0, "b"
1, "a"
1, "c"
1, "d"

您只需构建一个包含反转对的表，

"a", 0
"b", 0
"a", 1
"c", 1
"d", 1

如下所示：

{"a": [0, 1],
 "b": [0],
 "c": [1],
 "d": [1]}

并且您碰巧使用位向量表示这些整数数组。

原始数据结构（列表的列表）可以轻松迭代给定列表的所有值。反向数据结构（列表字典）可以轻松查找具有给定值的所有列表。

It's very common to use an integer to represent a set of small integers; it's often called a bitset or bitvector. Here you're using an integer to represent "the set of input sequences that contain this value".

The operation you're doing reminds me of reversing a multimap.

In your case, the input is a list of lists:

[["a", "b"], ["a", "c", "d"]]

But you could instead think of it as a bag of ordered pairs, like this:

0, "a"
0, "b"
1, "a"
1, "c"
1, "d"

You're simply constructing a table that contains the reversed pairs

"a", 0
"b", 0
"a", 1
"c", 1
"d", 1

which looks like this:

{"a": [0, 1],
 "b": [0],
 "c": [1],
 "d": [1]}

and you happen to be representing those arrays of integers using bitvectors.

The original data structure (the list of lists) made it easy to iterate over all values for a given list. The reversed data structure (the dictionary of lists) makes it easy to find all lists that have a given value.

回复收藏 0 原文

~没有更多了~