有人知道 Python 中一个很棒的稀疏一维数组库吗?

发布于 2024-09-05 05:55:00 字数 340 浏览 14 评论 0原文

我正在 Python 中研究一种大量使用 int64 数组的算法。数组通常是稀疏的,并且不断地读取和写入。我目前使用的是相对较大的本机数组,性能很好,但内存使用率很高(正如预期的那样)。

我希望能够让数组实现不为未使用的值浪费空间,并允许索引偏移量不为零。举个例子,如果我的数字从 1,000,000 开始,我希望能够索引从 1,000,000 开始的数组,并且不需要浪费一百万个未使用的值的内存。

数组读取和写入需要快速。扩展到新的领域可能会有一点延迟,但如果可能的话,读取和写入应该是 O(1)。

有人知道有一个图书馆可以做到这一点吗?

谢谢!

更新为将 int64 作为数据类型。

I am working on an algorithm in Python that uses arrays of int64s heavily. The arrays are typically sparse and are read from and written to constantly. I am currently using relatively large native arrays and the performance is good but the memory usage is high (as expected).

I would like to be able to have the array implementation not waste space for values that are not used and allow an index offset other than zero. As an example, if my numbers start at 1,000,000 I would like to be able to index my array starting at 1,000,000 and not be required to waste memory with a million unused values.

Array reads and writes needs to be fast. Expanding into new territory can be a small delay but reads and writes should be O(1) if possible.

Does anybody know of a library that can do it?

Thanks!

Updated to mention int64 as the data type.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

好久不见√ 2024-09-12 05:55:00

听起来像 blist 类型(文档下载)可能正是您正在寻找的内容(免责声明:我是作者)。它与 Python 的 list 具有完全相同的接口,因此没有学习曲线,但它具有不同的性能特征。特别是,它在许多情况下可以有效地处理稀疏列表。下面是创建一个包含 2**29 个元素的列表的示例。这几乎是瞬时的。以这种方式创建的稀疏列表使用 O(log n) 空间。

>>> from blist import *
>>> x = blist([0])             # x is a blist with one element
>>> x *= 2**29                 # x is a blist with > 500 million elements
>>> x.append(5)                # append to x
>>> y = x[4:-234234]           # Take a 500 million element slice from x
>>> del x[3:1024]              # Delete a few thousand elements from x

迭代整个列表的操作仍然需要 O(n) 时间(removereversecount 等)。该文档详细说明了每个操作的时间复杂度,因此您应该能够评估它是否满足您的需求。

It sounds like the blist type (documentation, download) might be just what you're looking for (disclaimer: I'm the author). It has exactly the same interface as Python's list, so there's no learning curve, but it has different performance characteristics. In particular, it can efficiently handle sparse lists in many cases. Below is an example that creates a list with 2**29 elements. It's pretty much instantaneous. Sparse lists created in this manner use O(log n) space.

>>> from blist import *
>>> x = blist([0])             # x is a blist with one element
>>> x *= 2**29                 # x is a blist with > 500 million elements
>>> x.append(5)                # append to x
>>> y = x[4:-234234]           # Take a 500 million element slice from x
>>> del x[3:1024]              # Delete a few thousand elements from x

Operations that iterate over the whole list still take O(n) time (remove, reverse, count, etc.). The documentation spells out the time-complexity for each operation, so you should be able to assess if it will meet your needs.

沉睡月亮 2024-09-12 05:55:00

You could remap a numpy sparse matrix into a sparse array - or consider using a hash table(a python dict). As far as the offset is concerned, just wrap whatever storage class you are using and make you own insert/lookup/remove methods.

情魔剑神 2024-09-12 05:55:00

我不懂Python,所以这可能是一个无法回答的问题:

在某些语言中,您可以通过将索引空间中的函数定义到数据空间中来模拟稀疏数组。例如(伪代码):

f[1000000] = 32;
f[2000000] = 51;

在某些语言(最好的语言)中,数组引用的形式(例如 f[3])看起来就像函数调用的形式(例如 f[3])。当然,这是因为数组定义了从索引空间到数据空间的函数。通过这种方式实现高维稀疏数组也非常容易。

I don't know any Python so this is probably an un-answer:

In some languages you can simulate a sparse array by defining a function from your index space into your data space. For example (pseudo-code):

f[1000000] = 32;
f[2000000] = 51;

In some languages (the best ones) the form of an array reference (eg f[3]) looks just like the form of a function call (eg f[3]). This is, of course, because an array defines a function from an index space into a data space. It's very easy to implement higher-dimensioned sparse arrays this way too.

囚我心虐我身 2024-09-12 05:55:00

为什么不直接使用字典呢?

sparse = dict()
sparse[100000] = 1234
sparse[123456] = 2345

Why not just use a dict?

sparse = dict()
sparse[100000] = 1234
sparse[123456] = 2345
妳是的陽光 2024-09-12 05:55:00

另一种选择 - 至少如果您愿意自己实现的话 - 是页表。这通常在虚拟内存系统中用于将虚拟地址映射到物理地址,如果您的地址空间稀疏并且使用的地址聚集,那么它的效果最好。如果使用的地址是随机分布的,则效果会较差。

页表的基本方法与 Trie 相同 - 递归细分。页表有一些固定数量的级别,每个节点都是一个固定大小的数组。如果给定子节点的条目为空,则该节点覆盖的所有叶子都为空。页表的主要优点是查找速度快,只需要很少的位移和取消引用。

让我们看一下 2 级页表的简单 Python 实现:

class Pagetable(object):
  def __init__(self, num_bits=8):
    """Creates a new Pagetable with num_bits bits per level.

    Args:
      num_bits: The number of bits per pagetable level.
        A 2 level pagetable will be able to store indexes between 0 and 2^(num_bits*2).
    """
    self.num_bits = num_bits
    self.mask = (1 << num_bits) - 1
    self.root = [None] * (2 ** num_bits)

  def __getitem__(self, idx):
    page = self.root[idx >> self.num_bits]
    return page and page[idx & self.mask]

  def __setitem__(self, idx, val):
    page = self.root[idx >> self.num_bits]
    if not page:
      page = self.root[idx >> self.num_bits] = [None] * (2 ** self.num_bits)
    page[idx & self.mask] = val

Another option - at least if you're willing to implement one yourself - is a Page table. This is commonly used in virtual memory systems to map virtual addresses to physical addresses, and it works best if your address space is sparsely populated, and used addresses are clustered. If used addresses are randomly distributed, this will be less effective.

The basic approach of a page table is the same as a Trie - recursive subdivision. A page table has some fixed number of levels, and each node is an array of a fixed size. If the entry for a given subnode is null, all the leaves covered by that node are null. The main advantage of the page table is that lookups are fast, requiring only a few bit-shifts and dereferences.

Let's see a straightforward Python implementation of a 2-level pagetable:

class Pagetable(object):
  def __init__(self, num_bits=8):
    """Creates a new Pagetable with num_bits bits per level.

    Args:
      num_bits: The number of bits per pagetable level.
        A 2 level pagetable will be able to store indexes between 0 and 2^(num_bits*2).
    """
    self.num_bits = num_bits
    self.mask = (1 << num_bits) - 1
    self.root = [None] * (2 ** num_bits)

  def __getitem__(self, idx):
    page = self.root[idx >> self.num_bits]
    return page and page[idx & self.mask]

  def __setitem__(self, idx, val):
    page = self.root[idx >> self.num_bits]
    if not page:
      page = self.root[idx >> self.num_bits] = [None] * (2 ** self.num_bits)
    page[idx & self.mask] = val
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文