Memory-efficient string-to-string map in Python (or C)

I need a memory-efficient data structure for storing about a million key-value pairs, where keys are strings of about 80 bytes and values are strings of about 200 bytes, the total key and value size being about 280MB. I also need efficient lookup of a value by key, preferably a hash map. The memory overhead should be as little as possible, e.g. for 280MB of useful data, the data structure shouldn't use more than 300MB of virtual memory (including malloc() overhead and everything else). The usage pattern is the following: we start with an empty data structure, and we populate it gradually, never changing keys, and never changing the length of values. As a plus, the data structure may support changing the length of values, at the expense of a 100% value overhead (meaning that for x value bytes, x bytes might be wasted temporarily in unused buffer space).

I need a pure Python module, or a built-in Python module, or a C implementation preferably with (C)Python bindings. I'd prefer if it was possible to serialize the whole data structure to disk, and to read it back very quickly.

Just to prove that such a small overhead is possible, I've created a simple design with open addressing: a hash table of 1.25 million elements containing 4-byte pointers to 1MB data blocks, the data blocks containing the key and value lengths as base-128 varints. This design has an important limitation: it doesn't allow removing or changing pairs without wasting their memory area. According to my calculations, with 1 million key-value pairs of 280 bytes each, the overhead is less than 3.6% (10,080,000 bytes). The limits above are more generous; they allow 20,000,000 bytes of overhead.
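
To make that concrete, here is a rough pure-Python sketch of such an append-only layout: an open-addressing table of 4-byte offsets stored in an array.array, pointing into one packed bytearray arena that holds varint-encoded key/value lengths and the raw bytes. The class and all names are illustrative only, not the exact design; keys and values are assumed to be byte strings, and Python's own containers add some overhead on top of the raw byte layout.

import array
import hashlib

EMPTY = 0xFFFFFFFF  # sentinel meaning "slot unused"

def encode_varint(n):
    # base-128 varint: 1 byte for lengths < 128, 2 bytes for lengths < 16384
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(buf, pos):
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

class PackedMap(object):
    # Append-only: an open-addressing index of offsets into one packed arena
    # holding varint(klen) + key + varint(vlen) + value for each pair.
    def __init__(self, slots=1250000):
        self.slots = slots
        self.index = array.array('I', [EMPTY]) * slots  # ~5 MB for 1.25M slots
        self.arena = bytearray()

    def _home(self, key):
        return int(hashlib.sha1(key).hexdigest(), 16) % self.slots

    def _read(self, off):
        klen, p = decode_varint(self.arena, off)
        key = bytes(self.arena[p:p + klen])
        vlen, q = decode_varint(self.arena, p + klen)
        return key, bytes(self.arena[q:q + vlen])

    def put(self, key, value):
        i = self._home(key)
        while self.index[i] != EMPTY:        # linear probing, no deletions
            i = (i + 1) % self.slots
        self.index[i] = len(self.arena)
        self.arena += encode_varint(len(key)) + key
        self.arena += encode_varint(len(value)) + value

    def get(self, key):
        i = self._home(key)
        while self.index[i] != EMPTY:
            k, v = self._read(self.index[i])
            if k == key:
                return v
            i = (i + 1) % self.slots
        raise KeyError(key)

Serialization is then just a matter of writing the index array and the arena out as two flat blocks and reading them back.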

I've just found http://www.pytables.org/ , which provides fast access and memory-efficient packing of data. I have to examine it more closely to check if it suits my needs.

橘和柠 2024-10-05 19:27:52

Ok, the dirt-simple approach.

Use a Python dictionary for the data structure. I filled a Python dictionary with 1 million random key-value pairs where the key was 80 characters and the value 200 characters. It took 360,844 KB on my computer, which is outside your specification of no more than 300 MB, but I offer it up as a solution anyway because it's still pretty memory efficient.

This also fails your requirement of having a C API. I'm not sure why you need C, but as the question is tagged Python and lacks a C tag, I'll offer the pure Python to see if it just might fit the bill.

Regarding persistence. Use the cPickle module. It's very fast and, again, dirt-simple. To save your dictionary:

with open("myfile.pkl", "wb") as f:
    cPickle.dump(mydict, f)

To reload your dictionary:

with open("myfile.pkl", "rb") as f:
    mydict = cPickle.load(f)

A second dirt-simple idea is to use the shelve module, which is basically a disk-based Python dictionary. Memory overhead is very low (it's all on disk). But it's also much slower.
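
For reference, a minimal shelve sketch (the filename is made up):

import shelve

db = shelve.open("mydata.shelf")   # dict-like object backed by a file on disk
db["some key"] = "some value"
print(db["some key"])
db.close()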

避讳 2024-10-05 19:27:52

Martijn mentioned this in a comment (not sure why people comment with answers), but I agree: use SQLite. You should give it a try and see if it will meet your needs.
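
A quick way to try that from Python is the built-in sqlite3 module; the table and file names below are made up for the sketch:

import sqlite3

conn = sqlite3.connect("pairs.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

def put(key, value):
    conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (key, value))

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

put("some key", "some value")
conn.commit()
print(get("some key"))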

樱娆 2024-10-05 19:27:52

If you don't plan to have large amounts of deletes, then this isn't that hard. Deletes lead to fragmentation.

You also need to commit to a fixed-length key. You mentioned 80 bytes. Are your keys allowed to duplicate? If not, it's even easier.

So, here is what you do.

You create an array of:

struct key {
    char value[80];   /* fixed-length key, zero-padded */
    char *data;       /* pointer to the value bytes */
};

And you keep this array sorted.

If your keys can duplicate, then you need:

struct link {
    char *data;          /* one value */
    struct link *next;   /* next value for the same key */
};

struct key {
    char value[80];      /* fixed-length key, zero-padded */
    struct link *data;   /* head of the list of values */
};

(My C is rusty, but this is the gist of it) The latter has each key pointing to a linked list of values.

Then a lookup is a simple binary search. The "pain" is in maintaining this array and inserting/deleting keys. It's not as painful as it sounds, but it saves a LOT of memory, especially on 64bit systems.

What you want to reduce is the number of pointers. Pointers are expensive when you have lots of structures filled with pointers. On a 64-bit system, a pointer is 8 bytes. So a single pointer per entry already eats 8 MB of your memory budget for a million entries.

So, the expense is in building the array, copying and compacting memory (if you "know" you will have a million rows and can commit to that, then malloc(1000000 * sizeof(struct key)) right away; it'll save you some copying during expansion).

But don't be afraid: once it's up and running, performance is quite good. Modern CPUs are actually pretty good at copying 100MB blocks of memory around.

Just as an aside, I just did something much like this in Java. On a 64-bit JVM, a Map with 25M entries takes 2GB of RAM. My solution (using techniques similar to this) comes in at around 600MB. Java uses more pointers than C, but the premise is the same.
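
For what it's worth, here is a loose Python analogue of the sorted-array-plus-binary-search idea, using the standard bisect module. It only mirrors the insert/lookup logic; the memory savings described in this answer come from the packed, pointer-free C layout, which plain Python lists do not give you:

import bisect

class SortedMap(object):
    def __init__(self):
        self.keys = []     # kept sorted at all times
        self.values = []   # values[i] belongs to keys[i]

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.values.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)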

原谅我要高飞 2024-10-05 19:27:52

Have you tried using a straightforward dict? Most of your data is in strings, so the overhead might fit within your requirements.

半葬歌 2024-10-05 19:27:52

You can use the sha1 of the key instead of the key itself. If the keys are unique, then the sha1 hashes of the keys are likely to be unique, too. Storing the 20-byte digest instead of the 80-byte key saves memory and may squeeze you in under your limit.

from random import choice
from string import letters
from hashlib import sha1

def keygen(length):
    return "".join(choice(letters) for _ in xrange(length))

def gentestdata(n=1000*1000):
    # return dict((sha1(keygen(80)).digest(), keygen(200)) for _ in xrange(n))
    d = {}
    for _ in xrange(n):
        key = sha1(keygen(80)).digest()
        assert key not in d
        value = keygen(200)
        d[key] = value
    return d

if __name__ == '__main__':
    d = gentestdata()

On my ubuntu box, this tops out at 304 MB of memory:

2010-10-26 14:26:02 hbrown@hbrown-ubuntu-wks:~$ ps aux | grep python
[...]
hbrown   12082 78.2  7.5 307420 303128 pts/1   S+   14:20   4:47 python

Close enough? It's python, not C.


Later: also, if your data is somewhat redundant, you can gzip the values. It's a time versus space trade-off.
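
A small sketch of that trade-off using the standard zlib module (only worthwhile if the values actually compress; the helper names are made up):

import zlib

def store(d, key, value):
    # compress on insert: costs CPU, saves memory for redundant values
    d[key] = zlib.compress(value)

def fetch(d, key):
    # decompress on every lookup
    return zlib.decompress(d[key])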

浅暮の光 2024-10-05 19:27:52

Using SQLite is a good idea. A quick implementation will tell you, with little effort, whether it is fast enough.


If you determine you have to roll your own, I'd recommend the following:

  • How well can you predict the number of pairs, or an upper limit for that?
  • How well can you predict the total data size, or an upper limit for that?

Arena allocator for strings and nodes. (Usually, you'd work on a list of arenas, so you don't have to predict the total size).

Alignment depends on your algorithms; in principle you could pack it byte-tight, and the only overhead is your overallocation, which only minimally affects your working set.

However, if you have to run any cmp/copy etc. operations on these strings, remember that with the following guarantees, you can squeeze a little or a lot from these string operations:

  • all elements are CPU word aligned
  • all pad bytes are (e.g.) 0
  • you can safely read "beyond" a string end as long as you don't cross a CPU border

Hash table for the index. A dictionary would work, too, but that makes sense only if potential degradation / rehashing would be a serious issue. I don't know any "stock" hashtable implementation for C, but there should be one, right? right? Just replace allocations with calls to the arena allocator.


Memory Locality

If you can guarantee that lookup will never request a string that is not in the map, you should store the keys in a separate arena, as they are needed only on hash collisions. That can improve memory locality significantly. (In that case, if you ever have a "final" table, you could even copy the colliding keys to a new arena, and throw away all the others. The benefits of that are probably marginal, though.)

Separation can help or hurt, depending on your access patterns. If you typically use the value once after each lookup, having them pair-wise in the same arena is great. If you e.g. look up a few keys, then use their values repeatedly, separate arenas make sense.


If you have to support "funny characters" / Unicode, normalize your strings before storing them.

怪我闹别瞎闹 2024-10-05 19:27:52

You could use the struct module to pack binary data and unpack it when needed.
You can implement memory-efficient storage using this approach. I guess access would be a pain.
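
For example, each record could be packed with a small fixed header holding the key and value lengths; the '<HH' layout and the helper names here are just one illustrative choice:

import struct

HEADER = struct.Struct("<HH")   # 2-byte key length + 2-byte value length

def pack_record(key, value):
    return HEADER.pack(len(key), len(value)) + key + value

def unpack_record(buf, offset=0):
    klen, vlen = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    key = buf[start:start + klen]
    value = buf[start + klen:start + klen + vlen]
    return key, value, start + klen + vlen   # last item: offset of the next record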

一梦浮鱼 2024-10-05 19:27:52

Apache Portable Runtime (aka APR) has a c-based hash table. You can see documentation at http://apr.apache.org/docs/apr/0.9/group_apr_hash.html

With apr_hash_t, all you store is void*. So it gives you full control over the values. So if you want, you can store a pointer to a 100-byte block instead of the actual length of the string.

墨小墨 2024-10-05 19:27:52

Judy should be memory-efficient: http://judy.sourceforge.net/
(Benchmarks: http://www.nothings.org/computer/judy/, see "Data Structure Size").
See also: http://www.dalkescientific.com/Python/PyJudy.html

Also,

For keys of a fixed size there is http://panthema.net/2007/stx-btree/ in C++ (I'm sure that with a custom C wrapper it can be used from CPython).
If the dataset allows it, you can store the variable-length keys in the value and use a hash or a prefix of the variable-length key as the fixed-length key.

The same logic applies to http://google-opensource.blogspot.ru/2013/01/c-containers-that-save-memory-and-time.html and http://code.google.com/p/sparsehash/ - instead of using a heavy std::string as a key, use a 32-bit or 64-bit integer key, deriving it somehow from the real variable-length key.

南街女流氓 2024-10-05 19:27:52

Since I couldn't find any existing solutions which will pack the memory tightly, I've decided to implement it in C for myself. See my design with open addressing in the question.
