Listing all keys in a bucket with a simple map-reduce vs. bucket.get_keys()?

Posted on 2024-12-06 19:46:20

According to Riak's docs (using the Python bindings), get_keys() is extremely expensive and not suitable for production. My question is whether a very simple map query would be a suitable alternative. For instance, a job with only a map phase, using the function:

function(v) { return [v.key]; }

Is this going to perform better than get_keys()? Why wouldn't Riak ship with this implementation instead of the current version of get_keys()? Is there a better way to list the keys in a bucket?
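
For reference, a minimal sketch of how that map-only job might be issued from Python. It assumes the client object exposes the add()/map()/run() MapReduce chain of the Riak Python client; list_keys_via_map is a hypothetical helper name, and the client is passed in so the sketch does not depend on any particular connection setup.

```python
# JavaScript map phase from the question: emit each object's key.
MAP_KEYS_JS = "function(v) { return [v.key]; }"


def list_keys_via_map(riak_client, bucket_name):
    """Run a map-only job over a whole bucket and return the emitted keys.

    Hypothetical helper. Assumes riak_client provides the add()/map()/run()
    MapReduce chain; it is not part of the client library itself.
    """
    results = riak_client.add(bucket_name).map(MAP_KEYS_JS).run()
    # Assumption: run() may return None when nothing is emitted.
    return list(results or [])
```

It would be called as list_keys_via_map(riak.RiakClient(), 'mybucket'). Note that a whole bucket used as MapReduce input still has to be enumerated by Riak before the map phase can run, so this should not be expected to be cheaper than get_keys().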

Comments (2)

征﹌骨岁月お 2024-12-13 19:46:20

The get_keys() function calls list_keys in the back end and is considered to be an expensive operation because it performs a full scan of the key space. Depending on your Riak back end, this could also involve a full scan of the data as stored on disk (InnoStore springs to mind). The default storage back end (Bitcask) stores all of your keys in memory, so performance shouldn't be as much of a problem.

The other reason list_keys was considered expensive is that it was formerly a blocking operation, since it involved what the Basho developers refer to as a 'fold' over all of the keys. list_keys now uses a snapshot of the bucket (instead of reading the live key space), which also makes it a lighter-weight operation.

This is made easier with an upgrade to Riak 1.0. If you're using the LevelDB back end, you can enable secondary indexes on a bucket and use the $key index (automatically provided by Riak) to get a list of all keys in a bucket.

As for why Riak doesn't ship with a better implementation of something like this: ask what the functionality is for. In an RDBMS, getting all primary keys of a table involves a full table scan. In Riak, getting all keys from a bucket requires scanning all data on every node, shipping the key names back to the originating node, combining that data, and then sending it to the calling client. Because of Riak's distributed, unordered state, this operation is expensive no matter how you slice it. There are, as I outlined above, ways to make it better.
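
The scatter/gather cost described above can be illustrated with a toy model in plain Python. This is not Riak code: the node contents, bucket names, and function names are all made up, and each "node" is just a list of (bucket, key) pairs in no particular order.

```python
def node_list_keys(node_data, bucket):
    """Scan every (bucket, key) pair on one node; return keys and scan count."""
    scanned = 0
    found = []
    for stored_bucket, key in node_data:
        scanned += 1  # every stored entry is visited, whatever its bucket
        if stored_bucket == bucket:
            found.append(key)
    return found, scanned


def cluster_list_keys(nodes, bucket):
    """Coordinator: ask every node, then merge the partial results."""
    all_keys = []
    total_scanned = 0
    for node_data in nodes:
        keys, scanned = node_list_keys(node_data, bucket)
        all_keys.extend(keys)
        total_scanned += scanned
    return all_keys, total_scanned


# Three hypothetical nodes holding keys from two buckets, unordered.
nodes = [
    [("users", "u3"), ("logs", "l1"), ("users", "u1")],
    [("logs", "l2"), ("logs", "l3")],
    [("users", "u2"), ("logs", "l4")],
]

keys, scanned = cluster_list_keys(nodes, "users")
print(sorted(keys), scanned)
```

This prints ['u1', 'u2', 'u3'] 7: all seven stored entries are visited to return three keys, and every node has to take part even if it holds nothing from the requested bucket.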

ら栖息 2024-12-13 19:46:20

If you are using the eleveldb backend (which is implemented with the LevelDB library), your keys are stored in sorted order, so you can do something like:

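import riak

def get_bucket_keys(riak_client, bucket_name, start='0', stop='Z'):
    # Range query on the built-in $key index. The default bounds only cover
    # keys that sort between '0' and 'Z'; widen them for lowercase keys.
    for record_key in riak_client.index(bucket_name, '$key', start, stop).run():
        yield record_key

for key in get_bucket_keys(riak.RiakClient(), 'mybucket'):
    print(key)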

With eleveldb, Riak still queries all the nodes, but each one scans only the specified range. So, if you populate your buckets in a way that lets you control key ranges, listing bucket keys can be very performant.

The trade-off is that you can't specify a LIMIT for the number of keys processed on each node. That is why you NEED to control the key naming in any bucket whose keys you need to list.
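
To see why controlling key names matters, here is a small stand-alone sketch of a range scan over sorted keys. It uses Python's bisect module on an in-memory list as a stand-in for a sorted on-disk key store; the date-prefixed key scheme and the range_scan name are assumptions made for illustration, not anything Riak prescribes.

```python
import bisect


def range_scan(sorted_keys, start, stop):
    """Return the keys with start <= key <= stop from a sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, stop)
    # Only the slice between the two bounds is touched.
    return sorted_keys[lo:hi]


# Hypothetical keys with a date prefix, so a key range maps to a time window.
node_keys = sorted([
    "2012-01-03:a", "2012-01-17:b", "2012-02-02:c",
    "2012-02-20:d", "2012-03-05:e",
])

# '~' sorts after digits and letters in ASCII, so it works as an upper bound.
february = range_scan(node_keys, "2012-02-", "2012-02-~")
print(february)
```

This prints ['2012-02-02:c', '2012-02-20:d']. With keys named this way, a narrow start/stop pair keeps the number of keys each node has to return small, which is the only limit available.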
