Listing all keys in a bucket with a simple map-reduce vs. bucket.get_keys()?

Posted 2024-12-06 19:46:20


According to Riak's docs (using Python bindings), get_keys() is extremely expensive and not suitable for production. My question is whether a very simple map query is suitable. For instance, using a map stage only with the function:

function(v) { return [v.key]; }

Is this going to perform better than get_keys()? Why wouldn't Riak ship with this implementation instead of the current version of get_keys()? Is there a better way to list the keys of a bucket?


征﹌骨岁月お 2024-12-13 19:46:20

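get_keys() calls list_keys on the backend, and it is considered expensive because it performs a full scan of the keyspace. Depending on your Riak backend, this may also involve a full scan of the data stored on disk (InnoStore comes to mind). The default storage backend (Bitcask) keeps all keys in memory, so performance should not be a problem there.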

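The other reason list_keys is considered expensive is that it used to be a blocking operation, since it involves what the Basho developers call a "fold" over all keys. list_keys now works from a snapshot of the bucket (rather than reading the live keyspace), which also makes it a much lighter-weight operation.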

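This all gets much easier once you upgrade to Riak 1.0. If you use the LevelDB backend, you can enable secondary indexes on the bucket and use the $key index (supplied automatically by Riak) to get a list of all keys in the bucket.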

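As for why Riak doesn't ship with a better implementation of something like this: ask what purpose the feature would serve. In an RDBMS, getting all primary keys of a table involves a full table scan. In Riak, getting all keys from a bucket requires scanning all of the data on every node, sending the key names back to the originating node, combining that data, and then sending it to the calling client. Because Riak is distributed and unordered, the operation is expensive no matter how you slice it. As outlined above, there are ways to make it better.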
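To make the cost described above concrete, here is a minimal toy model in plain Python. It is not Riak code: `ToyNode` and `list_bucket_keys` are hypothetical names, and the assumption that each node holds one unordered store shared by all buckets is a simplification made only for illustration.

```python
class ToyNode:
    """Hypothetical stand-in for one node; not Riak's real data structures."""

    def __init__(self):
        # Assumption: one unordered store per node, shared by every bucket.
        self.store = {}
        self.scanned = 0  # entries examined by the most recent listing

    def put(self, bucket, key, value):
        self.store[(bucket, key)] = value

    def list_keys(self, bucket):
        # With no per-bucket ordering, every stored entry has to be examined.
        self.scanned = len(self.store)
        return [k for (b, k) in self.store if b == bucket]


def list_bucket_keys(nodes, bucket):
    """Coordinator side: ask every node, then combine the partial results."""
    merged = set()  # a set drops duplicates if more than one node reports a key
    for node in nodes:
        merged.update(node.list_keys(bucket))
    return sorted(merged)


nodes = [ToyNode() for _ in range(3)]
# Spread 12 objects over 3 nodes; only 3 of them belong to 'mybucket'.
for i in range(12):
    bucket = "mybucket" if i % 4 == 0 else "other"
    nodes[i % 3].put(bucket, "key%d" % i, b"payload")

keys = list_bucket_keys(nodes, "mybucket")
total_scanned = sum(node.scanned for node in nodes)
print(keys, total_scanned)
```

In this toy setup, returning three keys still means examining all twelve stored entries across all three nodes, which is the shape of the problem the answer describes.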


ら栖息 2024-12-13 19:46:20

If you are using the eleveldb backend (which is implemented with the LevelDB library), your keys are stored in sorted order, so you can do something like:

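import riak

def get_bucket_keys(riak_client, bucket_name, start='0', stop='Z'):
    # Range query on the built-in $key index: yields keys in the range start..stop.
    # Lowercase letters sort after 'Z' in ASCII, so widen `stop` if keys can start with them.
    for record_key in riak_client.index(bucket_name, '$key', start, stop).run():
        yield record_key

for key in get_bucket_keys(riak.RiakClient(), 'mybucket'):
    print(key)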

With eleveldb, Riak still queries all the nodes, but each node scans only the specified key range. So if you populate your buckets in a way that lets you control key ranges, listing bucket keys can perform very well.

The trade-off is that you can't specify a LIMIT on the number of keys processed on each node. That is why you need to control the key naming in any bucket whose keys you want to list.
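Why sorted storage helps can be sketched without Riak at all. The snippet below is a plain-Python illustration, not the eleveldb implementation: `range_scan` and `scan_by_prefix` are hypothetical helpers, the date-prefixed key names are invented for the example, and inclusive range bounds plus byte-wise key ordering are assumptions.

```python
import bisect


def range_scan(sorted_keys, start, stop):
    """Return keys with start <= key <= stop, locating both ends by binary search."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, stop)
    return sorted_keys[lo:hi]


def scan_by_prefix(sorted_keys, prefixes):
    """Split one large listing into smaller per-prefix range scans."""
    for prefix in prefixes:
        # '~' sorts after ASCII letters, digits and '-', so it closes the prefix range.
        yield prefix, range_scan(sorted_keys, prefix, prefix + "~")


# Invented key scheme: a date prefix chosen by the application.
keys = sorted(["2024-01-a", "2024-01-b", "2024-02-a", "2024-03-a", "2025-01-a"])

january = range_scan(keys, "2024-01", "2024-01~")
by_month = dict(scan_by_prefix(keys, ["2024-01", "2024-02", "2024-03"]))
print(january)
print(by_month)
```

Since there is no LIMIT, one way to bound the work of a single query is to issue several narrow ranges like `scan_by_prefix` does, which only works if the application chose key names that make such ranges meaningful.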
