Redis sorted sets and the best way to store uids

Posted 2025-01-01 22:18:52


I have data consisting of user_ids and tags for these user ids.
The user_ids occur multiple times and have a pre-specified number of tags (500), however that might change in the future. What must be stored is the user_id, their tags and their counts.
I want to easily find the tags with the top score later, etc. Every time a tag appears, its count is incremented.

My implementation in Redis uses sorted sets:

  • each user_id has its own sorted set

  • the key is the user_id, which is a hex number

It works like this:

zincrby user_id:x 1 "tag0"

zincrby user_id:x 1 "tag499"

zincrby user_id:y 1 "tag3"

and so on
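
For reference, under this scheme the top 5 tags of a single user can then be read back with something like:

zrevrange user_id:x 0 4 withscores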

Keeping in mind that I want to get the tags with the highest score, is there a better way?

The second issue is that right now I'm using "keys *" to retrieve these keys for client-side manipulation, which I know is not meant for production systems.

Plus, to avoid memory problems it would be great to iterate through a specified number of keys at a time (in the range of 10000). I know the keys have to be stored in memory, however they don't follow
a specific pattern that would allow partial retrieval, so I can't avoid the "zmalloc" error (4 GB, 64-bit Debian server).
The keys amount to around 20 million.
Any thoughts?


Comments (1)

原来分手还会想你 2025-01-08 22:18:52


My first point would be to note that 4 GB is tight for storing 20M sorted sets. A quick test shows that 20M users, each with 20 tags, would take about 8 GB on a 64-bit box (and that already accounts for the sorted set ziplist memory optimizations provided with Redis 2.4 - don't even try this with earlier versions).

Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.

As you pointed out, KEYS cannot be used to iterate over keys; it is rather meant as a debugging command. To support key iteration, you need to add a data structure that provides this access path. The only structures in Redis that can support iteration are the list and the sorted set (through their range methods). However, they tend to turn an O(n) iteration algorithm into O(n^2) (for a list) or O(n log n) (for a zset). A list is also a poor choice for storing keys, since it is difficult to maintain as keys are added and removed.

A more efficient solution is to add an index composed of regular sets. You need a hash function to map a given user to a bucket, and add the user id to the set corresponding to that bucket. If the user ids are numeric values, a simple modulo function is enough; if not, a simple string hashing function will do the trick.
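
A minimal sketch of such a bucket function, assuming the hex user ids described in the question (the CRC32 fallback and the bucket count are only illustrative choices):

import zlib

NBUCKETS = 1000

def bucket(user_id):
  # Map a user id to one of NBUCKETS index buckets
  try:
    # Hex (or decimal) ids: a simple modulo on the numeric value is enough
    return int(user_id, 16) % NBUCKETS
  except ValueError:
    # Arbitrary strings: any cheap string hash will do, e.g. CRC32
    return zlib.crc32(user_id.encode()) % NBUCKETS

# e.g. r.sadd("index:%d" % bucket("a3f2"), "a3f2"), with r a redis.Redis client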

So to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0 while user:1001 will be put in bucket index:1.

So on top of the zsets, we now have the following keys:

index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]

Within the sets, the key prefix is not needed, and it lets Redis optimize memory consumption by serializing the sets, provided they are kept small enough (the integer set optimization proposed by Sripathi Krishnan).
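
Whether a set actually stays in this compact integer encoding is controlled by a threshold in redis.conf; as a sketch (512 is the usual default, adjust to your expected bucket sizes):

# redis.conf: sets containing only integers keep the compact intset
# encoding as long as they hold no more than this many elements
set-max-intset-entries 512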

The global iteration consists of a simple loop over the buckets from 0 to 1000 (exclusive). For each bucket, the SMEMBERS command is applied to retrieve the corresponding set, and the client can then iterate over the individual items.

Here is an example in Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ----------------------------------------------------

import redis, random

POOL = redis.ConnectionPool(host='localhost', port=6379, db=0)

NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000

# ----------------------------------------------------
# Fill redis with some random data

def fill(r):
  p = r.pipeline()
  # Create only 10000 users for this example
  for id in range(0, NUSERS):
    user = "user:%d" % id
    # Add the user to the index: a simple modulo is used to hash the user id
    # and put it in the correct bucket
    p.sadd("index:%d" % (id % NBUCKETS), id)
    # Add 20 random tags to the user (tag ids 0 .. NTAGS-1)
    for x in range(0, 20):
      tag = "tag:%d" % random.randrange(NTAGS)
      # redis-py >= 3.0 signature: zincrby(name, amount, value)
      p.zincrby(user, 1, tag)
    # Flush the pipeline every 1000 users
    if id % 1000 == 0:
      p.execute()
      print(id)
  # Flush one last time
  p.execute()

# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags

def iterate(r):
  # Iterate on the buckets of the key index
  # The range depends on the function used to hash the user id
  for x in range(0, NBUCKETS):
    # Iterate on the users in this bucket
    for id in r.smembers("index:%d" % x):
      user = "user:%d" % int(id)
      print(user, r.zrevrangebyscore(user, "+inf", "-inf", 0, 5, True))

# ----------------------------------------------------
# Main function

def main():
  r = redis.Redis(connection_pool=POOL)
  r.flushall()
  m = r.info()["used_memory"]
  fill(r)
  info = r.info()
  print("Keys: ", info["db0"]["keys"])
  print("Memory: ", info["used_memory"] - m)
  iterate(r)

# ----------------------------------------------------

main()

By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.

IMO this strategy is simple and efficient, because it offers O(1) complexity to add/remove users, and true O(n) complexity to iterate over all items. The only downside is that the key iteration order is random.
