How to efficiently get the dot product of two hashes using Redis in Ruby on Rails

Posted on 2024-12-06 06:33:28

Each record in my features table has a token_vector attribute (a hash), like this:

Feature.find(1).token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }

There are 25 of these features. First, I entered the data into Redis with this in script/console:

REDIS.set(  "feature1",
            "#{ TokenVector.to_json Feature.find(1).token_vector }"
)
# ...
REDIS.set(  "feature25",
            "#{ TokenVector.to_json Feature.find(25).token_vector }"
)

TokenVector.to_json converts the hash into JSON format first. The 25 JSON hashes stored in Redis take up about 8 MB.

I have a method called Analysis.locate, which takes the dot product of two token_vectors. The dot product for hashes works like this:

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

For each key that appears in both hashes (a, b, and c in this case, but not d), the two values are multiplied together, and the products are then added up.

The value for a in hash1 is 1, the value for a in hash2 is 4. Multiply these to get 1*4 = 4.

The value for b in hash1 is 2, the value for b in hash2 is 5. Multiply these to get 2*5 = 10.

The value for c in hash1 is 3, the value for c in hash2 is 6. Multiply these to get 3*6 = 18.

The value for d in hash1 is nonexistent, the value for d in hash2 is 7. In this case, set d = 0 for the first hash. Multiply these to get 0*7 = 0.

Now add up the multiplied values. 4 + 10 + 18 + 0 = 32. This is the dot product of hash1 and hash2.

Analysis.locate( hash1, hash2 ) # => 32
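
The definition above can be sketched in a few lines of plain Ruby. This is only an illustration of the arithmetic, not the actual Analysis.locate; the name hash_dot is hypothetical, and a key missing from the second hash is treated as 0.

```ruby
# Hypothetical helper illustrating the hash dot product described above.
# A key missing from h2 contributes 0; a key present only in h2 is never
# visited, which is equivalent to treating it as 0 in h1.
def hash_dot(h1, h2)
  h1.inject(0) { |sum, (key, value)| sum + value * h2.fetch(key, 0) }
end

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

puts hash_dot(hash1, hash2) # prints 32
```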

I also have a frequently used method, Analysis.topicize. It takes one parameter, token_vector, which is just a hash like the ones above. Analysis.topicize takes the dot product of token_vector with each of the 25 features' token_vectors and collects those 25 dot products into a new vector called feature_vector. A feature_vector is just an array. Here is what the code looks like:

def self.topicize token_vector

  feature_vector = FeatureVector.new

  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature1" ) )
  )
  # ...
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature25" ) )
  )

  feature_vector

end

As you can see, it takes the dot product of token_vector and each feature's token_vector that I entered into Redis above, and pushes the value into an array.

My problem is, this takes about 18 seconds each time I invoke the method. Am I misusing Redis? I think the problem could be that I shouldn't load Redis data into Ruby. Am I supposed to send Redis the data (token_vector) and write a Redis function to have it do the dot_product function, rather than writing it with Ruby code?

Comments (2)

怂人 2024-12-13 06:33:28

You would have to profile it to be sure, but I suspect you're losing a lot of time in serializing/deserializing JSON objects. Instead of turning token_vector into a JSON string, why not put it directly into Redis, since Redis has its own hash type?

REDIS.hmset "feature1",   *Feature.find(1).token_vector.flatten
# ...
REDIS.hmset "feature25",  *Feature.find(25).token_vector.flatten

Hash#flatten turns a hash like { 'a' => 1, 'b' => 2 } into an array like [ 'a', 1, 'b', 2 ], and then we use splat (*) to send each element of the array as a separate argument to Redis#hmset (the "m" in "hmset" is for "multiple," as in "set multiple hash values at once").
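
To see the flatten-and-splat step in isolation, here is a small sketch that needs no Redis connection. The method fake_hmset is a hypothetical stand-in that only records its arguments; it is not part of the redis gem.

```ruby
# Hypothetical stand-in for a variadic call such as hmset.
# It returns the key and the list of arguments it received after the key.
def fake_hmset(key, *field_value_pairs)
  [key, field_value_pairs]
end

vector = { "a" => 0.1, "b" => 0.2 }

# Hash#flatten interleaves keys and values in one flat array.
flat = vector.flatten
p flat # prints ["a", 0.1, "b", 0.2]

# The splat passes each array element as its own argument.
_key, args = fake_hmset("feature1", *flat)
p args.each_slice(2).to_a # prints [["a", 0.1], ["b", 0.2]]
```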

Then, when you want to get it back out, use Redis#hgetall, which returns a Ruby Hash (note that the values come back as strings):

def self.topicize token_vector
  feature_vector = FeatureVector.new

  feature_vector.push locate( token_vector, REDIS.hgetall "feature1" )
  # ...
  feature_vector.push locate( token_vector, REDIS.hgetall "feature25" )

  feature_vector
end

However! If you only need the values from the hash, and not the keys, you can streamline things a little more by using Redis#hvals instead of hgetall; it returns just an array of the values. Keep in mind that this only gives the right dot product if both vectors list their values in the same key order.

The second place you might be spending a lot of cycles is in locate, which you haven't provided the source for, but there are a lot of ways to write a dot product method in Ruby and some of them are more performant than others. This ruby-talk thread covers some valuable ground. One of the posters points to NArray, a library that implements numeric arrays and vectors in C.
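
As a rough way to compare candidates, the standard library's Benchmark module can time two pure-Ruby variants side by side. This is a sketch under assumptions: the method names, the synthetic data, and the sizes are all made up for illustration, and the timings will depend on your machine and Ruby version.

```ruby
require 'benchmark'

# Variant 1: walk the first hash and look each key up in the second.
def dot_inject(h1, h2)
  h1.inject(0) { |sum, (key, value)| sum + value * (h2[key] || 0) }
end

# Variant 2: walk only the smaller hash, since a key missing from either
# side contributes nothing to the sum.
def dot_smaller_first(h1, h2)
  small, large = h1.size <= h2.size ? [h1, h2] : [h2, h1]
  total = 0
  small.each do |key, value|
    other = large[key]
    total += value * other if other
  end
  total
end

# Synthetic data; sizes chosen arbitrarily for illustration.
big_vector   = Hash[(1..100_000).map { |i| ["t#{i}", i * 0.001] }]
small_vector = Hash[(1..500).map { |i| ["t#{i * 7}", 1.0] }]

Benchmark.bm(16) do |bm|
  bm.report("inject:")        { 5.times { dot_inject(big_vector, small_vector) } }
  bm.report("smaller first:") { 5.times { dot_smaller_first(big_vector, small_vector) } }
end
```

Timing the Redis fetch and the deserialization the same way, as separate report lines, would show which step actually dominates the 18 seconds.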

If I understand your code correctly it could be reimplemented something like this (prereq: gem install narray):

require 'narray'

def self.topicize token_vector
  # Make sure token_vector is an NVector
  token_vector  = NVector.to_na token_vector unless token_vector.is_a? NVector
  num_feats     = 25

  # Use Redis#multi to bundle every operation into one call.
  # It will return an array of all 25 features' token_vectors.
  feat_token_vecs = REDIS.multi do
    num_feats.times do |feat_idx|
      REDIS.hvals "feature#{feat_idx + 1}"
    end
  end 

  pad_to_len = token_vector.length

  # Get the dot product of each of those arrays with token_vector
  feat_token_vecs.map do |feat_vec|
    # Make sure the array is long enough by padding it out with zeroes (using
    # pad_arr, defined below). (Since Redis only returns strings we have to
    # convert each value with String#to_f first.)
    feat_vec = pad_arr feat_vec.map(&:to_f), pad_to_len

    # Then convert it to an NVector and do the dot product
    token_vector * NVector.to_na(feat_vec)

    # If we need to get a Ruby Array out instead of an NVector use #to_a, e.g.:
    # ( token_vector * NVector.to_na(feat_vec) ).to_a
  end
end

# Utility to pad out array with zeroes to desired size
def pad_arr arr, size
  arr.length < size ?
    arr + Array.new(size - arr.length, 0) : arr
end
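
One assumption worth making explicit: an array-based dot product only matches the hash-based one if both arrays list their values in the same key order. Below is a minimal sketch of aligning two hashes on a shared key list before handing them to an array or NVector routine. The names align_vectors and array_dot are hypothetical, and plain arrays stand in for NVector so the example needs no gems.

```ruby
# Hypothetical helper: project both hashes onto one shared, sorted key list,
# filling in 0.0 for keys a hash does not have. Values are converted with
# to_f so that string values (as returned by Redis) also work.
def align_vectors(h1, h2)
  keys = (h1.keys | h2.keys).sort
  [
    keys.map { |k| h1.fetch(k, 0.0).to_f },
    keys.map { |k| h2.fetch(k, 0.0).to_f }
  ]
end

# Plain-array dot product, standing in for the NVector multiplication.
def array_dot(a, b)
  a.zip(b).inject(0.0) { |sum, (x, y)| sum + x * y }
end

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
# Different key order and string values, as a Redis hash might come back.
hash2 = { "d" => "7", "c" => "6", "a" => "4", "b" => "5" }

v1, v2 = align_vectors(hash1, hash2)
puts array_dot(v1, v2) # prints 32.0
```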

Hope that's helpful!

甜妞爱困 2024-12-13 06:33:28

This isn't really an answer, just a follow-up to my previous comment, since it probably won't fit into a comment. It looks like the Hash/TokenVector issue might not have been the only problem. When I run:

token_vector = Feature.find(1).token_vector
Analysis.locate( token_vector, TokenVector[ REDIS.hgetall( "feature1" ) ] )

I get this error:

TypeError: String can't be coerced into Float
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `*'
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `block in dot'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `each'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `inject'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `dot'
from /Users/RedApple/S/lib/analysis/analysis.rb:223:in `locate'
from (irb):6
from /Users/RedApple/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'

Analysis#locate looks like this:

def self.locate vector1, vector2
  vector1.dot vector2
end

Here is the relevant part of analysis/vectors.rb lines 23-28, the TokenVector#dot method:

def dot vector
  inject 0 do |product,item|
    axis, value = item
    product + value * ( vector[axis] || 0 )
  end
end

I am not sure where the problem is.
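
A likely cause, going by the note in the answer that Redis only returns strings: hgetall hands back values like "0.1" rather than 0.1, so value * vector[axis] ends up multiplying a Float by a String. The sketch below converts the values before taking the dot product. A plain Hash stands in for both the Redis reply and TokenVector, and floatify is a hypothetical helper name.

```ruby
# Hypothetical helper: convert every value of a Redis-style reply to Float.
def floatify(redis_hash)
  Hash[redis_hash.map { |axis, value| [axis, value.to_f] }]
end

# Same logic as the dot method shown above, written as a standalone function.
def dot(vector1, vector2)
  vector1.inject(0) do |product, (axis, value)|
    product + value * (vector2[axis] || 0)
  end
end

token_vector = { "a" => 0.5, "b" => 2.0 }
from_redis   = { "a" => "4", "b" => "0.25" } # string values, as hgetall returns them

begin
  dot(token_vector, from_redis)
rescue TypeError => e
  puts e.message # prints: String can't be coerced into Float
end

puts dot(token_vector, floatify(from_redis)) # prints 2.5
```

Converting once when the features are loaded, rather than on every call, would avoid repeating the to_f work for each dot product.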
