How to efficiently get the dot product of two hashes using Redis in Ruby on Rails

Posted on 2024-12-06 06:33:28

In my database, the features table has a column called token_vector (a hash) that looks like this:

Feature.find(1).token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }

There are 25 of these features. First, I entered the data into Redis with this in script/console:

REDIS.set(  "feature1",
            "#{ TokenVector.to_json Feature.find(1).token_vector }"
)
# ...
REDIS.set(  "feature25",
            "#{ TokenVector.to_json Feature.find(25).token_vector }"
)

TokenVector.to_json converts the hash into JSON format first. The 25 JSON hashes stored in Redis take up about 8 MB.
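To make the storage format concrete, here is a rough sketch of the round trip each value goes through. It assumes TokenVector.to_json and TokenVector.from_json behave roughly like the standard library's JSON.generate and JSON.parse, which the question does not confirm.

```ruby
require "json"

# Assumption: TokenVector.to_json / from_json act like JSON.generate / JSON.parse.
token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }

stored   = JSON.generate(token_vector) # the string that would go into Redis
restored = JSON.parse(stored)          # what every read has to rebuild

puts stored
puts restored == token_vector
```

Every read pays for a full parse of the stored string, so the cost grows with the size of each feature's hash.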

I have a method called Analysis.locate that takes the dot product of two token_vectors. The dot product for hashes works like this:

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

For each key present in both hashes (a, b, and c in this case, but not d), the two values are multiplied together, and the products are then added up.

The value for a in hash1 is 1, the value for a in hash2 is 4. Multiply these to get 1*4 = 4.

The value for b in hash1 is 2, the value for b in hash2 is 5. Multiply these to get 2*5 = 10.

The value for c in hash1 is 3, the value for c in hash2 is 6. Multiply these to get 3*6 = 18.

hash1 has no value for d, and the value for d in hash2 is 7. In this case, treat d as 0 in the first hash. Multiply these to get 0*7 = 0.

Now add up the multiplied values. 4 + 10 + 18 + 0 = 32. This is the dot product of hash1 and hash2.

Analysis.locate( hash1, hash2 ) # => 32
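The walkthrough above can be sketched in plain Ruby. This is only an illustration: hash_dot is a hypothetical name, not the actual Analysis.locate implementation, and it assumes both hashes hold numeric values.

```ruby
# Hypothetical helper: dot product of two hashes keyed by token.
# A key missing from the second hash counts as 0.
def hash_dot(hash1, hash2)
  hash1.sum { |key, value| value * hash2.fetch(key, 0) }
end

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

puts hash_dot(hash1, hash2) # 1*4 + 2*5 + 3*6 = 32
```

Keys that appear only in the second hash (d here) are never visited, which has the same effect as multiplying them by 0.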

I have a method that is used often, Analysis.topicize. It takes one parameter, token_vector, which is just a hash like the ones above. Analysis.topicize takes the dot product of token_vector with each of the 25 features' token_vectors and collects those 25 dot products into a new vector called feature_vector. A feature_vector is just an array. Here is what the code looks like:

def self.topicize token_vector

  feature_vector = FeatureVector.new

  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature1" ) )
  )
  # ...
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature25" ) )
  )

  feature_vector

end

As you can see, it takes the dot product of token_vector and each feature's token_vector that I entered into Redis above, and pushes the value into an array.
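The same flow can be written as a loop over the feature keys. The sketch below is self-contained, so it swaps Redis for a plain in-memory hash (store) and uses the standard JSON library in place of TokenVector; those stand-ins, and the names dot and topicize_sketch, are assumptions for illustration only.

```ruby
require "json"

# Hypothetical stand-in for the dot product described earlier.
def dot(hash1, hash2)
  hash1.sum { |key, value| value * hash2.fetch(key, 0) }
end

# store plays the role of Redis here: feature key => JSON string.
def topicize_sketch(token_vector, store, feature_count)
  (1..feature_count).map do |i|
    dot(token_vector, JSON.parse(store.fetch("feature#{i}")))
  end
end

store = {
  "feature1" => JSON.generate({ "a" => 1, "b" => 2 }),
  "feature2" => JSON.generate({ "b" => 3 }),
  "feature3" => JSON.generate({ "z" => 9 })
}

p topicize_sketch({ "a" => 2, "b" => 1 }, store, 3) # => [4, 3, 0]
```

Note that every call re-parses every stored feature, which is where a large share of the time could plausibly go.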

My problem is, this takes about 18 seconds each time I invoke the method. Am I misusing Redis? I think the problem could be that I shouldn't load Redis data into Ruby. Am I supposed to send Redis the data (token_vector) and write a Redis function to have it do the dot_product function, rather than writing it with Ruby code?

怂人 2024-12-13 06:33:28

You would have to profile it to be sure, but I suspect you're losing a lot of time in serializing/deserializing JSON objects. Instead of turning token_vector into a JSON string, why not put it directly into Redis, since Redis has its own hash type?

REDIS.hmset "feature1",   *Feature.find(1).token_vector.flatten
# ...
REDIS.hmset "feature25",  *Feature.find(25).token_vector.flatten

Hash#flatten turns a hash like { 'a' => 1, 'b' => 2 } into an array like [ 'a', 1, 'b', 2 ], and the splat (*) then passes each element of that array as a separate argument to Redis#hmset (the "m" in "hmset" is for "multiple," as in "set multiple hash values at once").
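Here is a small sketch of what flatten and the splat do, without touching Redis. show_args is a hypothetical stand-in for a method that, like hmset, takes a key followed by alternating fields and values.

```ruby
token_vector = { "a" => 0.1, "b" => 0.2 }

flat = token_vector.flatten
p flat # => ["a", 0.1, "b", 0.2]

# Hypothetical stand-in for a method that takes a key plus field/value
# arguments; it only shows what the splat hands over.
def show_args(key, *field_values)
  [key, field_values.each_slice(2).to_a]
end

p show_args("feature1", *flat)
# => ["feature1", [["a", 0.1], ["b", 0.2]]]
```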

Then when you want to get it back out use Redis#hgetall, which automatically returns a Ruby Hash:

def self.topicize token_vector
  feature_vector = FeatureVector.new

  feature_vector.push locate( token_vector, REDIS.hgetall "feature1" )
  # ...
  feature_vector.push locate( token_vector, REDIS.hgetall "feature25" )

  feature_vector
end

However! Since you only care about the values, and not the keys, from the hash, you can streamline things a little more by using Redis#hvals, which just returns an array of the values, instead of hgetall.
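One caveat worth checking before switching to hvals: the dot product in the question matches values by key, so dropping the keys gives the same result only when both vectors list exactly the same keys in the same order. A minimal sketch with made-up data shows how the two can differ:

```ruby
query   = { "a" => 1, "c" => 3 }
feature = { "a" => 4, "b" => 5, "c" => 6 }

# Matching by key, as described in the question.
by_key = query.sum { |key, value| value * feature.fetch(key, 0) }

# Matching by position, which is all that bare value arrays allow.
by_position = query.values.zip(feature.values).sum { |x, y| x * (y || 0) }

puts by_key      # 1*4 + 3*6 = 22
puts by_position # 1*4 + 3*5 = 19
```

If the token sets differ between vectors, keeping the keys (hgetall) is the safer choice.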

The second place you might be spending a lot of cycles is in locate, which you haven't provided the source for, but there are a lot of ways to write a dot product method in Ruby and some of them are more performant than others. This ruby-talk thread covers some valuable ground. One of the posters points to NArray, a library that implements numeric arrays and vectors in C.
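To see which of the two suspects dominates, a quick measurement with the standard Benchmark module may help. The sketch below uses synthetic data; the sizes and token names are made up, and the standard JSON library stands in for TokenVector.from_json.

```ruby
require "benchmark"
require "json"

# Synthetic stand-ins for one stored feature and one incoming token_vector.
feature      = (1..20_000).to_h { |i| ["token#{i}", i / 1000.0] }
token_vector = (1..500).to_h { |i| ["token#{i * 7}", 1.0] }
json         = JSON.generate(feature)

# Time the deserialization step on its own.
parse_time = Benchmark.realtime { JSON.parse(json) }

# Time the dot product on an already-parsed hash.
parsed   = JSON.parse(json)
dot_time = Benchmark.realtime do
  token_vector.sum { |key, value| value * parsed.fetch(key, 0) }
end

puts format("parse: %.4fs, dot product: %.4fs", parse_time, dot_time)
```

Running the same kind of measurement against the real data would show whether parsing or locate deserves attention first.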

If I understand your code correctly it could be reimplemented something like this (prereq: gem install narray):

require 'narray'

def self.topicize token_vector
  # Make sure token_vector is an NVector
  token_vector  = NVector.to_na token_vector unless token_vector.is_a? NVector
  num_feats     = 25

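  # Use Redis#multi to bundle every operation into one call.
  # It will return an array of all 25 features' token_vectors.
  # Commands are queued on the object yielded to the block (redis-rb 4.6+
  # deprecates calling REDIS itself inside the block).
  feat_token_vecs = REDIS.multi do |multi|
    num_feats.times do |feat_idx|
      multi.hvals "feature#{feat_idx + 1}"
    end
  end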

  pad_to_len = token_vector.length

  # Get the dot product of each of those arrays with token_vector
  feat_token_vecs.map do |feat_vec|
    # Make sure the array is long enough by padding it out with zeroes (using
    # pad_arr, defined below). (Since Redis only returns strings we have to
    # convert each value with String#to_f first.)
    feat_vec = pad_arr feat_vec.map(&:to_f), pad_to_len

    # Then convert it to an NVector and do the dot product
    token_vector * NVector.to_na(feat_vec)

    # If we need to get a Ruby Array out instead of an NVector use #to_a, e.g.:
    # ( token_vector * NVector.to_na(feat_vec) ).to_a
  end
end

# Utility to pad out array with zeroes to desired size
def pad_arr arr, size
  arr.length < size ?
    arr + Array.new(size - arr.length, 0) : arr
end

Hope that's helpful!

甜妞爱困 2024-12-13 06:33:28

This isn't really an answer, just a follow-up to my previous comment, since it probably won't fit into a comment. It looks like the Hash/TokenVector issue might not have been the only problem. I ran:

token_vector = Feature.find(1).token_vector
Analysis.locate( token_vector, TokenVector[ REDIS.hgetall( "feature1" ) ] )

and get this error:

TypeError: String can't be coerced into Float
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `*'
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `block in dot'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `each'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `inject'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `dot'
from /Users/RedApple/S/lib/analysis/analysis.rb:223:in `locate'
from (irb):6
from /Users/RedApple/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'

Analysis#locate looks like this:

def self.locate vector1, vector2
  vector1.dot vector2
end

Here is the relevant part of analysis/vectors.rb lines 23-28, the TokenVector#dot method:

def dot vector
  inject 0 do |product,item|
    axis, value = item
    product + value * ( vector[axis] || 0 )
  end
end

I am not sure where the problem is.
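A likely cause, assuming the values came back from Redis as strings (the earlier answer notes that Redis only returns strings): value * ( vector[axis] || 0 ) then multiplies a Float by a String. The sketch below reproduces that with a plain hash standing in for the REDIS.hgetall result and converts the values with String#to_f before taking the dot product.

```ruby
# Simulated result of REDIS.hgetall: every value is a String.
from_redis   = { "a" => "0.5", "b" => "2" }
token_vector = { "a" => 2.0, "b" => 1.5, "c" => 1.0 }

begin
  token_vector["a"] * from_redis["a"] # Float * String
rescue TypeError => e
  puts e.message # the same kind of error as in the trace above
end

# Convert the stored strings to floats once, then take the dot product.
numeric = from_redis.transform_values(&:to_f)
dot = token_vector.sum { |axis, value| value * (numeric[axis] || 0) }
puts dot # 2.0*0.5 + 1.5*2.0 + 0 = 4.0
```

Hash#transform_values needs Ruby 2.4 or later; on the Ruby 1.9.2 shown in the trace, the same conversion would need a manual loop over the pairs.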
