How to efficiently get the dot product of two hashes using Redis in Ruby on Rails
I have a data structure like this in the features table of my database, called token_vector (a hash):
Feature.find(1).token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }
There are 25 of these features. First, I loaded the data into Redis from script/console:
REDIS.set( "feature1",
  "#{ TokenVector.to_json Feature.find(1).token_vector }"
)
# ...
REDIS.set( "feature25",
  "#{ TokenVector.to_json Feature.find(25).token_vector }"
)
TokenVector.to_json converts the hash into JSON format first. The 25 JSON hashes stored in Redis take up about 8 MB.
I have a method called Analysis#locate. It takes the dot product of two token_vectors. The dot product for hashes works like this:
hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }
The values of each overlapping key (a, b, and c in this case, but not d) are multiplied pairwise, then the products are added up.
The value for a in hash1 is 1, and the value for a in hash2 is 4. Multiply these to get 1*4 = 4.
The value for b in hash1 is 2, and the value for b in hash2 is 5. Multiply these to get 2*5 = 10.
The value for c in hash1 is 3, and the value for c in hash2 is 6. Multiply these to get 3*6 = 18.
The value for d in hash1 is nonexistent, and the value for d in hash2 is 7. In this case, set d = 0 for the first hash. Multiply these to get 0*7 = 0.
Now add up the multiplied values: 4 + 10 + 18 + 0 = 32. This is the dot product of hash1 and hash2.
Analysis.locate( hash1, hash2 ) # => 32
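The rule described above can be sketched in plain Ruby. This is only an illustration: `dot_product` is a hypothetical helper name, not the actual Analysis.locate (whose source is not shown), and it assumes both arguments are ordinary hashes with numeric values.

```ruby
# Hypothetical helper illustrating the hash dot product described above.
# It is NOT the author's Analysis.locate, whose source is not shown.
def dot_product(hash1, hash2)
  # Walk the smaller hash; a key missing from the other hash counts as 0.
  small, large = hash1.size <= hash2.size ? [hash1, hash2] : [hash2, hash1]
  small.sum { |key, value| value * large.fetch(key, 0) }
end

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

puts dot_product(hash1, hash2) # prints 32
```

Iterating over the smaller hash keeps the work proportional to the shorter vector, which may matter when one token_vector is much larger than the other.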
I have a frequently used method, Analysis#topicize. It takes one parameter, token_vector, which is just a hash like the ones above. Analysis#topicize takes the dot product of token_vector and each of the 25 features' token_vectors, and creates a new vector of those 25 dot products, called feature_vector. A feature_vector is just an array. Here is what the code looks like:
def self.topicize token_vector
  feature_vector = FeatureVector.new
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature1" ) )
  )
  # ...
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature25" ) )
  )
  feature_vector
end
As you can see, it takes the dot product of token_vector and each feature's token_vector that I entered into Redis above, and pushes the value into an array.
My problem is that this takes about 18 seconds each time I invoke the method. Am I misusing Redis? I think the problem could be that I shouldn't load Redis data into Ruby. Am I supposed to send Redis the data (token_vector) and write a Redis function that performs the dot_product there, rather than writing it in Ruby?
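One way to narrow down where the time goes is to time the JSON parsing and the dot product separately. The sketch below is only an illustration under stated assumptions: the vector sizes and token names are made up, the dot product is a plain-Ruby stand-in rather than Analysis#locate, and Redis itself is left out so the script runs on its own.

```ruby
require "benchmark"
require "json"

# Synthetic stand-ins for a stored feature and an incoming token_vector.
# The sizes and key names are assumptions chosen for illustration only.
feature = (1..50_000).to_h { |i| ["token#{i}", rand] }
query   = (1..1_000).to_h { |i| ["token#{i * 7}", rand] }
payload = JSON.generate(feature)

# Step 1: time deserializing the JSON string back into a Hash.
parsed = nil
parse_time = Benchmark.realtime { parsed = JSON.parse(payload) }

# Step 2: time a plain-Ruby dot product (stand-in for Analysis#locate).
dot = nil
dot_time = Benchmark.realtime do
  dot = query.sum { |key, value| value * parsed.fetch(key, 0) }
end

puts format("JSON.parse: %.4fs, dot product: %.4fs", parse_time, dot_time)
```

Running the same kind of measurement around the real REDIS.get, TokenVector.from_json, and locate calls would show which of the three dominates.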
Comments (2)
You would have to profile it to be sure, but I suspect you're losing a lot of time serializing and deserializing JSON objects. Instead of turning token_vector into a JSON string, why not put it directly into Redis, since Redis has its own hash type?
What Hash#flatten does is turn a hash like { 'a' => 1, 'b' => 2 } into an array like [ 'a', 1, 'b', 2 ], and then we use splat (*) to send each element of the array as an argument to Redis#hmset (the "m" in "hmset" is for "multiple," as in "set multiple hash values at once").
Then when you want to get it back out, use Redis#hgetall, which automatically returns a Ruby Hash:
However! Since you only care about the values, and not the keys, from the hash, you can streamline things a little more by using Redis#hvals, which just returns an array of the values, instead of hgetall.
The second place you might be spending a lot of cycles is in locate, which you haven't provided the source for, but there are a lot of ways to write a dot product method in Ruby and some of them are more performant than others. This ruby-talk thread covers some valuable ground. One of the posters points to NArray, a library that implements numeric arrays and vectors in C.
If I understand your code correctly it could be reimplemented something like this (prereq: gem install narray):
Hope that's helpful!
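The Hash#flatten and splat step described in this answer can be sketched roughly as follows. To keep it runnable without a server, it uses `FakeRedis`, a hypothetical in-memory stand-in written for this example; it only mimics the hmset, hgetall, and hvals calls, and like Redis it hands every value back as a string. With a real redis-rb client the equivalent call would be along the lines of REDIS.hmset("feature1", *flat).

```ruby
# Hypothetical in-memory stand-in for the three Redis hash commands used
# below, so the sketch runs without a server. Like Redis, it stores and
# returns all fields and values as strings.
class FakeRedis
  def initialize
    @store = Hash.new { |hash, key| hash[key] = {} }
  end

  # Accepts field, value, field, value, ... just like Redis#hmset.
  def hmset(key, *field_value_pairs)
    field_value_pairs.each_slice(2) do |field, value|
      @store[key][field.to_s] = value.to_s
    end
    "OK"
  end

  def hgetall(key)
    @store[key].dup
  end

  def hvals(key)
    @store[key].values
  end
end

redis = FakeRedis.new
token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }

flat = token_vector.flatten          # ["a", 0.1, "b", 0.2, "c", 0.3]
redis.hmset("feature1", *flat)       # splat passes each element as an argument

stored = redis.hgetall("feature1")   # values come back as strings
values = redis.hvals("feature1").map(&:to_f)

p stored
p values
```

Two caveats apply. Values read back from Redis are strings, so they need to_f before any arithmetic. Also, working from hvals alone only lines up correctly when both vectors have the same fields in the same order; for sparse hashes with different keys, as in the question's example with d, the keys from hgetall are still needed.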
This isn't really an answer, just a follow up to my previous comment, since this probably won't fit into a comment. It looks like the Hash/TokenVector issue might not have been the only problem. I do:
and get this error:
Analysis#locate looks like this:
Here is the relevant part of analysis/vectors.rb lines 23-28, the TokenVector#dot method:
I am not sure where the problem is.