Am I misunderstanding String#hash in Ruby?
I am processing a bunch of data and I haven't coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:
SELECT body, COUNT(body) AS dup_count
FROM comments
GROUP BY body
HAVING (COUNT(body) > 1)
And I get back a list of duplicates. Looking into this, I find that these duplicates have multiple hashes. The shortest comment string is "[deleted]", so let's use that as an example. In my database there are nine instances of a comment being "[deleted]", and between them they have two different hash values: 1169143752200809218 and 1738115474508091027. The value starting with 116 is found 6 times and the one starting with 173 is found 3 times. But when I run it in IRB, I get the following:
a = '[deleted]'.hash # => 811866697208321010
Here is the code I'm using to produce the hash:
def comment_and_hash(chunk)
  comment = chunk.at_xpath('*/span[@class="comment"]').text # Get comment
  hash = comment.hash
  return comment, hash
end
I've confirmed that I don't touch the comment anywhere else in my code. Here is my DataMapper class.
class Comment
  include DataMapper::Resource
  property :uid    , Serial
  property :author , String
  property :date   , Date
  property :body   , Text
  property :arank  , Float
  property :srank  , Float
  property :parent , Integer # Should be the UID of another comment, or blank if this is a parent
  property :value  , Integer # Hash to prevent duplicates from occurring
end
Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?
Which value is the correct one, assuming my string consists of "[deleted]"?
Is there a way I could have different strings inside Ruby that SQL would see as the same string? That seems to be the most plausible explanation for why this is occurring, but I'm really shooting in the dark.
Comments (3)
If you run
ruby -e "puts '[deleted]'.hash"
several times, you will notice that the value is different each time. In fact, the hash value only stays constant for as long as your Ruby process is alive. The reason is that String#hash is seeded with a random value: rb_str_hash (the C function that implements it) uses rb_hash_start, which uses this random seed, and the seed is initialized every time Ruby is spawned.
You could use a CRC such as Zlib.crc32 for your purposes, or you may want to use one of the message digests of OpenSSL::Digest, although the latter is overkill, since for detecting duplicates you probably won't need the security properties.
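As a rough illustration of the two options mentioned above, here is a minimal sketch. The helper names stable_crc32 and stable_sha256 are made up for this example and do not come from the question's code; only Zlib.crc32 and OpenSSL::Digest are real library calls.

```ruby
require 'zlib'
require 'openssl'

# Hypothetical helper: a 32-bit checksum that does not depend on the
# per-process seed used by String#hash.
def stable_crc32(text)
  Zlib.crc32(text)
end

# Hypothetical helper: a SHA-256 digest as a hex string. More than is
# needed for duplicate detection, but collisions are far less likely
# than with a 32-bit CRC.
def stable_sha256(text)
  OpenSSL::Digest.new('SHA256').hexdigest(text)
end

body = '[deleted]'
puts stable_crc32(body)
puts stable_sha256(body)
```

Both values should come out the same on every run and on every machine, unlike String#hash. Since a CRC32 has only 32 bits, unrelated comments can share a value, so it is probably safest to treat a match as "possibly a duplicate" and compare the bodies as well.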
I use the following to create String#hash alternatives that are consistent across time and processes.
Ruby intentionally makes String.hash produce different values in different sessions; see "Why is Ruby String.hash inconsistent across machines?"
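To see this behaviour for yourself, something like the following sketch may help. The method name hash_in_new_process is invented for this example; it simply starts a fresh Ruby interpreter (located via RbConfig.ruby) and asks it for the hash of the same string.

```ruby
require 'rbconfig'

# Hypothetical helper: compute String#hash for `text` in a brand-new
# Ruby process, so that it gets its own hash seed.
def hash_in_new_process(text)
  IO.popen([RbConfig.ruby, '-e', 'print ARGV[0].hash', text], &:read).to_i
end

text = '[deleted]'

# Equal strings hash to the same value within a single process...
same_process = [text.hash, text.dup.hash]

# ...but each newly started process is expected to report its own value.
new_processes = [hash_in_new_process(text), hash_in_new_process(text)]

puts "same process:  #{same_process.inspect}"
puts "new processes: #{new_processes.inspect}"
```

On a Ruby with seeded hashing, the first pair should be identical while the second pair should almost certainly differ, which would explain seeing several stored hash values for the same comment body after several runs of the importer.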