Am I misunderstanding String#hash in Ruby?

Posted 2024-12-09 03:03:22

I am processing a bunch of data and I haven't coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:

SELECT     body, COUNT(body) AS dup_count 
FROM         comments
GROUP BY body
HAVING     (COUNT(body) > 1) 

and got back a list of duplicates. Looking into this, I found that these duplicates have more than one hash value. The shortest comment string is "[deleted]", so let's use that as an example. My database contains nine comments whose body is "[deleted]", and they are stored with two different hashes: 1169143752200809218 and 1738115474508091027. The value beginning with 116 appears six times and the one beginning with 173 appears three times. But when I run it in IRB, I get the following:

a = '[deleted]'.hash # => 811866697208321010

Here is the code I'm using to produce the hash:

def comment_and_hash(chunk)     
  comment = chunk.at_xpath('*/span[@class="comment"]').text ##Get Comment##
  hash = comment.hash
  return comment,hash
end

I've confirmed that I don't touch comment anywhere else in my code. Here is my DataMapper class:

class Comment

    include DataMapper::Resource

    property :uid       , Serial
    property :author    , String
    property :date      , Date
    property :body      , Text
    property :arank     , Float 
    property :srank     , Float 
    property :parent    , Integer #Should Be UID of another comment or blank if parent
    property :value     , Integer #Hash to prevent duplicates from occurring

end

Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?

Which value is the correct value assuming my string consists of "[deleted]"?

Is there a way I could have different strings inside Ruby that SQL would see as the same string? That seems to be the most plausible explanation for why this is occurring, but I'm really shooting in the dark.

Comments (3)

吾性傲以野 2024-12-16 03:03:22

If you run

ruby -e "puts '[deleted]'.hash"

several times, you will notice that the value is different each time. The hash value stays constant only for the lifetime of a single Ruby process. The reason is that String#hash is seeded with a random value: rb_str_hash (the C function that implements it) calls rb_hash_start, which mixes in a random seed that is initialized every time a Ruby process starts.
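To see this for yourself, here is a minimal sketch (not part of the original answer) that compares the hash inside one process with the hash computed by freshly spawned interpreters. It assumes the interpreter path reported by RbConfig.ruby can be launched from your environment; hash_in_new_process is a hypothetical helper name.

```ruby
require 'rbconfig'

# Hypothetical helper: computes str.hash in a brand-new Ruby process.
def hash_in_new_process(str)
  IO.popen([RbConfig.ruby, '-e', 'print ARGV[0].hash', str], &:read).to_i
end

s = '[deleted]'

# Within one process, equal strings always hash to the same value.
same_process = (s.hash == '[deleted]'.dup.hash)

# Each new interpreter picks its own random seed, so these two values
# will almost certainly differ from each other.
first  = hash_in_new_process(s)
second = hash_in_new_process(s)

puts "stable within process:    #{same_process}"
puts "differs across processes: #{first != second}"
```

This would explain why the database holds two different values for "[deleted]": the rows were presumably written by separate runs of the data processor, and the IRB session is yet another process.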

For your purposes you could use a CRC such as Zlib.crc32, or one of the message digests in OpenSSL::Digest, although the latter is overkill: for detecting duplicates you probably don't need its security properties.
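A rough sketch of both options follows. Note one substitution: it uses Digest::SHA256 from the standard digest library as a stand-in for OpenSSL::Digest, simply to keep the example dependency-light. The variable names are made up for illustration.

```ruby
require 'zlib'
require 'digest'

body = '[deleted]'

# CRC32: an unsigned 32-bit integer. Cheap to compute, but not
# collision-resistant, so distinct strings can share a value.
crc = Zlib.crc32(body)

# SHA-256: a 64-character hex string. Far stronger than duplicate
# detection needs, and it would need a String column, not an Integer.
sha = Digest::SHA256.hexdigest(body)

puts crc
puts sha
```

Both values depend only on the bytes of the string, so they are the same in every process and on every machine. One thing to watch: a CRC32 can be as large as 2**32 - 1, which does not fit in a signed 32-bit integer column, so check what column type your Integer property maps to.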

醉生梦死 2024-12-16 03:03:22

I use the following to create String#hash alternatives that are consistent across time and processes:

require 'zlib'

def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end
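
As a usage sketch (the sample comments below are made up, and this is only one way to wire it in), generate_id can drive a simple in-memory duplicate check:

```ruby
require 'zlib'
require 'set'

def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end

comments = ['[deleted]', 'first post', '[deleted]'] # made-up sample data
seen = Set.new

# Set#add? returns nil when the id is already present, so select
# keeps only the first comment seen for each id.
unique = comments.select { |body| seen.add?(generate_id(body)) }

puts unique.inspect
```

The second "[deleted]" is dropped because it maps to the same id as the first. Since a CRC reduced modulo 2 ** 30 - 1 can collide for different strings, it is safer to treat a matching id as a hint and compare the bodies before discarding anything.
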

美男兮 2024-12-16 03:03:22

Ruby intentionally makes String#hash produce different values in different sessions. See: Why is Ruby String.hash inconsistent across machines?
