Rabin Karp 滚动哈希生成的哈希未反映在文本上

发布于 2024-12-26 23:28:02 字数 762 浏览 1 评论 0原文

注意：有很多可能的重复项，但似乎没有什么可以解决我的问题。

我正在研究基于 MOSS 的抄袭检测。

成功实现删除所有必要细节（注释、标点符号等）的过滤器后，我使用滚动哈希实现（Rabin Karp）对内容进行哈希处理，

但是在源代码的两个文本文件中匹配的哈希值具有非常不同的底层文本（没有抄袭，但哈希值相同）

我实现的算法（Ruby） --> （部分片段）

 #Preprocessing from RobinKarp Algorithm
  for c in 0...k do
    text_hash=(radix*text_hash+text_to_process[c].ord)%q
  end

  #Main loop
  for c in 0...loop do   
        text_hash=((radix*text_hash-(text_to_process[c].ord)*highorder)+(text_hash[c+k].ord))%q

我的实施有问题吗？或者我指定的参数可能有问题？

我取基数=34 （我不确定它是否是正确的值，我假设删除的文本将只包含字母+一些特殊字符，如“+”、“-”、“*”、“/”，所以粗略估计总共 34字符）

我将 q(prime) 设为 101

这是我正在处理的碰撞问题吗？有关如何解决该问题的任何指示吗？

原文

Note: Lots of Possible duplicates, but nothing seems to be solving my problem.

I am working on a Plagiarism detection based on MOSS.

After successfully implementing a filter which strips out all the necessary details(comments,punctuations etc) I hash the content using a Rolling Hash Implementation(Rabin Karp)

However the Hashes which match in two text-files of source code, have very different underlying text(No plagiarism and yet same hashes)

The Algorithm I implemented(Ruby) -->
(Partial Snippet)

 #Preprocessing from RobinKarp Algorithm
  for c in 0...k do
    text_hash=(radix*text_hash+text_to_process[c].ord)%q
  end

  #Main loop
  for c in 0...loop do   
        text_hash=((radix*text_hash-(text_to_process[c].ord)*highorder)+(text_hash[c+k].ord))%q

Is there an issue with my implementation? Or the Parameters I specify can be at fault?

I take radix=34
( I am not sure if it is the right value, I am assuming the stripped out text will only contain alphabets+some special charcters like '+','-','*','/' so a rough estimate of total 34 characters)

I am taking q(prime) to be 101

Is this a collision issue I am dealing with? Any pointers as to how to tackle the problem?

分享到QQ

分享到微博