Rabin Karp 滚动哈希生成的哈希未反映在文本上
注意:有很多可能的重复项,但似乎没有什么可以解决我的问题。
我正在研究基于 MOSS 的抄袭检测。
成功实现删除所有必要细节(注释、标点符号等)的过滤器后,我使用滚动哈希实现(Rabin Karp)对内容进行哈希处理,
但是在源代码的两个文本文件中匹配的哈希值具有非常不同的底层文本(没有抄袭,但哈希值相同)
我实现的算法(Ruby) --> (部分片段)
#Preprocessing from RobinKarp Algorithm
for c in 0...k do
text_hash=(radix*text_hash+text_to_process[c].ord)%q
end
#Main loop
for c in 0...loop do
text_hash=((radix*text_hash-(text_to_process[c].ord)*highorder)+(text_hash[c+k].ord))%q
我的实施有问题吗?或者我指定的参数可能有问题?
我取基数=34 (我不确定它是否是正确的值,我假设删除的文本将只包含字母+一些特殊字符,如“+”、“-”、“*”、“/”,所以粗略估计总共 34字符)
我将 q(prime) 设为 101
这是我正在处理的碰撞问题吗?有关如何解决该问题的任何指示吗?
Note: Lots of Possible duplicates, but nothing seems to be solving my problem.
I am working on a Plagiarism detection based on MOSS.
After successfully implementing a filter which strips out all the necessary details(comments,punctuations etc) I hash the content using a Rolling Hash Implementation(Rabin Karp)
However the Hashes which match in two text-files of source code, have very different underlying text(No plagiarism and yet same hashes)
The Algorithm I implemented(Ruby) -->
(Partial Snippet)
#Preprocessing from RobinKarp Algorithm
for c in 0...k do
text_hash=(radix*text_hash+text_to_process[c].ord)%q
end
#Main loop
for c in 0...loop do
text_hash=((radix*text_hash-(text_to_process[c].ord)*highorder)+(text_hash[c+k].ord))%q
Is there an issue with my implementation? Or the Parameters I specify can be at fault?
I take radix=34
( I am not sure if it is the right value, I am assuming the stripped out text will only contain alphabets+some special charcters like '+','-','*','/' so a rough estimate of total 34 characters)
I am taking q(prime) to be 101
Is this a collision issue I am dealing with? Any pointers as to how to tackle the problem?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我注意到,当 q = 101 时,只有 101 个可能的哈希值 - 0、1、2...100。你试过增加q吗?另一种方法是查看哈希值是否看起来像是在 0,1..q-1 可能值内随机选择的值。
当然,您还应该在存在重复字符串的情况下测试您的程序以供查找 - 失败可能是任何问题的另一种症状,也会导致冲突,并且会更容易查找和调试。
I note that with q = 101, there are only 101 possible hash values - 0, 1, 2...100. Have you tried increasing q? Another approach would be to look and see if the hash values look like they are randomly chosen values within the possible values of 0,1..q-1.
You should of course also test your program on cases where there are repeated strings for it to find - a failure there could be another symptom of any problem that is also causing collisions, and it would be easier to find and debug.