Using the Rabin–Karp algorithm with a rolling hash for plagiarism detection
I am using the Rabin–Karp algorithm to check any two source code files for plagiarism.
As a first step I simply implemented the algorithm in C#; the code is below. Its average and best-case running time is O(n+m) in O(p) space, but its worst-case time is O(nm).
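public void plagiarism(string[] file1, string[] file2)
{
    // Naive search: try every alignment of file2 (the pattern) inside file1 (the text).
    int matches = 0;
    for (int i = 0; i <= file1.Length - file2.Length; i++)
    {
        // Compare line by line until a mismatch or the end of file2.
        int j = 0;
        while (j < file2.Length && file1[i + j] == file2[j])
        {
            j++;
        }
        if (j == file2.Length)
        {
            // Every line of file2 matched at offset i.
            matches++;
            Console.WriteLine("copied at line " + i);
        }
    }
    if (matches == 0)
    {
        Console.WriteLine("not copied");
    }
}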
So how can I make it more efficient by using a rolling hash function, since that is supposed to be better than this?
The Wikipedia article has a reasonably good discussion of the algorithm, and even mentions how you can implement the rolling hash function (see "Use of hashing for shifting substring search"). It also addresses how to improve runtime speed using a hash table or Bloom filter.
You also have to understand that the worst case is a fairly contrived example. The example given in the Wikipedia article is 'searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s.'
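To make the worst case concrete, here is a minimal sketch that counts character comparisons for a plain nested-loop search on a scaled-down version of that example. The class and method names (WorstCaseSketch, CountComparisons) are made up for illustration, and this measures the naive search, not a hashed one.

```csharp
using System;

public static class WorstCaseSketch
{
    // Counts how many character comparisons a naive search performs.
    public static long CountComparisons(string text, string pattern)
    {
        long comparisons = 0;
        for (int i = 0; i + pattern.Length <= text.Length; i++)
        {
            for (int j = 0; j < pattern.Length; j++)
            {
                comparisons++;
                // Stop this alignment at the first mismatch.
                if (text[i + j] != pattern[j]) break;
            }
        }
        return comparisons;
    }
}
```

With a text of 1,000 "a"s and a pattern of 10 "a"s followed by a "b", every one of the 990 alignments needs all 11 comparisons before it fails, so the count is 10,890, which is close to n*m.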
You should be able to implement the rolling hash using the techniques described in that Wikipedia entry. If you're having trouble implementing that, leave a more specific question about how it's done, showing what you've tried.
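As a starting point, a rolling-hash search along those lines might look like the sketch below. It is only one possible arrangement: the base, the modulus, and the names RabinKarpSketch and FindAll are assumptions chosen for illustration, and it works on characters of a single string rather than on arrays of lines.

```csharp
using System;
using System.Collections.Generic;

public static class RabinKarpSketch
{
    // Assumed parameters (not from the original post): base 256 and a large prime modulus.
    private const long Base = 256;
    private const long Mod = 1000000007;

    // Returns every index in text at which pattern starts.
    public static List<int> FindAll(string text, string pattern)
    {
        var hits = new List<int>();
        int n = text.Length, m = pattern.Length;
        if (m == 0 || m > n) return hits;

        // highPow = Base^(m-1) mod Mod, the weight of the character leaving the window.
        long highPow = 1;
        for (int k = 1; k < m; k++) highPow = (highPow * Base) % Mod;

        // Hash the pattern and the first window of the text.
        long patternHash = 0, windowHash = 0;
        for (int k = 0; k < m; k++)
        {
            patternHash = (patternHash * Base + pattern[k]) % Mod;
            windowHash = (windowHash * Base + text[k]) % Mod;
        }

        for (int i = 0; i + m <= n; i++)
        {
            // Equal hashes only suggest a match; verify to rule out collisions.
            if (windowHash == patternHash &&
                string.CompareOrdinal(text, i, pattern, 0, m) == 0)
            {
                hits.Add(i);
            }

            if (i + m < n)
            {
                // Roll the window: drop text[i], shift, then append text[i + m].
                windowHash = (windowHash - text[i] * highPow % Mod + Mod) % Mod;
                windowHash = (windowHash * Base + text[i + m]) % Mod;
            }
        }
        return hits;
    }
}
```

Each window hash is derived from the previous one in constant time, which is where the O(n+m) average case comes from. The verification step is what costs O(m) per position when many hashes collide.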
It's unlikely that you'll encounter anything approaching the worst case in real-world documents. Even if you were to encounter the worst case, the rolling hash will not reduce the complexity. Implementing the rolling hash gives a linear improvement in runtime, which will be swamped by the n*m complexity. If you find that the worst case happens often, then you probably need a different algorithm.
The other thing to note is that, whereas O(m*n) can be a problem, you have to look at the scale. How large are the documents you're examining? You say you're working with source code files. If you're looking at typical class projects, then you're probably talking about maybe 2,000 lines of code. Those documents aren't going to exhibit the worst case. Even if they did, n*m isn't going to be a very large number.
However, if you have 100 documents and you want to know if any one is a substantial duplicate of another, your larger problem is O(n^2), where n is now the number of documents, because you have to check every document against all the others. The number of document comparisons is equal to (n*(n-1))/2. If you're looking to optimize your process, you need a different algorithm. Ideally, something that will give you a "fingerprint" of a document. That way, you can compute the fingerprint for each document one time, and then compare the fingerprints for similarity.
Document fingerprinting is a well-known problem. However, constructing a fingerprint that's useful for comparison purposes is a bit less straightforward. You'd want to look into a technique called shingling. I also saw some research about using a small Bloom filter (256 bytes or so) to represent a document, and the ability to do fast comparisons using that.
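To give a rough idea of what shingling looks like, the sketch below builds a set of overlapping k-line shingles per file and compares two sets with Jaccard similarity. The names ShingleSketch, Shingles, and Jaccard, the choice of lines as the shingle unit, and the use of Jaccard as the similarity measure are all assumptions for illustration; real systems usually normalize the text more aggressively and hash the shingles.

```csharp
using System;
using System.Collections.Generic;

public static class ShingleSketch
{
    // Builds the set of k-line shingles: every run of k consecutive trimmed lines.
    public static HashSet<string> Shingles(string[] lines, int k)
    {
        var set = new HashSet<string>();
        for (int i = 0; i + k <= lines.Length; i++)
        {
            var parts = new string[k];
            for (int j = 0; j < k; j++) parts[j] = lines[i + j].Trim();
            set.Add(string.Join("\n", parts));
        }
        return set;
    }

    // Jaccard similarity: size of the intersection divided by size of the union (0 to 1).
    public static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 && b.Count == 0) return 1.0;
        int common = 0;
        foreach (var s in a)
        {
            if (b.Contains(s)) common++;
        }
        return (double)common / (a.Count + b.Count - common);
    }
}
```

The shingle set for each file is computed once, so comparing a pair of files only touches the two sets and never rescans the source text.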
All that said, I suspect that if you're talking about a hundred or two source code files that are each maybe 1,000 or 2,000 lines long, the naive O(n^2) comparison technique using a good Rabin–Karp implementation will do what you want. It will take some time (with 100 files you're going to do about 5,000 separate document comparisons), but I don't think the speed of the R-K implementation will be your limiting factor.
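As a quick check of that figure, with n standing for the number of documents (not the text length used earlier), the pair count works out as:

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad
n = 100:\ \frac{100 \cdot 99}{2} = 4950, \qquad
n = 200:\ \frac{200 \cdot 199}{2} = 19900
\]
```

So doubling the number of files roughly quadruples the number of comparisons.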