Designing a noise filter for a plagiarism detection engine in Ruby

Posted on 2024-12-21 22:20:06

I have been working on an implementation of a plagiarism detection engine based on the academic paper behind MOSS (Measure of Software Similarity).

Link to MOSS

To design a noise filter for a language like C, C++, or Java, I have some decisions to make.

Are keywords relevant for detecting plagiarism, or should they be removed?
Source files in the same language are bound to share the same set of keywords. The paper does not discuss how to deal with them.

How should identifiers be handled?
Replacing all identifiers with a single character 'V', so that matches become independent of variable names, makes sense.

What should be done with package imports and library includes?

Whitespace, comments, and punctuation should definitely be stripped.

I am wondering whether, after all these operations, the source file will be just a bunch of 'V's and some other garbled text.

What operations should the noise filter perform?

Any insights and opinions on the best way to deal with noise?


Comments (2)

只是一片海 2024-12-28 22:20:06

For single functions: compile them and compare the resulting assembler code or object files.
For a whole program: do the above for all the functions, and build a fuzzy search that locates the fragments in a database of known functions and fragments.

So basically, you need to build a compiler that emits a canonicalized representation of its input, similar to P-code, but preferably human-readable.

Some fragments are more characteristic than others. The fragment

for (i=0; i < 12345; i++) {
  array[i] = 54321;
  }

will probably occur in some form in every program. It is 100% functionally identical to

j=0;
while ( j < 12345) {
  foobar[j++] = 54321;
  }

and a compiler would probably produce identical code for both.

There can be differences in variable names, numerical constants, address constants, anything. But the "skeleton" of keywords (that is, comparisons, loops, expressions, assignments, function calls) will be the same. So: don't drop the keywords; they are the scaffolding of a program.
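As a rough illustration of keeping the keyword skeleton while hiding names and constants, here is a minimal Ruby sketch. It is not a compiler and not the MOSS implementation; the `normalize` method, the tiny `KEYWORDS` list, and the placeholder tokens 'V', 'N', and 'S' are all assumptions made for this example.

```ruby
require 'strscan'
require 'set'

# Hypothetical, deliberately small keyword list for C-like languages.
# Type names such as `int` are not listed, so they are treated as identifiers.
KEYWORDS = Set.new(%w[
  if else for while do switch case default break continue return goto
]).freeze

# Turn C-like source into a token list that keeps keywords and operators,
# but replaces identifiers with 'V', numbers with 'N', and strings with 'S'.
def normalize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/) || scanner.skip(%r{//[^\n]*}) || scanner.skip(%r{/\*.*?\*/}m)
      next # whitespace and comments are noise
    elsif (word = scanner.scan(/[A-Za-z_]\w*/))
      tokens << (KEYWORDS.include?(word) ? word : 'V')
    elsif scanner.scan(/\d[\w.]*/)
      tokens << 'N'
    elsif scanner.scan(/"(?:\\.|[^"\\])*"/)
      tokens << 'S'
    else
      tokens << scanner.getch # operators and punctuation, one character each
    end
  end
  tokens
end

puts normalize("for (i=0; i < 12345; i++) {\n  array[i] = 54321;\n  }").join(' ')
# prints: for ( V = N ; V < N ; V + + ) { V [ V ] = N ; }
puts normalize("j=0;\nwhile ( j < 12345) {\n  foobar[j++] = 54321;\n  }").join(' ')
# prints: V = N ; while ( V < N ) { V [ V + + ] = N ; }
```

Renaming `i` and `array` or changing the constants leaves the first token stream unchanged, and the output is clearly more than a run of 'V's. The `while` variant still produces a different stream, though, which is why comparing compiled output goes further than token-level filtering.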

薄荷梦 2024-12-28 22:20:06

There is quite a lot to find on Google if you search for "text fingerprint shingle". A shingle is a sequence of x consecutive words (x = 7 in many research projects). You build the set of all shingles by sliding over the text word by word.

You then build a hash over each shingle and compare the thousands of shingles in a text. It's pretty simple. There are a few things, like special hash functions, that you surely haven't heard of outside this context.

Start by reading, for example, the following. It's not really rocket science, but it's not trivial either.
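The shingling idea can be sketched in a few lines of Ruby. This is only an illustration under assumptions: the method names `shingle_hashes` and `resemblance` are made up here, `Zlib.crc32` stands in for the specialized hash functions mentioned above, and comparing the sets by Jaccard resemblance is one common choice rather than something this answer prescribes.

```ruby
require 'set'
require 'zlib'

# Return the set of hashed k-word shingles of a text.
# k = 7 follows the value mentioned above; Zlib.crc32 is only a stand-in hash.
def shingle_hashes(text, k = 7)
  words = text.downcase.scan(/\w+/)
  # each_cons slides a window of k words over the text, one word at a time.
  words.each_cons(k).map { |shingle| Zlib.crc32(shingle.join(' ')) }.to_set
end

# Jaccard resemblance: shared shingles divided by all distinct shingles.
def resemblance(text_a, text_b, k = 7)
  a = shingle_hashes(text_a, k)
  b = shingle_hashes(text_b, k)
  union = a | b
  return 0.0 if union.empty? # texts shorter than k words have no shingles
  (a & b).size.to_f / union.size
end

original = 'the quick brown fox jumps over the lazy dog near the river bank'
edited   = 'the quick brown fox jumps over the lazy dog near the old mill'
puts resemblance(original, edited)
```

A value near 1.0 means the two texts share most of their shingles, and a value near 0.0 means they share few. For source code, the same functions could be applied to a normalized token stream instead of raw words.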

"Text Origin Detection in an Efficient Way" Besnik Fetahu, Andreas Frische
http://resources.mpi-inf.mpg.de/d5/teaching/ws10_11/hir/reports/BesnikFetahu.pdf

"Algorithms for duplicate documents", Andrei Broder
http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf
