Designing a noise filter for a plagiarism detection engine in Ruby

Posted on 2024-12-21 22:20:06

I have been working on a plagiarism detection engine based on the academic paper behind MOSS (Measure Of Software Similarity).

Link to MOSS

To design a noise filter for languages such as C, C++, and Java, I need to make a few decisions.

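Are keywords relevant to detecting plagiarism, or should they be removed? Source files written in the same language necessarily share the same set of keywords. The paper does not discuss how to handle them.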

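How should identifiers be handled? Replacing every identifier with a single character "V", so that matching is independent of variable names, seems sensible.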

What should be done with package imports and library includes?

Whitespace, comments, and punctuation should definitely be dropped.
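The decisions listed above can be tried out with a small filter. What follows is only a minimal sketch under stated assumptions, not the method from the MOSS paper: `filter_noise`, `C_KEYWORDS`, and the `keep_symbols` option are hypothetical names, the keyword list is a small subset of C, and comment stripping is regex-based, so comment markers inside string literals would be mishandled.

```ruby
require "set"

# Hypothetical keyword list: a small subset of C keywords, for illustration only.
C_KEYWORDS = Set.new(%w[
  if else for while do switch case default break continue return
  int char long short float double void unsigned signed struct
  union enum typedef static const sizeof goto
]).freeze

# Reduce C-like source to a whitespace-free token string.
# Keywords are kept, every other identifier becomes "V".
# Naive assumption: comment markers never appear inside string literals.
def filter_noise(source, keep_symbols: false)
  text = source.gsub(%r{/\*.*?\*/}m, " ")    # block comments
  text = text.gsub(%r{//[^\n]*}, " ")         # line comments
  text = text.gsub(/^[ \t]*#[^\n]*$/, " ")    # preprocessor lines such as #include
  # Tokens: identifiers/keywords, integer runs, and optionally single symbols.
  pattern = keep_symbols ? /[A-Za-z_]\w*|\d+|[^\s\w]/ : /[A-Za-z_]\w*|\d+/
  text.scan(pattern).map { |tok|
    if tok.match?(/\A[A-Za-z_]/)
      C_KEYWORDS.include?(tok) ? tok : "V"
    else
      tok
    end
  }.join
end

sample = <<~C
  #include <stdio.h>
  /* fill the table */
  for (i=0; i < 12345; i++) {
    array[i] = 54321; // magic value
  }
C

puts filter_noise(sample)                      # forV0V12345VVV54321
puts filter_noise(sample, keep_symbols: true)  # for(V=0;V<12345;V++){V[V]=54321;}
```

With punctuation dropped, the output is mostly a run of "V"s. Keeping operators and braces (`keep_symbols: true`) preserves more of the program's shape, which is one way to weigh the punctuation decision.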

I wonder whether, after all of these steps, a source file will be reduced to just a pile of "V"s and some other gibberish.

Which operations should the noise filter perform?

Any insights or opinions on the best way to deal with noise?

只是一片海 2024-12-28 22:20:06

For single functions: compile them, and compare the resulting assembly code or object code.
For a whole program: do the above for all the functions, and build a fuzzy search that looks fragments up in a database of known functions and fragments.

So basically, you need to build a compiler that emits a canonicalized representation of its input, similar to P-code, but preferably human-readable.
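The fuzzy lookup mentioned above could be approximated as follows. This is a rough sketch under stated assumptions rather than the approach itself: it works on already-normalized token strings instead of compiler output, it compares plain sets of hashed k-grams using Jaccard similarity (the MOSS paper goes further and selects a subset of such hashes by winnowing), and `fingerprints`, `similarity`, `best_match`, and the `KNOWN` table are hypothetical names.

```ruby
require "set"
require "zlib"

# Hash every overlapping k-character substring of a normalized token string.
def fingerprints(normalized, k: 5)
  return Set.new if normalized.length < k
  (0..normalized.length - k).map { |i| Zlib.crc32(normalized[i, k]) }.to_set
end

# Jaccard similarity: shared fingerprints divided by all distinct fingerprints.
def similarity(a, b)
  union = a | b
  return 0.0 if union.empty?
  (a & b).size.to_f / union.size
end

# Hypothetical database of already-normalized fragments.
KNOWN = {
  "fill_loop" => "for(V=N;V<N;V++){V[V]=N;}",
  "swap"      => "V=V;V=V;V=V;"
}.freeze

# Return the [name, score] pair of the closest known fragment.
def best_match(normalized, k: 5)
  query = fingerprints(normalized, k: k)
  KNOWN.map { |name, text| [name, similarity(query, fingerprints(text, k: k))] }
       .max_by { |_, score| score }
end

name, score = best_match("for(V=N;V<N;V++){V[V]=N;V=V;}")
puts "#{name}: #{score.round(2)}"
```

Because the score is a ratio rather than an exact match, a fragment with an extra statement still lands on its closest known relative. The value of `k` and any acceptance threshold would need tuning on real submissions.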

Some fragments are more characteristic than others. The fragment

for (i=0; i < 12345; i++) {
  array[i] = 54321;
}

will probably occur in some form in every program. It is functionally identical to

j=0;
while (j < 12345) {
  foobar[j++] = 54321;
}

and a compiler would probably produce identical code for both.

Variable names, numerical constants, address constants, or anything else may differ, but the "skeleton" of keywords (that is, the comparisons, loops, expressions, assignments, and function calls) will be the same. So don't drop the keywords: they are the scaffolding of a program.
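To make the skeleton idea concrete, here is a small token-level sketch. It is an assumption-laden illustration, not a compiler: `skeleton` and `SKELETON_KEYWORDS` are hypothetical names, identifiers become "V", integer literals become "N", and keywords and operators are kept as they are.

```ruby
# Hypothetical subset of control-flow keywords that form the skeleton.
SKELETON_KEYWORDS = %w[if else for while do switch case break continue return].freeze

# Map identifiers to "V" and integer literals to "N"; keep keywords and symbols.
def skeleton(source)
  source.scan(/[A-Za-z_]\w*|\d+|[^\s\w]/).map { |tok|
    case tok
    when /\A\d/ then "N"
    when /\A[A-Za-z_]/ then SKELETON_KEYWORDS.include?(tok) ? tok : "V"
    else tok
    end
  }.join
end

a = skeleton("for (i=0; i < 12345; i++) { array[i] = 54321; }")
b = skeleton("for (k=0; k < 999; k++) { table[k] = 7; }")
c = skeleton("j=0; while ( j < 12345) { foobar[j++] = 54321; }")

puts a        # for(V=N;V<N;V++){V[V]=N;}
puts a == b   # true
puts c        # V=N;while(V<N){V[V++]=N;}
```

Renaming variables and changing constants leaves the skeleton untouched, so `a` and `b` match exactly. The `while` variant still differs at the token level, which is why the compiler-style canonical form is suggested for catching that kind of rewrite.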
