使用子树查找相似的代码段

发布于 2024-11-01 08:01:10 字数 850 浏览 6 评论 0原文

我一直在阅读这篇题为 的论文使用抽象语法树进行克隆检测,作者:Ira D. Baxter 等人。我在下面转载了论文中的一段话:

原则上,寻找子树克隆 很简单:将每个子树与 每个其他子树都相等。在 实践中,出现了几个问题: 未遂克隆检测、亚克隆 和规模。 ...

当定位未遂事件时 克隆,在完整子树上进行散列 失败正是因为良好的散列 函数包括所有元素 树,从而对头发进行较小的排序 差异放入不同的桶中。我们 通过选择一个解决了这个问题 人为错误的哈希函数。该函数的特征必须是 这样的方式主要属性 人们想要找到未遂的克隆 被保留。侥幸逃脱的克隆是 通常通过复制和粘贴创建 程序之后是小 修改。这些修改 通常会对 与相关的树的形状 复制的一段代码。因此,我们 认为这种未遂事件 克隆通常只有一些不同 小子树。基于此 观察,一个哈希函数 忽略小子树是 不错的选择。 实验中 在这里,我们使用了哈希 函数只忽略 标识符名称(树中的叶子)。 因此我们的哈希函数将树 这是相似的模标识符 放入相同的哈希箱中 比较。

我正在尝试实现本文中讨论的技术,但一直试图理解这一段(不幸的是在本文的开头)。我理解该段落的内容,但作者没有提及要选择什么哈希函数或如何实际对 AST 进行哈希处理。有人可以从实现的角度用一个简单的例子来解释这一点吗?

I have been reading this paper titled Clone Detection using Abstract Syntax Trees by Ira D. Baxter et al. There is a paragraph from the paper that I reproduced below:

In principle, finding sub-tree clones
is easy: compare every subtree to
every other sub-tree for equality. In
practice, several problems arise:
near-miss clone detection, sub-clones
and scale.
...

When locating near-miss
clones, hashing on complete subtrees
fails precisely because good hashing
functions include all elements of the
tree, and thus sorts tress with minor
differences into different buckets. We
solved this problem by choosing an
artificially bad hash function. This function must be characterized in
such a way that the main properties
one wants to find on near-miss clones
are preserved. Near miss clones are
usually created by copy and paste
procedures followed by small
modifications. These modifications
usually generate small changes to the
shape of the tree associated with the
copied piece of code. Therefore, we
argue that this kind of near-miss
clone often have only some different
small sub-trees. Based on this
observation, a hash function that
ignores small sub-trees is a
goodchoice. In the experiment
presented here, we used a hash
function that ignores only the
identifier names (leaves in the tree).
Thus our hashing function puts trees
which are similar modulo identifiers
into the same hash bins for
comparison.

I am trying to implement the techniques discussed in this paper but am stuck in trying to understand this one paragraph (that is unfortunately at the beginning of the paper). I understand what the paragraph is saying but the authors do not mention what hash function to choose or how to actually hash the ASTs. Can someone please explain this with a simple example from an implementation standpoint?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

舟遥客 2024-11-08 08:01:11

作者本人应该回答的色调。 StackOverflow 不是很棒吗:-?

哈希函数的要点是,您选择哪一个并不重要,只要它将输入值均匀地分布在大量存储桶中即可。你需要一个可以应用于整棵树的哈希函数;通常的技术是以任何可能的方式序列化树(例如,通过有序树访问),然后将哈希函数应用于由此产生的值流(树节点)。 (这个想法来自于有关检测常见子表达式的编译器文献,这是原始 CloneDR 的灵感)。如果这还不清楚,您需要花费更多的精力来理解哈希函数如何应用于复杂的数据结构。关于散列的维基百科是一个很好的起点;如果这还不够,你需要找一本关于算法的书并学习。

您向哈希函数提供的内容取决于您。我在论文中提出的观点是,您可以计算忽略 AST 的标识符叶的哈希函数,这将导致具有相同结构但不同标识符的树哈希到同一桶。因此,具有相似模标识符的树很容易匹配,因为它们出现在同一个哈希桶中。

当然,整个克隆检测算法还有很多其他内容,只是匹配树模标识符。您需要担心匹配参数化序列(这是本文的重点)、报告结果,当然,无论您想要应用什么语言,您都需要一个高质量的语言解析器这个到。

您可以查看多种不同语言的 CloneDR 结果。

Shades that the author himself should answer. Isn't StackOverflow great :-?

The point of hash functions is that which one you choose doesn't matter, as long as it distributes input values evenly across a large number of buckets. You need a hash function that can be applied to the entire tree; the usual technique for such is to serialize the tree in any way possible (say, by an in-order tree visit) and then apply the hash function to the stream of values (tree nodes) this produces. (This idea is from the compiler literature on detecting common subexpressions, which was the inspiration for the original CloneDR). If this isn't clear, you need to spend more energy understanding how hash functions are applied to complex data structures. Wikipedia on hashing is a good place to start; if that's not enough, you need to find a book on algorithms and study up.

What you feed to the hash function is up to you. The point I made in the paper is that you can compute hash functions that ignore the identifier leaves of an AST, which will cause trees having the same identical structure but different identifiers to hash to the same bucket. Thus, trees which are similar modulo identifiers are easily matched, beause they occur in the same hash bucket.

Of course, there's a lot more to the whole clone detection algorithm that just matching trees modulo identifiers. You need to worry about matching parameterized sequences (which is sort of the big point in the paper), reporting the results, and of course you need a high-quality language parser for whatever language you care to apply this to.

You can see results of the CloneDR for a number of different languages.

冷…雨湿花 2024-11-08 08:01:11

如果您知道两个 AST 对于您的肉眼来说是“克隆”,您需要确保它们也具有相同的哈希值。

例如,将每个标识符哈希为一个常量,将每个字符串哈希为另一个常量,以避免被变量重命名所欺骗,而不是实际使用标识符名称作为哈希的实质部分。

或者对可交换的表达式使用交换哈希,即确保 a+b 和 b+a 获得相同的哈希值。

涉及变量、整数、运算符和括号的算术表达式示例:

 hash VariableName = 0x12345678
 hash IntegerConstant = 0xff77ff77
 hash x + y = (hash x) + (hash y)
 hash (x) = (hash x) <<< 13
 hash x * y = (hash x) xor (hash y)

等等。

If you know that two ASTs are "clones" to your human eye you want to make sure they have the same hash value also.

For example, hash every identifier to a constant and every string to another constant to avoid getting tricked by variable renaming, instead of actually using identifier name as material part of hashing.

Or use commutative hashing for expression that are commutative, I.e. make sure a+b and b+a get the same hash value.

Example for arithmetic expressions involving variables, integers, operators and parenthesis:

 hash VariableName = 0x12345678
 hash IntegerConstant = 0xff77ff77
 hash x + y = (hash x) + (hash y)
 hash (x) = (hash x) <<< 13
 hash x * y = (hash x) xor (hash y)

Etc.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文