Finding patterns in source code
If I wanted to learn about pattern recognition in general, what would be a good place to start? (Please recommend a book.)
Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)
Thanks
Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.
If you are reasonably confident with mathematics, either of Chris Bishop's books, "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition", is very good for learning about pattern recognition.
It helps if you have access to the parse tree generated during compilation. You can then look for pieces of the tree that are similar, ignoring any nodes deeper than the level you are looking at. For example, you can pick out nodes that multiply two sub-expressions together while ignoring the contents of those sub-expressions. The same logic applies to a collection of nodes: say you want to find a multiplication of two sub-expressions where each sub-expression is itself an addition of further sub-expressions. You first look for multiplications, then check whether the two nodes directly underneath each one are additions, ignoring anything deeper.
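The "multiplication of two additions" search described above can be sketched with Python's standard `ast` module, which exposes the parse tree directly. This is only an illustration under that assumption: the helper names `is_binop` and `find_products_of_sums` and the sample snippet are made up for this example, and a real tool would work on whatever tree your compiler or parser provides.

```python
import ast


def is_binop(node, op_type):
    """True if node is a binary operation using the given operator type."""
    return isinstance(node, ast.BinOp) and isinstance(node.op, op_type)


def find_products_of_sums(source):
    """Return line numbers of expressions shaped like (.. + ..) * (.. + ..).

    Only two levels of the tree are inspected: the multiply node and its two
    direct children. Whatever sits below the additions is ignored.
    """
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if (is_binop(node, ast.Mult)
                and is_binop(node.left, ast.Add)
                and is_binop(node.right, ast.Add)):
            lines.append(node.lineno)
    return sorted(lines)


# Hypothetical input: lines 2 and 4 match, line 3 does not
# (its left operand is a plain name, not an addition).
sample = """
area = (w + pad) * (h + pad)
scaled = k * (x + y)
total = (f(a) + g(b) ** 2) * (c + d[0])
"""

print(find_products_of_sums(sample))
```

Line 4 of the sample matches even though its operands contain calls, a power and a subscript, because nothing below the two additions is examined.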
I'd suggest looking at the code of some open source project (e.g. FindBugs or SIM)
that does the kind of thing you're talking about.
If you're working in one of the supported languages, IntelliJ IDEA has a really smart structural search and replace feature that would fit your problem.
Other interesting projects are PMD and Eclipse.
Eclipse uses ASTs (abstract syntax trees) for all source code in any project. Tools can then register for certain types of AST (such as Java source) and get a preprocessed view to which they can add information (links to documentation, error markers, etc.).
Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.
This is specific to .NET and Visual Studio, but it finds duplicate code in your project. In my experience it does report some false positives, but it could be a good place to start.
Clone Detective
One kind of pattern is code that has been cloned by copy and paste methods. See CloneDR for a tool that automatically finds such code in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.
CloneDR works with a variety of languages: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone detection reports for a variety of programming languages.
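To give a feel for why comparing syntax trees is robust to layout and renaming, here is a minimal sketch using Python's standard `ast` module. It is not CloneDR's algorithm. It is a deliberately simplified stand-in that treats two functions as clones only when their trees are identical after every identifier and literal is replaced by a placeholder. The names `_Normalize`, `find_clones` and the sample functions are hypothetical.

```python
import ast
import copy
from collections import defaultdict


class _Normalize(ast.NodeTransformer):
    """Blank out identifiers and literals so only the tree shape remains."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Constant(self, node):
        node.value = 0
        return node


def find_clones(source):
    """Group function names whose normalized ASTs are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            shape = copy.deepcopy(node)
            shape.name = "_"
            _Normalize().visit(shape)
            # ast.dump omits line/column attributes by default, so layout
            # differences do not affect the fingerprint.
            groups[ast.dump(shape)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical input: the first two functions differ in names, spacing,
# a comment and a literal, but have the same structure.
sample = '''
def total_price(items):
    result = 0
    for item in items:
        result += item.price * 2
    return result

def sum_cost(products):
    acc = 0
    for p in products:
        acc   +=   p.price * 3   # pasted and tweaked
    return acc

def count(items):
    return len(items)
'''

print(find_clones(sample))
```

Exact matching after normalization misses clones where statements were added or removed. Handling those near-miss cases is where dedicated tools earn their keep.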