Splitting identifiers to approximately match documentation

Posted 2024-10-24 23:37:11

Different software projects have different coding conventions; even within a single project, several languages may be in use, each with its own conventions. What is a good way to search documentation (which lives outside the source files) using identifier tokens from the source code?

For example, if the source has self._def_passwd or this.defPasswrd, a query on the documentation tree should strive to match "default password".
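To make the example concrete, here is a minimal sketch of the splitting step, assuming identifiers are separated by dots and underscores and use camelCase inside each part. The helper name split_identifier is hypothetical, not something from an existing library.

```python
import re

# One word token: an all-caps run (e.g. "HTTP"), a capitalized or
# lowercase word (e.g. "Passwrd", "def"), or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier: str) -> list[str]:
    """Split on non-alphanumeric separators and camelCase boundaries.

    Hypothetical helper for illustration; returns lowercase tokens.
    """
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        tokens.extend(m.group(0).lower() for m in _WORD.finditer(part))
    return tokens


print(split_identifier("self._def_passwd"))  # ['self', 'def', 'passwd']
print(split_identifier("this.defPasswrd"))   # ['this', 'def', 'passwrd']
```

Each token (def, passwd) can then be compared against individual documentation words rather than against a whole phrase, which sidesteps the white-space problem. Receiver names such as self and this would probably need to be dropped with a stop list.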

So far I've been sorting candidates by Levenshtein distance. That works nicely for small edit distances, but there are too many false positives when I increase the threshold, which is especially problematic given the white space in documentation text.

8  0.666667  announcement  getContent  AnnouncementBean.java  (Token.Name.Function)
8  0.666667  announcement  getPercent  DataObservation.java   (Token.Name.Function)
8  0.666667  announcement  GroupBean   GroupBean.java         (Token.Name.Class)

where the first value is the Levenshtein distance and the second is that distance divided by the length of the word matched.
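For reference, the two scores can be sketched as follows. This is a plain dynamic-programming Levenshtein distance, not necessarily the implementation that produced the output above; the function names are hypothetical, and the comparison is assumed to be case-sensitive.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_distance(doc_word: str, identifier: str) -> float:
    """Distance divided by the length of the documentation word."""
    return levenshtein(doc_word, identifier) / len(doc_word)


print(levenshtein("passwrd", "password"))         # 1
print(normalized_distance("password", "passwd"))  # 0.25
```

Abbreviations like passwd stay close to their full word, but once the threshold is loose enough to admit unrelated pairs, the ranking can no longer tell the two cases apart.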
I'm thinking of the following:

Look into the Jaccard and Tanimoto coefficients
IntelliSense/suggest-style code
Somewhere on SO there were posts about algorithms that bioinformatics people use for matching sequences
Come up with chained regular-expression rules based on http://en.wikipedia.org/wiki/Naming_convention_%28programming%29

The last one is literally the last resort. Which other algorithms do you think could give better results for this kind of thing?
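As a rough idea of what the Jaccard option might look like, here is a sketch that compares sets of character bigrams. The choice of bigrams and the helper names are assumptions made for illustration. For plain sets like these, the Tanimoto coefficient is the same quantity as the Jaccard index.

```python
def char_ngrams(word: str, n: int = 2) -> set[str]:
    """Lowercased character n-grams of a word (bigrams by default)."""
    word = word.lower()
    if len(word) < n:
        return {word} if word else set()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# An abbreviation shares most of its bigrams with the full word ...
print(jaccard(char_ngrams("passwd"), char_ngrams("password")))  # 0.5
# ... while an unrelated pair from the output above shares only two.
print(jaccard(char_ngrams("announcement"), char_ngrams("getContent")) < 0.2)  # True
```

Unlike a raw edit-distance threshold, this score is bounded between 0 and 1 and does not depend on character positions, so it may separate abbreviations from unrelated words more cleanly. That would need checking against a real documentation tree.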
