Splitting identifiers to approximately match documentation

Posted on 2024-10-24 23:37:11

Different software projects follow different coding conventions; even within a single project, several languages may be in use, each with its own conventions. What is a good way to search documentation (which lives outside the source files) using identifier tokens from the source code?

For example, if the source contains self._def_passwd or this.defPasswrd, a query against the documentation tree should strive to match "default password".

So far I have been sorting candidates by Levenshtein distance. This works nicely for small edit distances, but when I raise the threshold there are too many false positives, which is especially problematic given the whitespace in documentation text.

8 0.666667 announcement getContent AnnouncementBean.java (Token.Name.Function)
8 0.666667 announcement getPercent DataObservation.java (Token.Name.Function)
8 0.666667 announcement GroupBean GroupBean.java (Token.Name.Class)

Here the first value is the Levenshtein distance and the second is that distance divided by the length of the matched word (8 / 12 for "announcement").
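For reference, the two numbers in each row can be reproduced with something like the following. This is only a minimal sketch of a plain unit-cost Levenshtein distance; the function names levenshtein and normalized are made up for illustration, and it assumes the comparison is case-sensitive, which the listing does not confirm.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def normalized(word: str, identifier: str) -> float:
    """Edit distance divided by the length of the documentation word."""
    return levenshtein(word, identifier) / len(word)


if __name__ == "__main__":
    for ident in ("getContent", "getPercent", "GroupBean"):
        print(levenshtein("announcement", ident),
              round(normalized("announcement", ident), 6),
              ident)
```

Because the ratio only divides by the word length, unrelated identifiers of similar length all end up with the same score, which is the false-positive problem described above.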
I'm thinking of the following:

  1. Looking into the Jaccard and Tanimoto coefficients
  2. IntelliSense/suggest-style code
  3. The sequence-matching algorithms used in bioinformatics; there were posts about these somewhere on SO
  4. Coming up with a chain of regular-expression rules based on http://en.wikipedia.org/wiki/Naming_convention_%28programming%29

The last one is literally the last resort. Which other algorithms do you think could give better results for this kind of problem?
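To make options 1 and 4 concrete, here is a rough sketch that first splits identifiers on common naming conventions (snake_case, camelCase, acronym runs) and then compares each piece to a documentation word with a Jaccard coefficient over character bigrams. The regular expression and the helper names split_identifier, bigrams and jaccard are assumptions for illustration, not rules taken from the linked Wikipedia page.

```python
import re

# One "word" inside an identifier: an acronym run (HTTP), a capitalised or
# lowercase word (Passwrd, def), or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier: str) -> list[str]:
    """Split on punctuation/underscores, then on case boundaries."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        tokens.extend(m.group(0).lower() for m in _WORD.finditer(part))
    return tokens


def bigrams(word: str) -> set[str]:
    """Set of adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(split_identifier("self._def_passwd"))  # ['self', 'def', 'passwd']
    print(split_identifier("this.defPasswrd"))   # ['this', 'def', 'passwrd']
    print(jaccard(bigrams("passwrd"), bigrams("password")))  # 0.625
```

Splitting first means each identifier piece is compared against a single documentation word, so whitespace in the documentation no longer inflates the distance.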

Comments (1)

長街聽風 2024-10-31 23:37:11

Try a weighted edit distance, in which you can encode knowledge of common abbreviations and of likely typing mistakes (for example, weighted by key distance on the keyboard). For instance, you can give zero weight to vowels such as [ao], so that password becomes almost identical to pswrd (only the doubled s remains). Another option is to build a word-level edit distance and use synonyms there. I have also built an EditDistance that works on words and characters simultaneously.
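A minimal sketch of the zero-weight-vowel idea might look like this. It is not the EditDistance implementation mentioned above; the function name weighted_distance, the vowel set, and the choice to make only vowel insertions and deletions free are assumptions for illustration, and inputs are assumed to be lowercase already.

```python
VOWELS = set("aeiou")  # assumed set; the answer only names [ao] as examples


def weighted_distance(a: str, b: str, vowel_cost: float = 0.0) -> float:
    """Edit distance where inserting or deleting a vowel costs vowel_cost."""
    def indel(ch: str) -> float:
        return vowel_cost if ch in VOWELS else 1.0

    # Cost of building each prefix of b from the empty string.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel(cb))

    for ca in a:
        curr = [prev[0] + indel(ca)]  # cost of deleting all of a so far
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else 1.0
            curr.append(min(prev[j] + indel(ca),      # delete ca
                            curr[j - 1] + indel(cb),  # insert cb
                            prev[j - 1] + sub))       # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_distance("default", "dflt"))    # 0.0
    print(weighted_distance("password", "pswrd"))  # 1.0 (the doubled s)
```

Keyboard-distance or abbreviation knowledge could be added the same way, by replacing the constant substitution cost with a lookup table.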
