How do I compute the similarity between two license.txt files?

Posted 2024-12-27 12:51:28

I would like to compute the similarity between license text files so that, given a license.txt, I can identify which license it corresponds to. What kind of information retrieval technique should I use? I once implemented tf-idf, but I am not sure whether it is applicable here. What do you suggest?

Comments (2)

岁月蹉跎了容颜 2025-01-03 12:51:28

I've been working on this problem for more than three years, and let me tell you, it is far from trivial. You are not going to solve it with a single algorithm, let alone with tf-idf and cosine similarity.

There are a number of challenges. Here are some of them:

  1. Similar license texts (agpl/gpl/lgpl, bsd/apache1.1/openssl, mit/isc/curl) are extremely difficult to disambiguate and have an extremely high cosine similarity (unless you are very smart about feature selection, maybe...).
  2. The same applies to different versions of the same license (lgpl 2.0/2.1).
  3. LICENSE.TXT files often contain multiple licenses.
  4. bsd notices are very hard to catch, i.e. you have the same text, except for the rights holder.

You will end up using a combination of approaches; unfortunately, there is no silver bullet.
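As an illustration of challenge 4, one possible ingredient of such a combination is to mask the copyright line before comparing, so that two notices differing only in the rights holder compare as identical. This is only a minimal sketch, not the approach the answer author actually uses: the `COPYRIGHT_LINE` pattern and the `normalize` and `similarity` helpers are hypothetical names, and the word-level comparison relies on `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Assumption: the rights holder appears on a line that starts with "Copyright".
# Wrapped lines that happen to start with that word are masked as well.
COPYRIGHT_LINE = re.compile(r"^[ \t]*copyright\b.*$", re.IGNORECASE | re.MULTILINE)


def normalize(text: str) -> str:
    """Mask copyright lines, lowercase the text, and collapse whitespace."""
    text = COPYRIGHT_LINE.sub("<copyright>", text)
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio in [0, 1] computed on the normalized texts."""
    words_a = normalize(a).split()
    words_b = normalize(b).split()
    # autojunk=False keeps frequent words in play on long license texts.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
```

Because the comparison is order-sensitive, it reacts to inserted or reordered clauses that a bag-of-words cosine score would largely ignore. It does nothing for files that contain several licenses, which would need to be split first.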

时光倒影 2025-01-03 12:51:28

You can use Lucene to index all the licenses as documents (each Lucene document is one license). When you have a new license.txt and want to check which license it corresponds to, you can simply query Lucene using the whole license.txt as the query.

That would be using TF-IDF and all the usual IR machinery. But you could also use something more specific to the problem, like checking for specific keywords.
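The idea of using the whole file as a query can be sketched without Lucene. The following is a plain-Python stand-in that ranks reference licenses by TF-IDF cosine similarity; it is not Lucene's API or its scoring formula. The function names, the smoothed idf variant, and the `TOY_LICENSES` excerpts are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Toy data: short opening excerpts, not the full license texts.
TOY_LICENSES = {
    "MIT": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "BSD": "Redistribution and use in source and binary forms, with or "
           "without modification, are permitted",
    "GPL": "This program is free software: you can redistribute it and/or "
           "modify it under the terms of the GNU General Public License",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(licenses):
    """Return (idf, vectors) for a dict mapping license name to text."""
    docs = {name: Counter(tokenize(text)) for name, text in licenses.items()}
    n = len(docs)
    df = Counter(term for counts in docs.values() for term in counts)
    # Smoothed idf: terms found in every license keep a small positive weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    vectors = {
        name: {term: count * idf[term] for term, count in counts.items()}
        for name, counts in docs.items()
    }
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_text, idf, vectors):
    """Score every indexed license against the query, best match first."""
    counts = Counter(tokenize(query_text))
    # Terms never seen in the index carry no weight and are dropped.
    query = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(cosine(query, vec), name) for name, vec in vectors.items()]
    return sorted(scored, reverse=True)
```

Calling `build_index(TOY_LICENSES)` once and then `rank(open("license.txt").read(), idf, vectors)` returns `(score, name)` pairs sorted by score. A keyword check could be layered on top as a tie-breaker when the two best scores are close.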
