How do I compute the similarity between two license.txt files?
I would like to compute the similarity between license text files so that, given a license.txt, I can identify which license it corresponds to. What kind of information retrieval technique should I use? I once implemented tf-idf, but I am not sure whether it is applicable here. What do you suggest?
Comments (2)
I've been working on this problem for more than three years, and let me tell you, it is far from trivial. You are not going to solve it with a single algorithm, let alone with tf-idf and cosine similarity.
There are a number of challenges, and I have written about some of them.
You will end up using a combination of approaches; unfortunately, there is no silver bullet.
You can use Lucene to index all the licenses as documents (each Lucene document is one license). When you have a new license.txt and want to check which license it corresponds to, you can query Lucene using the whole license.txt as the query.
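The index-then-query idea can be sketched without Lucene. The snippet below is a plain-Python stand-in, not Lucene's API or its scoring formula: it builds TF-IDF vectors for a few known licenses and ranks them by cosine similarity against an unknown text. The license snippets, the smoothed IDF formula, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, heavily shortened stand-ins for the full license texts.
KNOWN_LICENSES = {
    "MIT": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "Apache-2.0": "Licensed under the Apache License, Version 2.0; you may not "
                  "use this file except in compliance with the License",
    "GPL-3.0": "This program is free software: you can redistribute it and/or "
               "modify it under the terms of the GNU General Public License",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Weight raw term counts by IDF; terms unseen at index time are ignored."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """Return (idf, vectors): a smoothed IDF per term and one TF-IDF vector per document."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    # Smoothed IDF (an assumption); other variants work as well.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = {name: tfidf_vector(tokens, idf) for name, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_licenses(query_text, idf, vectors):
    """Return (name, score) pairs sorted from most to least similar."""
    query = tfidf_vector(tokenize(query_text), idf)
    scores = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


idf, vectors = build_index(KNOWN_LICENSES)
unknown = "Permission is hereby granted, free of charge, to any person obtaining a copy"
for name, score in rank_licenses(unknown, idf, vectors):
    print(f"{name}: {score:.3f}")
```

With real license texts you would index the complete files and treat the top score as a candidate rather than a verdict, since near-identical licenses can score very close to each other.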
Querying Lucene this way relies on TF-IDF and the rest of the standard IR machinery. You could also use something more specific to the problem, such as checking for specific keywords.
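A keyword check of that kind might look like the following minimal sketch. The marker phrases and function names are hypothetical; a usable rule set would need far more phrases, validated against the full license texts.

```python
import re

# Hypothetical marker phrases, chosen only for illustration.
LICENSE_MARKERS = {
    "MIT": ["permission is hereby granted, free of charge"],
    "Apache-2.0": ["apache license", "version 2.0"],
    "GPL-3.0": ["gnu general public license", "version 3"],
}


def normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not break phrase matches."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def match_by_keywords(text, markers=LICENSE_MARKERS):
    """Return the names of licenses whose marker phrases all occur in the text."""
    normalized = normalize(text)
    return [name for name, phrases in markers.items()
            if all(phrase in normalized for phrase in phrases)]


sample = """Apache License
Version 2.0, January 2004"""
print(match_by_keywords(sample))  # prints ['Apache-2.0']
```

A check like this can return no match or several matches, so it is probably best used alongside a similarity ranking rather than in place of one.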