Any tutorial or code for TF-IDF in Java
I am looking for a simple Java class that can compute tf-idf. I want to do a similarity test on 2 documents. I found so many big APIs that use a tf-idf class, but I do not want to use a big jar file just to do my simple test. Please help!
Or at least, if someone can tell me how to find TF and IDF, I will calculate the results :)
OR
If you can tell me some good Java tutorial for this.
Please do not tell me to look on Google; I already did for 3 days and couldn't find anything :(
Please also do not refer me to Lucene :(
3 Answers
Term Frequency is the square root of the number of times a term occurs in a particular document.
Inverse Document Frequency is the log of (the total number of documents divided by the number of documents containing the term), adding one to the containing-document count in case the term occurs zero times -- if it does, obviously don't try to divide by zero.
If it isn't clear from that answer, there is a TF per term per document, and an IDF per term.
And then TF-IDF(term, document) = TF(term, document) * IDF(term)
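In code, those definitions might look like the following minimal sketch (names are illustrative; the zero-occurrence guard here reads the "plus one" above as clamping the denominator so it never hits zero):

```java
// Minimal sketch of the TF/IDF definitions above; names are illustrative.
class TfIdf {
    static double tf(int occurrencesInDoc) {
        // Square root of the number of times the term occurs in the document.
        return Math.sqrt(occurrencesInDoc);
    }

    static double idf(int totalDocs, int docsContainingTerm) {
        // Log of (total documents / documents containing the term),
        // guarding against division by zero when the term occurs in no document.
        return Math.log((double) totalDocs / Math.max(1, docsContainingTerm));
    }

    static double tfIdf(int occurrencesInDoc, int totalDocs, int docsContainingTerm) {
        return tf(occurrencesInDoc) * idf(totalDocs, docsContainingTerm);
    }
}
```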
Finally, you use the vector space model to compare documents, where each term is a new dimension and the "length" of the part of the vector pointing in that dimension is the TF-IDF calculation. Each document is a vector, so compute the two vectors and then compute the distance between them.
So to do this in Java, read the file in, one line at a time, with a FileReader or something, and split on spaces or whatever other delimiters you want to use - each word is a term. Count the number of times each term appears in each file, and the number of files each term appears in. Then you have everything you need to do the above calculations.
And since I have nothing else to do, I looked up the vector distance formula. Here you go:

D = sqrt((x2 - x1)^2 + (y2 - y1)^2 + ...)

For this purpose, x1 is the TF-IDF for term x in document 1 and x2 is the TF-IDF for term x in document 2; there is one squared difference per term.
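A minimal sketch of that distance in Java, assuming each document's TF-IDF vector is stored as a map from term to weight (terms missing from a map count as zero):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class VectorDistance {
    // Euclidean distance between two documents' TF-IDF vectors.
    static double distance(Map<String, Double> doc1, Map<String, Double> doc2) {
        Set<String> allTerms = new HashSet<>(doc1.keySet());
        allTerms.addAll(doc2.keySet());
        double sum = 0.0;
        for (String term : allTerms) {
            double x1 = doc1.getOrDefault(term, 0.0);
            double x2 = doc2.getOrDefault(term, 0.0);
            sum += (x2 - x1) * (x2 - x1);   // one squared difference per term
        }
        return Math.sqrt(sum);
    }
}
```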
Edit: in response to your question about how to count the words in a document:

1. Open the file with new BufferedReader(new FileReader(filename)) - you can call BufferedReader.readLine() in a while loop, checking for null each time.
2. For each line, call line.split("\\s") - that will split your line on whitespace and give you an array of all of the words.
3. For each word, add 1 to that word's count for the current document, and to the count of documents containing it if this is its first occurrence in the document; a HashMap works well for this.

Now, after computing D for each document, you will have X values where X is the number of documents. Comparing all documents against each other means doing only X^2 comparisons - this shouldn't take particularly long for 10,000 documents. Remember that two documents are MORE similar if the absolute value of the difference between their D values is lower. So then you could compute the difference between the Ds of every pair of documents and store that in a priority queue or some other sorted structure such that the most similar documents bubble up to the top. Make sense?
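A minimal sketch of those three counting steps (assuming whitespace tokenization; method and variable names are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class TermCounter {
    // Returns the per-document term counts for one file, and bumps each
    // term's document frequency in the shared docFrequency map.
    static Map<String, Integer> countTerms(String filename,
                                           Map<String, Integer> docFrequency) throws IOException {
        Map<String, Integer> termCounts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {   // check for null each time
                for (String word : line.split("\\s+")) {    // split on whitespace
                    if (word.isEmpty()) continue;
                    termCounts.merge(word, 1, Integer::sum);
                }
            }
        }
        // Each term appearing in this document adds 1 to its document frequency.
        for (String term : termCounts.keySet()) {
            docFrequency.merge(term, 1, Integer::sum);
        }
        return termCounts;
    }
}
```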
agazerboy, Sujit Pal's blog post gives a thorough description of calculating TF and IDF.
WRT verifying results, I suggest you start with a small corpus (say 100 documents) so that you can easily see whether you are correct. For 10,000 documents, using Lucene begins to look like a really rational choice.
While you specifically asked not to be referred to Lucene, allow me to point you to the exact class. The class you are looking for is DefaultSimilarity. It has an extremely simple API to calculate TF and IDF. See the Java code here, or just implement it yourself as specified in the DefaultSimilarity documentation:

tf(t in d) = sqrt(frequency)

idf(t) = 1 + log(numDocs / (docFreq + 1))

The log and sqrt functions are used to damp the actual values. Using the raw values can skew results dramatically.
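If you want those two formulas without pulling in the Lucene jar at all, here is a minimal plain-Java sketch of them, based on the DefaultSimilarity documentation:

```java
// Plain-Java sketch of the DefaultSimilarity tf/idf formulas above.
class DefaultSimilaritySketch {
    static float tf(float freq) {
        // sqrt damps raw term frequency.
        return (float) Math.sqrt(freq);
    }

    static float idf(long docFreq, long numDocs) {
        // log damps the document-frequency ratio; +1 in the denominator
        // avoids division by zero, +1 outside keeps idf positive.
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }
}
```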