Identifying the most important document
I have a set of text documents about Java. I need to identify the most important document programmatically, the way an expert would.
For example, given 10 books on Java, the system should identify Java Complete Reference as the most important or most relevant document (based on its similarity to the Wikipedia page about Java).
One method would be to take a reference document and compute the similarity between it and each document in the set (as in the example above), then report the document with the highest similarity as the most important one.
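To make the reference-document approach concrete, here is a minimal sketch that assumes TF-IDF weighting with cosine similarity as the similarity measure (the question does not name one). The function names and the tiny sample texts are made up for illustration; a real run would use the full Wikipedia page and the full book texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    # Document frequency: in how many texts each term occurs.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF, so terms that occur in every text keep a small weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_reference(reference, documents):
    """Rank document names by similarity to the reference text, best first."""
    names = list(documents)
    vectors = tfidf_vectors([reference] + [documents[n] for n in names])
    ref_vec = vectors[0]
    scores = {name: cosine(ref_vec, vec) for name, vec in zip(names, vectors[1:])}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data standing in for the reference page and the books.
reference = ("Java is a class-based, object-oriented programming language "
             "that runs on the Java virtual machine")
documents = {
    "complete_reference": "Java language reference covering classes, objects, and the virtual machine",
    "cookbook": "Recipes for pasta, soup, and bread",
    "web_guide": "Building web pages with HTML and CSS",
}
ranking = rank_by_reference(reference, documents)
print(ranking[0][0])
```

The first entry of `ranking` is the document most similar to the reference. For a large collection, a library implementation of TF-IDF would likely be faster than this pure-Python version.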
I would like to find other, more efficient methods of doing this.
Please suggest other methods for finding the relevant document (in an unsupervised way if possible).
Comments (1)
I think another mechanism would be to keep, for each document, a dictionary of keywords with an associated ranking.
For example, for the Java Complete Reference book, there would be a dictionary of keywords and their rankings:
Java-10
J2ee-5
J2SDK-10
Java5-10, etc.
Note: If your documents are dynamic streams and their names are also dynamic, I am not sure how to handle that.
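One possible reading of the keyword-ranking idea, sketched below: each document carries a map from keyword to rank, and a document's score for a topic is the sum of the ranks of the topic terms it contains. The scoring rule, the function names, the second document, and the topic list are all assumptions for illustration; only the first keyword map comes from the example above.

```python
# Per-document keyword rankings. The first map copies the example above;
# the second document and its ranks are hypothetical.
keyword_ranks = {
    "Java Complete Reference": {"java": 10, "j2ee": 5, "j2sdk": 10, "java5": 10},
    "Hypothetical Web Guide": {"java": 2, "html": 10},
}


def score(ranks, topic_terms):
    """Sum the ranks of the topic terms that appear in a document's map."""
    return sum(ranks.get(term.lower(), 0) for term in topic_terms)


def most_important(all_ranks, topic_terms):
    """Return the name of the document with the highest topic score."""
    return max(all_ranks, key=lambda name: score(all_ranks[name], topic_terms))


# Hypothetical topic terms describing what "important for Java" means.
topic = ["Java", "J2EE", "J2SDK"]
best = most_important(keyword_ranks, topic)
print(best)
```

How the ranks themselves are assigned is left open here, as it is in the answer; they could be set by hand or derived from term statistics.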