Keyword-based nearest-neighbour algorithm or library
I am looking for a library, or an algorithm I can implement myself, for finding the k nearest neighbours of a webpage, where a webpage is represented as a set of keywords. I have already done the part where I extract the keywords.
It doesn't have to be very good, just good enough.
Can anyone suggest a solution, or where to start? I have looked through lectures by Yury Lifshits in the past, but I am hoping to get something ready-made if possible.
Java libraries preferred.
Comments (1)
As you said, you already have the keywords extracted from each page. I am assuming that you represent each document/page as a vector of word counts, so that the whole collection forms something like a document-term frequency matrix.
I guess the nearest neighbour of a page is ideally a page with similar contents. So you'd like to find documents in which the relative frequency of each word is similar to that in the page you are searching for. So first normalize the document-term matrix row by row, i.e. replace each occurrence count with that word's share of the total count in its row.
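The row normalization could look something like the sketch below. It assumes each page is stored as a sparse map from keyword to raw count; the class name `TermFrequency` and the method `normalize` are made up for illustration and do not come from any library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns one row of the document-term matrix into relative frequencies.
final class TermFrequency {

    /** Converts raw keyword counts for a single page into fractions that sum to 1. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> frequencies = new HashMap<>();
        if (total == 0) {
            return frequencies; // page with no keywords: leave the vector empty
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }
}
```

A sparse map is used here rather than a dense array because most pages contain only a small fraction of the overall vocabulary.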
Next you have to define a distance between two documents represented by these vectors. You can use the ordinary Euclidean distance or the Manhattan distance. However, for text documents the measure that usually works best is cosine similarity. Use whichever distance or similarity function suits your problem (remember that for nearest neighbours you want to minimize a distance, but maximize a similarity).
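As a rough illustration, cosine similarity over sparse keyword vectors might be written as follows. This is only a sketch: the `CosineSimilarity` class is a hypothetical name, and the vectors are assumed to be maps from keyword to weight.

```java
import java.util.Map;

// Hypothetical helper: cosine similarity between two sparse keyword vectors.
final class CosineSimilarity {

    /** Returns dot(a, b) / (|a| * |b|), or 0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map; keywords missing from the other map contribute 0.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Note that cosine similarity ignores vector length, so it gives the same value on raw counts as on relative frequencies; the row normalization matters mainly if you go with Euclidean or Manhattan distance instead.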
Once you have the vectors and your distance function in place, run a nearest-neighbour or k-nearest-neighbour algorithm over them.
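If "good enough" is the goal, a brute-force search may be all you need: score every page against the query and keep the top k. The sketch below assumes pages are held in memory as a map from a page id to its keyword vector; `BruteForceKnn` and `nearest` are illustrative names, not an existing library API, and the cosine helper is repeated so the example stands on its own.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical brute-force k-nearest-neighbour search using cosine similarity.
final class BruteForceKnn {

    /** Cosine similarity between two sparse keyword vectors (0 if either is all zeros). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, sqA = 0.0, sqB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sqA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double x : b.values()) {
            sqB += x * x;
        }
        if (sqA == 0.0 || sqB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(sqA) * Math.sqrt(sqB));
    }

    /** Returns the ids of the k pages most similar to the query, most similar first. */
    static List<String> nearest(Map<String, Double> query,
                                Map<String, Map<String, Double>> pages,
                                int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> page : pages.entrySet()) {
            double similarity = cosine(query, page.getValue());
            scored.add(new AbstractMap.SimpleEntry<String, Double>(page.getKey(), similarity));
        }
        // Similarity is maximized, so sort in descending order.
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            result.add(scored.get(i).getKey());
        }
        return result;
    }
}
```

This does one similarity computation per page for every query, which is usually fine for a modest collection. For a much larger one you would probably want an inverted index or an approximate method instead.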