Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.
Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.
How do you go about finding the possibly related documents?
I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.
edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.
You should read the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications" (ISBN 0596529325)!
For some method and code: First ask yourself, whether you want to find direct similarities based on word matches, or whether you want to show similar articles that may not directly relate to the current one, but belong to the same cluster of articles.
See Cluster analysis / Partitional clustering.
A very simple (but theoretical and slow) method for finding direct similarities would be:
Preprocess:
int word_matches[narticles][narticles]
(You should not store it like that: the similarity of A->B is the same as B->A, so a sparse matrix saves almost half the space.)

Find similar articles:
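The preprocessing the answer sketches can be fleshed out as follows. This is a hypothetical Python sketch (the function names and the tiny stop-word list are my own), using a dict keyed on (i, j) pairs as the sparse half-matrix the answer suggests:

```python
from itertools import combinations

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def word_set(text):
    """Lowercase, split on whitespace, strip trailing punctuation, drop stop words."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOP_WORDS

def build_match_counts(articles):
    """Count shared words for every pair; store only i < j (the sparse half)."""
    sets = [word_set(a) for a in articles]
    matches = {}
    for i, j in combinations(range(len(articles)), 2):
        matches[(i, j)] = len(sets[i] & sets[j])
    return matches

def related(matches, i, top_n=3):
    """Articles sharing the most words with article i, best first."""
    scores = []
    for (a, b), count in matches.items():
        if i in (a, b):
            other = b if a == i else a
            scores.append((count, other))
    return [idx for _, idx in sorted(scores, reverse=True)[:top_n]]
```

As the answer notes, this is O(n^2) in the number of articles, which is fine for a blog but slow at scale.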
This is a typical case of Document Classification, which is studied in every machine learning class. If you like statistics, mathematics and computer science, I recommend that you have a look at unsupervised methods like kmeans++, Bayesian methods and LDA. In particular, Bayesian methods are pretty good at what you're looking for; their only problem is being slow (but unless you run a very large site, that shouldn't bother you much).
For a more practical and less theoretical approach, I recommend that you have a look at this and these other great code examples.
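To illustrate the clustering route this answer points at, here is a minimal Python sketch of plain k-means over term-count vectors. Note the hedges: this uses plain random initialization rather than the kmeans++ seeding the answer names, and all function names are my own:

```python
import random
from collections import Counter

def vectorize(docs, vocab):
    """One term-count vector per document over a fixed vocabulary."""
    counts = [Counter(d.split()) for d in docs]
    return [[c[w] for w in vocab] for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    """Plain k-means: random initial centers (not the kmeans++ seeding)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        # Update step: move each center to the mean of its cluster.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Articles that land in the same cluster as the current one become the "related" candidates.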
A small vector-space-model search engine in Ruby. The basic idea is that two documents are related if they contain the same words. So we count the occurrence of words in each document and then compute the cosine between these vectors (each term has a fixed index; if it appears there is a 1 at that index, if not a zero). The cosine will be 1.0 if two documents have all terms in common, and 0.0 if they have no common terms. You can directly translate that to % values.
The definition of Array#cosine is left as an exercise to the reader (it should deal with nil values and different lengths, but we've got Array#zip for that, right?)
BTW, the example documents are taken from the SVD paper by Deerwester et al. :)
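Since the Ruby listing itself isn't shown above, here is a rough Python sketch of the same vector-space idea: build a term-count vector per document and compare documents by the cosine of those vectors. The names are mine, not the original code's, and the sample phrases merely echo the Deerwester-style terms:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine between two term-count dicts: 1.0 for identical vectors, 0.0 for no overlap."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(doc1, doc2):
    """Similarity of two raw-text documents via their term-count vectors."""
    return cosine(Counter(doc1.lower().split()), Counter(doc2.lower().split()))
```

Multiplying the result by 100 gives the "% related" value the answer mentions.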
Some time ago I implemented something similar. Maybe this idea is now outdated, but I hope it can help.
I ran an ASP 3.0 website for common programming tasks and started from this principle: users have a question and will stay on the site as long as they can find interesting content on that subject.
When a user arrived, I started an ASP 3.0 Session object and recorded all user navigation, just like a linked list. At the Session.OnEnd event, I took the first link, looked for the next link, and incremented a counter column. So, to check related articles I just had to list the top n NextPage entities, ordered by the counter column descending.
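A rough Python sketch of the same bookkeeping, with hypothetical names (the original used an ASP Session object and a database counter column rather than an in-memory counter):

```python
from collections import Counter

class NavigationTracker:
    """Record each visitor's page trail; on session end, count page -> next-page hops."""

    def __init__(self):
        self.transitions = Counter()  # (page, next_page) -> number of times taken

    def session_end(self, trail):
        """Walk the linked-list-like trail and increment each transition's counter."""
        for page, nxt in zip(trail, trail[1:]):
            self.transitions[(page, nxt)] += 1

    def related(self, page, top_n=5):
        """Pages most often visited right after `page`, best first."""
        scores = [(count, nxt) for (p, nxt), count in self.transitions.items() if p == page]
        return [nxt for _, nxt in sorted(scores, reverse=True)[:top_n]]
```

This is collaborative ("people who read X went on to read Y") rather than content-based, so it needs traffic before it recommends anything useful.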
This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:
The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
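To make the stemming step concrete, here is a deliberately crude suffix-stripping sketch in Python. This is a toy: a real system would use the Porter or Snowball stemmer from one of the implementations the Wikipedia page links to, not a hand-rolled rule list like this:

```python
def crude_stem(word):
    """Toy stemmer: strip one common suffix, keeping at least a 3-letter root.

    Handles the examples from the text ("cats" -> "cat", "bob's" -> "bob",
    "similarly" -> "similar") but nothing like the full Porter rule set.
    """
    for suffix in ("'s", "ings", "ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

You would run every token through this before building the {word: wordcount} bag, so "cat" and "cats" count as the same term.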
If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
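Extracting the n-grams themselves is straightforward; a minimal Python sketch (naming is my own):

```python
def ngrams(text, n=2):
    """Word n-grams as tuples; the document vector is then built over these
    pairs or triples instead of over single words."""
    words = text.lower().split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
```

The resulting tuples replace single words as the keys of the bag, which is why the vectors get so much larger.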
Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
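A toy Python sketch of the concept-mining idea, with a tiny hand-written table standing in for WordNet (the word-to-concept mapping and all names here are illustrative assumptions, not real WordNet data):

```python
# Toy stand-in for a thesaurus such as WordNet: word -> concept id.
THESAURUS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "cat": "animal", "dog": "animal", "kitten": "animal",
}

def concepts(text):
    """Map each known word to its concept; unknown words stand for themselves."""
    return {THESAURUS.get(w, w) for w in text.lower().split()}

def concept_overlap(doc1, doc2):
    """Jaccard similarity over concept sets rather than raw words."""
    a, b = concepts(doc1), concepts(doc2)
    return len(a & b) / len(a | b) if a | b else 0.0
```

This is the reductive mapping the answer describes: "car" and "automobile" collapse into one concept, so documents can match even with zero shared words.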
Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.
These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.