Incremental clustering algorithm for grouping news articles?
I'm doing a little research on how to cluster articles into 'news stories' à la Google News.
Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words more if they're in certain parts of the article (e.g. the headline), and then to use something like a k-means algorithm to cluster the articles.
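For concreteness, here's a minimal sketch of that baseline using scikit-learn. The articles are made up, and repeating the headline is just one crude way of up-weighting its terms:

```python
# Baseline sketch: TF-IDF vectors with up-weighted headline terms,
# clustered with k-means. The article data here is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    {"headline": "Fed raises interest rates", "body": "The central bank announced..."},
    {"headline": "Team wins championship", "body": "Fans celebrated downtown..."},
]

# Crude headline boost: repeat the headline so its terms get more weight.
texts = [(a["headline"] + " ") * 3 + a["body"] for a in articles]

vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)

k = 2  # the problem: k has to be fixed in advance
labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
print(labels)
```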
But this leads to a few questions:

With k-means, how do you know in advance what k should be? In a dynamic news environment the number of stories can vary a great deal, and you won't know in advance how many stories a collection of articles represents.

With hierarchical clustering algorithms, how do you decide which clusters to use as your stories? You'll have clusters at the bottom of the tree that are just single articles, which you obviously won't want to use, and a cluster at the root of the tree that contains all of the articles, which again you won't want...but how do you know which of the clusters in between should represent stories? (See the threshold-cut sketch after these questions for what I mean.)

Finally, with either k-means or hierarchical algorithms, most of the literature I've read seems to assume you have a preset collection of documents to cluster, and that they all get clustered at once. But what about a situation where new articles keep coming in? Do you have to re-cluster everything from scratch just because there's one more article? This is why I'm wondering whether there are approaches that let you 'add' articles as you go without re-clustering from scratch; I can't imagine doing it from scratch each time is very efficient.
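Here is the kind of threshold cut the second question refers to, sketched with SciPy on stand-in vectors; the threshold value is exactly the arbitrary choice being asked about:

```python
# Hierarchical clustering gives a tree; one common way to get "stories"
# out of it is to cut the tree at a distance threshold. The vectors and
# the threshold below are stand-ins, not real values.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

vectors = np.random.rand(20, 50)                 # stand-in article vectors
tree = linkage(pdist(vectors, metric="cosine"),  # condensed distance matrix
               method="average")

labels = fcluster(tree, t=0.5, criterion="distance")  # cut at distance 0.5
print(labels)  # one cluster id per article; but how do you choose t?
```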
2 Answers
I worked at a start-up that built exactly this: an incremental clustering engine for news articles. We based our algorithm on this paper: Web Document Clustering Using Document Index Graph (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851). It worked well for us at 10K articles/day.
It has two main advantages:
1) It's incremental, which addresses the problem of dealing with a stream of incoming articles rather than clustering everything at once (see the sketch after this list).
2) It uses phrase-based modeling, as opposed to just "bag of words", which results in much higher accuracy.
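For a rough idea of the incremental part only, here is a simplified sketch. It uses plain cosine similarity on precomputed article vectors rather than the paper's phrase-based document index graph, and the similarity threshold is a made-up value:

```python
# Simplified incremental clustering: each new article joins the most
# similar existing cluster, or starts a new one if nothing is close enough.
# (This does NOT reproduce the paper's document index graph.)
import numpy as np

SIM_THRESHOLD = 0.3  # made-up value; tune on real data

clusters = []  # each cluster holds a centroid vector and its article ids

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def add_article(article_id, vector):
    best, best_sim = None, SIM_THRESHOLD
    for c in clusters:
        sim = cosine(vector, c["centroid"])
        if sim > best_sim:
            best, best_sim = c, sim
    if best is None:
        clusters.append({"centroid": vector.copy(), "ids": [article_id]})
    else:
        n = len(best["ids"])
        best["centroid"] = (best["centroid"] * n + vector) / (n + 1)  # running mean
        best["ids"].append(article_id)
```

The key property is that adding an article only touches the existing centroids, so nothing is re-clustered from scratch.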
A Google search pops up http://www.similetrix.com; they might have what you're looking for.
I would do a search for adaptive k-means clustering algorithms. There is a good body of research devoted to the problems you describe. Here is one such paper (pdf).
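One simple adaptive strategy (not necessarily what that paper proposes) is to re-fit k-means over a range of k and keep whichever scores best on an internal measure such as silhouette. A sketch:

```python
# Pick k by silhouette score instead of fixing it in advance.
# The vectors are stand-ins for real article vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

vectors = np.random.rand(100, 50)

best_k, best_score = None, -1.0
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
    score = silhouette_score(vectors, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```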