I am planning an application that groups short messages/tweets into clusters by topic. The set of topics will be limited, for example Sports [NBA, NFL, Cricket, Soccer], Entertainment [movies, music], and so on.
I can think of two approaches:
- Ask users to tag their messages, the way Stack Overflow has users tag questions. Users select tags from a predefined list, and on the server side I cluster the messages by tag.
Pros: Simple design. Less complexity in the code.
Cons: Users' choices are restricted to the predefined tags.
The clusters are not dynamic. If a new event occurs, the predefined tags will miss it.
- Take each message, remove the stopwords [predefined in a dictionary], stem the remaining words, and apply a clustering algorithm to the stemmed messages to form clusters. A cluster is displayed according to its popularity and stays visible for as long as it remains popular [many messages/minute]. New messages are scanned and assigned to the matching cluster.
Pros: Dynamic clustering based on the popularity of the event or incident.
Cons: Increased complexity. More server resources required.
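To make the second approach concrete, here is a minimal sketch of that pipeline in plain Python. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` set, the crude suffix-stripping `stem` (standing in for a real stemmer), the greedy Jaccard-similarity assignment (standing in for whatever clustering algorithm is eventually chosen), and the `threshold` and `window` values.

```python
import re
from collections import deque

# Hypothetical, tiny stopword list; a real system would load a full dictionary.
STOPWORDS = {"the", "a", "an", "is", "in", "on", "of", "to", "and", "for", "at", "was", "with"}


def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(message):
    """Lowercase, split into words, drop stopwords, and stem what remains."""
    words = re.findall(r"[a-z0-9]+", message.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Share of terms the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class Cluster:
    def __init__(self, terms, timestamp):
        self.terms = set(terms)
        self.timestamps = deque([timestamp])

    def add(self, terms, timestamp):
        self.terms |= terms
        self.timestamps.append(timestamp)

    def rate(self, now, window=60.0):
        """Number of messages received in the last `window` seconds."""
        while self.timestamps and now - self.timestamps[0] > window:
            self.timestamps.popleft()
        return len(self.timestamps)


def assign(clusters, message, timestamp, threshold=0.2):
    """Add the message to the most similar cluster, or start a new one."""
    terms = tokens(message)
    best = max(clusters, key=lambda c: jaccard(terms, c.terms), default=None)
    if best is not None and jaccard(terms, best.terms) >= threshold:
        best.add(terms, timestamp)
        return best
    cluster = Cluster(terms, timestamp)
    clusters.append(cluster)
    return cluster
```

Calling `assign(clusters, text, time.time())` for each incoming message and sorting `clusters` by `rate(now)` would give the "popular right now" ordering. Note that a cluster's term set only grows in this sketch, so long-lived clusters may become too permissive; a real implementation would probably weight or expire terms.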
I would like to know whether there are other approaches to this problem, or ways to improve the two methods above.
Please also suggest some good clustering algorithms. I think the "K-Nearest Clustering" algorithm is apt for this situation.
Comments (3)
Check out Carrot2. This tool extracts tags from the text and clusters it. You can download it from here and check the algorithms implemented (Lingo, mainly) here.
Hope this helps.
Use Bayesian classification. Train the filter on a predefined corpus and (optionally) give users a way to refine it further by flagging messages that were categorized incorrectly.
Here are some examples of using the Bayesian classifier in NLTK (http://www.nltk.org/).
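As a rough illustration of the idea, here is a small multinomial naive Bayes classifier written with only the standard library. It is a stand-in sketch rather than NLTK's own classifier, and the class name `NaiveBayes`, its methods, and the whitespace tokenization are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of training messages
        self.vocabulary = set()

    def train(self, message, topic):
        """Record one labeled message; also usable for user corrections."""
        words = message.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, message):
        """Return the topic with the highest log-probability, or None if untrained."""
        words = [w for w in message.lower().split() if w in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic
```

When a user flags a message as miscategorized, calling `train(message, correct_topic)` again folds the correction into the counts, which is one simple way to get the refinement loop described above.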
I am working on something similar. If you are talking specifically about Twitter, I think hashtags are a good way to go. You could also perform some classification, but it should be enriched with an external knowledge base such as Wikipedia.
Anyway, if you find a better solution, please post it here.
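For the hashtag idea, a minimal sketch could look like the following. The function name `group_by_hashtag` and the regular expression are assumptions made for illustration; Twitter's real hashtag rules are more involved than `#` followed by word characters.

```python
import re
from collections import defaultdict

# Simplified pattern: '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(messages):
    """Map each lowercase hashtag to the messages that contain it."""
    groups = defaultdict(list)
    for message in messages:
        # Use a set so a tag repeated within one message is counted once.
        tags = {tag.lower() for tag in HASHTAG.findall(message)}
        for tag in tags:
            groups[tag].append(message)
    return dict(groups)
```

Messages without any hashtag are simply left out here; those are the ones that would need the classification step mentioned above.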