How to categorize without using classification or clustering algorithms?

Posted on 2024-11-28 04:23:20

I have a crawler program that stores sports data from 7 different news agencies every day. It stores about 1,200 sports news articles per day.
I want to categorize the news from the last two days into sub-categories. So every two days I have about 2,400 articles from exactly those days, and many of them cover exactly the same event.
For example:

70 articles are about Brad Keselowski's 500-mile race.

120 articles are about US swimmer Nyad beginning her swim.

28 articles are about the match between Man United and Man City.

. . .

In other words, I want to make something like Google News.

The problem is that this is not a classification problem, because I don't have predefined classes. For example, my classes are not swimming, golf, football, etc. My classes are the specific events, in every field, that happened in these two days. So I cannot use classification algorithms such as Naive Bayes.

On the other hand, my problem cannot be solved with clustering algorithms either, because I don't want to force the articles into n clusters. One article may have no similar articles at all, or one two-day batch may contain 12 different stories while another contains 30. So I cannot use clustering algorithms such as "Single Link (Maximum Similarity)", "Complete Link (Minimum Similarity)", "Maximum Weighted Matching" or "Group Average (Average Intra Similarity)".

I have some ideas of my own, for example, any two articles that share 10 common words should be in the same class. But unless we account for parameters such as document length, the influence of common and rare words, and other factors, this will not work well.

I have read this paper, but it did not answer my question.

Is there a known algorithm that solves this problem?
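
The word-overlap idea in the question, adjusted for document length and for common versus rare words, can be sketched with TF-IDF weights and cosine similarity. This is only an illustration under assumptions that are not in the original post: the function names, the simple tokenizer, and the threshold of 0.3 are all hypothetical and would need tuning on real articles. Linking every pair above the threshold and taking connected components is effectively single-link grouping cut at a similarity level, so the number of groups is not fixed in advance and an article with no close match stays on its own.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (dict term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF: rare words weigh more, words found everywhere weigh least.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        # Normalising removes the effect of document length.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def group_by_threshold(docs, threshold=0.3):
    """Group document indices whose similarity graph is connected."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(docs)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

Calling `group_by_threshold(articles)` on a list of article texts returns lists of indices, one list per story. The pairwise loop is quadratic, which is about 2.9 million comparisons for 2,400 articles.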

Comments (4)

鱼窥荷 2024-12-05 04:23:20

The problem strikes me as a clustering problem with an unknown quality measure for the clusters. That points to an unsupervised method, which is ultimately based on detecting correlations using redundancy in the data. Perhaps something like principal component analysis or latent semantic analysis could be useful. The different dimensions (principal components or singular vectors) would indicate distinct major themes, with the terms corresponding to the vector components hopefully being the words appearing in the description. One drawback is that there's no guarantee that the strongest correlations would lead easily to a sensible description.
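
A minimal sketch of the latent semantic analysis idea, assuming NumPy is available and using raw word counts for brevity (a real pipeline would more likely use TF-IDF weights). The function names and the choice of `k` are illustrative assumptions, not part of the answer above. Each singular vector is one candidate theme, and the terms with the largest loadings on it serve as a rough description.

```python
import numpy as np


def lsa_document_vectors(docs, k=2):
    """Build a term-document count matrix and keep the top k singular directions."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            X[index[w], j] += 1.0
    # X = U diag(s) Vt; columns of U are term loadings, rows of Vt are document loadings.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # shape (n_docs, k)
    return vocab, U[:, :k], s[:k], doc_vecs


def top_terms(vocab, U, component, n=3):
    """Terms with the largest absolute loading on one component."""
    order = np.argsort(-np.abs(U[:, component]))[:n]
    return [vocab[i] for i in order]
```

Documents about the same story end up pointing in nearly the same direction in the reduced space, so they can then be grouped by cosine similarity. As the answer notes, nothing guarantees that the strongest components read as sensible descriptions.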

满栀 2024-12-05 04:23:20

Take a look at "topic models" and "Latent Dirichlet Allocation". These are popular and you'll find code in a variety of languages.
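
For orientation, here is a small sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, written in plain Python so the mechanics are visible. It is a teaching sketch, not a substitute for a library implementation. The function name, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions chosen for illustration. Note that LDA still needs the number of topics up front, which is the constraint the question wanted to avoid.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """docs is a list of token lists; returns vocab, doc-topic and topic-word counts."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]        # tokens of doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                      # total tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, words in enumerate(docs):
        z_d = []
        for w in words:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][wid[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                v = wid[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word
```

The row of `doc_topic` for an article shows how its tokens are spread over topics, and the largest entries in a row of `topic_word` give that topic's characteristic words.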

抚笙 2024-12-05 04:23:20

You might use hierarchical clustering algorithms to investigate relationships between your items - the closest items (news with almost the same description) would be in the same cluster, the closest clusters (groups of similar news) would be in the same super-cluster, and so on.
Also, there is a pretty nice and fast algorithm called CLOPE - http://www.google.com.ua/url?sa=t&source=web&cd=11&sqi=2&ved=0CF0QFjAK&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.13.7142%26rep%3Drep1%26type%3Dpdf&rct=j&q=CLOPE&ei=gvo_Tsi4AsKa-gas-uCkAw&usg=AFQjCNGcR9sFqhsEkAJowEjIGbDBwSjeXw&cad=rja
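
The hierarchical approach does not have to fix the number of clusters: merging can simply stop once the closest pair of clusters is further apart than a chosen distance. Below is a rough sketch of that idea with group-average linkage. The Jaccard distance on word sets, the function names, and the cut-off of 0.7 are assumptions made for illustration only. It is not CLOPE, and it is far slower than a library implementation would be.

```python
def jaccard_distance(a, b):
    """1 minus the share of words the two sets have in common."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def agglomerate(docs, max_distance=0.7):
    """Merge the closest clusters until none are within max_distance."""
    sets = [set(d.lower().split()) for d in docs]
    n = len(sets)
    dist = [[jaccard_distance(sets[i], sets[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    merges = []  # record of (left, right, distance), i.e. the hierarchy

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Group-average linkage: mean distance over all cross-cluster pairs.
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_distance:
            break  # nothing left that is similar enough; singletons stay single
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    return sorted(sorted(c) for c in clusters), merges
```

The returned `merges` list keeps the order in which clusters were joined, so the cluster and super-cluster structure described above can be read back from it. For a batch of 2,400 articles, SciPy's `scipy.cluster.hierarchy.linkage` followed by `fcluster` with a distance criterion would likely be the more practical route.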

吃不饱 2024-12-05 04:23:20

There are many document clustering algorithms out there. Take a look at "Hierarchical document clustering using frequent itemsets", for example, and see if that is similar to what you want. If you're programming in Java, you may get some mileage out of the S-space package, which includes algorithms for latent semantic analysis (LSA) among others.
