Text categorization/classification algorithms

Posted 2024-09-16 09:40:21

Comments (7)

ゞ记忆︶ㄣ 2024-09-23 09:40:21

Doing this is not trivial. Obviously you can build a dictionary that maps certain keywords to categories, so that just finding a keyword in a text suggests a certain category.
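To make the dictionary idea concrete, here is a minimal sketch in Python. The keywords, the category names and the function name suggest_categories are all made up for illustration; they are not from the question.

```python
import re

# Hypothetical keyword-to-category dictionary; the entries are illustrative only.
KEYWORD_TO_CATEGORY = {
    "guitar": "music",
    "concert": "music",
    "compiler": "technology",
    "router": "technology",
    "knitting": "handicraft",
}

def suggest_categories(text):
    """Return the set of categories whose keywords appear verbatim in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {KEYWORD_TO_CATEGORY[t] for t in tokens if t in KEYWORD_TO_CATEGORY}

print(suggest_categories("The concert opened with a guitar solo."))  # {'music'}
print(suggest_categories("Several concerts were cancelled."))        # set()
```

The second call finds nothing because "concerts" is not literally a key of the dictionary, which is the limitation of plain keyword lookup.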

Yet, in natural language text, the keywords will usually not appear in their stem form. You would need some morphology tools to find the stem form and look that up in the dictionary.
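A rough sketch of looking up stems instead of surface forms. The function crude_stem is a toy suffix stripper that I made up to stand in for a real morphology tool (a proper stemmer or lemmatizer handles far more cases); the suffix list and the keywords are assumptions for illustration.

```python
import re

# Toy suffix stripper standing in for a real morphology tool.
# The suffix list is an illustrative assumption, not a real stemming algorithm.
SUFFIXES = ("ing", "es", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary, keyed by the stems of the keywords.
STEM_TO_CATEGORY = {
    crude_stem(keyword): category
    for keyword, category in {
        "concert": "music",
        "compiler": "technology",
        "paint": "handicraft",
    }.items()
}

def suggest_categories(text):
    """Stem every token, then look the stem up in the dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = (crude_stem(t) for t in tokens)
    return {STEM_TO_CATEGORY[s] for s in stems if s in STEM_TO_CATEGORY}

print(suggest_categories("Several concerts were cancelled."))  # {'music'}
```

Stemming the dictionary keys with the same function as the text keeps both sides consistent, whatever stemmer you end up using.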

But then somebody could write something like: "This article is not about ...". This would introduce the need for syntactic and semantic analysis.

And then you would find that certain keywords can be used in several categories: "band" could be used in music, technology, or even handicrafts. You would therefore need an ontology and statistical or other methods to weigh the probability of each category when the choice is not definite.
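One simple way to weigh ambiguous keywords is to let each keyword vote for several categories with different weights and add the votes up. This is only a sketch: the weights below are invented by hand for illustration, whereas in practice you would estimate them from labelled texts.

```python
import re
from collections import defaultdict

# Hypothetical keyword weights: how strongly each keyword points at a category.
# These numbers are made up; real weights would be estimated from labelled data.
KEYWORD_WEIGHTS = {
    "band": {"music": 0.6, "technology": 0.3, "handicraft": 0.1},
    "guitar": {"music": 1.0},
    "frequency": {"technology": 0.8, "music": 0.2},
    "rubber": {"handicraft": 0.7, "technology": 0.3},
}

def category_scores(text):
    """Sum keyword weights per category and normalise them to sum to 1."""
    totals = defaultdict(float)
    for token in re.findall(r"[a-z]+", text.lower()):
        for category, weight in KEYWORD_WEIGHTS.get(token, {}).items():
            totals[category] += weight
    norm = sum(totals.values())
    return {c: s / norm for c, s in totals.items()} if norm else {}

scores = category_scores("Which frequency band does the radio use?")
print(max(scores, key=scores.get))  # technology
```

On its own "band" leans towards music, but together with "frequency" the technology reading gets the larger total.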

Some of the keywords might not even be easy to fit into an ontology: is a mathematician closer to a programmer or to a gardener? But you said in your question that the categories are built by people, so they could also help build the ontology.

Have a look at computational linguistics here and on Wikipedia for further study.

Now, the narrower the field your texts come from, the more structured they are, and the smaller the vocabulary, the easier the problem becomes.

Again, some keywords for further study: morphology, syntax analysis, semantics, ontology, computational linguistics, indexing, keywording

转角预定愛 2024-09-23 09:40:21

There are multiple approaches to automatic text classification. A naive Bayes classifier is possibly the simplest of them. Another option is k-nearest neighbors. This Google answer on categorization of text might help you.
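To show how little is needed for naive Bayes, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, using only the Python standard library. The class name, the tokenizer and the four training sentences are my own toy assumptions; a real system would train on many labelled documents.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for illustration.
train_texts = [
    "the band played a loud concert",
    "new album from the guitar player",
    "the compiler reported a syntax error",
    "install the driver and reboot the server",
]
train_labels = ["music", "music", "technology", "technology"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("a guitar concert"))           # music
print(model.predict("server error after reboot"))  # technology
```

Scores are summed as logarithms so that long texts do not underflow to zero.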

家住魔仙堡 2024-09-23 09:40:21

Watch my video series on exactly this topic.

http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html

Classification is in video 5, but the other videos may help you get up to speed.

It's all based on the FOSS program RapidMiner.

傾城如夢未必闌珊 2024-09-23 09:40:21

Check out this example from scikit-learn. It applies a whole bunch of different algorithms, so you can compare the results.
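This is not the linked example, just a much smaller sketch in the same spirit, assuming scikit-learn is installed. It fits three classifiers on the same TF-IDF features so you can put their predictions side by side. The toy corpus and the compare function are my own assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration; far too small for real conclusions.
train_texts = [
    "the band played a loud concert",
    "new album from the guitar player",
    "the compiler reported a syntax error",
    "install the driver and reboot the server",
]
train_labels = ["music", "music", "technology", "technology"]

classifiers = {
    "naive Bayes": MultinomialNB(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=1),
    "linear SVM": LinearSVC(random_state=0),
}

def compare(test_texts):
    """Fit each classifier on the toy corpus and return its predictions."""
    results = {}
    for name, clf in classifiers.items():
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        pipeline.fit(train_texts, train_labels)
        results[name] = list(pipeline.predict(test_texts))
    return results

for name, predictions in compare(["a guitar concert", "reboot the server"]).items():
    print(name, predictions)
```

With real data you would hold out a test set and compare accuracy instead of eyeballing individual predictions.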

怪我鬧 2024-09-23 09:40:21

Support vector machine. Everyone loves support vector machines. You'll need to do quite a bit of reading, and perhaps even buy a book. But you could start by reading a paper to see if you like the idea.

煮茶煮酒煮时光 2024-09-23 09:40:21

The general term for these methods is "multivariate methods". That, together with a search on "text classification" or "text categorization", should bring up some useful leads. Good luck!

手心的温暖 2024-09-23 09:40:21

I've been looking for the answer to this question for quite a while. Today I found my answer.

There is an open-source program called "dbacl" that does this. It classifies documents into as many categories as you like (up to a certain maximum).

The other answers saying things like "not trivial" are all true, but having an easy-to-use package that does the hard stuff goes a long way toward making it manageable.
