Decision trees for document classification

Posted 2024-09-06 15:08:26

Is it possible to use decision trees for document classification, and if so, how should the data be represented?
I am familiar with the R package party for decision trees.

Comments (3)

夜唯美灬不弃 2024-09-13 15:08:26

One approach is to build a large matrix in which each row is a document and each column is a word. The value in each cell is the number of times that word appears in that document.
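As a rough illustration of that matrix, here is a minimal sketch in Python using only the standard library (the thread itself is about R, so this is an illustration rather than the answer's own method). The function names `tokenize` and `document_term_matrix` and the sample documents are made up for this example, and the tokenizer is deliberately naive.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    stripped = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in stripped if w]

def document_term_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, rows

# Made-up sample documents, one per row of the matrix.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]
vocabulary, matrix = document_term_matrix(docs)

# Print each document's counts alongside the shared vocabulary.
print(vocabulary)
for row in matrix:
    print(row)
```

With a real corpus the matrix is mostly zeros, so a sparse representation is usually a better fit than nested lists.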

Then, if you are dealing with a supervised learning problem, add one more column holding each document's class label. From there you can use a function such as rpart (from the rpart package) to build your classification tree. You pass rpart a formula, much as you would for a linear model (lm).
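In R the call would look something like `rpart(label ~ ., data = df, method = "class")`, where `label` and `df` are placeholder names. To show what such a tree does with word-count columns, here is a hedged Python sketch (standard library only) that searches for the single best first split using Gini impurity. It is not rpart's implementation, and the vocabulary, counts, and labels are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def best_split(rows, labels, vocabulary):
    """Find the test 'count of word <= threshold' with the lowest weighted Gini."""
    best = None  # (impurity, word, threshold)
    n = len(rows)
    for j, word in enumerate(vocabulary):
        # Candidate thresholds: every observed count except the largest.
        for threshold in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, word, threshold)
    return best

# Invented word counts: one row per document, one column per word.
vocabulary = ["goal", "market", "team"]
rows = [
    [2, 0, 1],
    [1, 0, 3],
    [0, 2, 0],
    [0, 1, 1],
]
labels = ["sports", "sports", "finance", "finance"]  # the extra label column

impurity, word, threshold = best_split(rows, labels, vocabulary)
print(word, threshold, impurity)
```

A full tree would apply the same search recursively to the documents on each side of the split.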

If you want, you could also first group your words into "groups of words", so that each column corresponds to a group of words and holds a number indicating how many words in the document belong to that group. For that I would have a look at the "tm" package. (If you end up doing something with this, please consider posting about it here so we can learn from it.)
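The word-group idea can be sketched the same way. This Python example (standard library only) assumes a hand-built dictionary of groups. The names `WORD_GROUPS` and `group_counts` and the group contents are hypothetical, and the sketch does not use or imitate the tm package.

```python
from collections import Counter

# Hypothetical word groups, chosen by hand for illustration only.
WORD_GROUPS = {
    "sports": {"goal", "team", "match"},
    "finance": {"market", "stocks", "price"},
}

def group_counts(text, groups=WORD_GROUPS):
    """Return one count per group (in sorted group-name order) for a document."""
    counts = Counter()
    for token in text.lower().split():
        for name, words in groups.items():
            if token in words:
                counts[name] += 1
    return [counts[name] for name in sorted(groups)]

# Columns come out in sorted order: finance, then sports.
row = group_counts("the team scored a late goal as the market watched")
print(sorted(WORD_GROUPS), row)
```

Grouping shrinks the matrix from one column per word to one column per group, which leaves a tree far fewer candidate splits to consider.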

携余温的黄昏 2024-09-13 15:08:26

This paper gives a survey of different text categorization techniques and their accuracies. In short, you can categorize text with decision trees, but there are other algorithms that are much better.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys. arXiv: cs.IR/0110053v1. Available from: http://arxiv.org/abs/cs.IR/0110053v1

鲜肉鲜肉永远不皱 2024-09-13 15:08:26

I doubt it -- at least as typically defined, a decision tree uses a single criterion to specify a sub-branch. In classifying documents, you can rarely base much of anything on a single criterion -- you need multiple criteria, and even then you don't get a clear-cut tree-like decision, but a "this is a bit closer to that than the other thing" kind of result.
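To make "a single criterion per sub-branch" concrete, here is a small hand-written tree in Python (standard library only). Every word, threshold, and proportion in it is invented. Each node tests one word count, but a document's path from root to leaf strings several such tests together, and a leaf can carry class proportions instead of one hard label.

```python
# A hypothetical tree; words, thresholds, and proportions are made up.
TREE = {
    "word": "goal", "threshold": 0,
    "left": {  # taken when the "goal" count is <= 0
        "word": "market", "threshold": 0,
        "left": {"leaf": {"sports": 0.5, "finance": 0.5}},
        "right": {"leaf": {"sports": 0.1, "finance": 0.9}},
    },
    "right": {"leaf": {"sports": 0.8, "finance": 0.2}},  # "goal" count > 0
}

def classify(tree, word_counts):
    """Walk the tree; return (tests applied on the path, leaf class proportions)."""
    path = []
    node = tree
    while "leaf" not in node:
        count = word_counts.get(node["word"], 0)
        go_left = count <= node["threshold"]
        path.append((node["word"], "<=" if go_left else ">", node["threshold"]))
        node = node["left"] if go_left else node["right"]
    return path, node["leaf"]

# A document with no "goal" but three occurrences of "market".
path, proportions = classify(TREE, {"market": 3, "price": 1})
print(path)
print(proportions)
```

Whether a handful of such tests is enough for real documents is the open question this answer raises.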
