Interactive decision tree classifier

Published 2024-09-09 10:51:39

Can anyone recommend a decision tree classifier implementation, in either Python or Java, that can be used incrementally?

All the implementations I've found require you to provide all the features to the classifier at once in order to get a classification. However, in my application, I have hundreds of features, and some of the features are functions that can take a long time to evaluate. Since not all branches of the tree may use all the features, it doesn't make sense to give the classifier all the features at once. I'd like the classifier to ask for features, one at a time, in the order that they are needed to make the maximum reduction in entropy and provide a final classification.
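For illustration, the kind of interface being asked for might look like the following sketch: a classifier that pulls feature values through a callback, so only the features on the branch actually taken are ever evaluated. All names here are hypothetical, not from any existing library.

```python
# Sketch of an "incremental" classifier interface: the tree asks for
# feature values one at a time, via a callback, only along the branch
# actually taken, so expensive features off the path are never computed.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # feature requested at this node
        self.threshold = threshold
        self.left = left            # branch taken when value <= threshold
        self.right = right

def classify(node, get_feature):
    """get_feature(name) is called lazily, only for features on the path."""
    while isinstance(node, Node):
        value = get_feature(node.feature)
        node = node.left if value <= node.threshold else node.right
    return node.label

# Example: a tiny hand-built tree.
tree = Node("fever", 0.5,
            Leaf("healthy"),
            Node("headache", 0.5, Leaf("cold"), Leaf("migraine")))

asked = []
def get_feature(name):
    asked.append(name)              # record which features were evaluated
    return {"fever": 1, "headache": 1}[name]

print(classify(tree, get_feature))  # -> migraine
print(asked)                        # only the features on the taken path
```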


6 Answers

烟燃烟灭 2024-09-16 10:51:39

I believe there is no such implementation, but decision trees are so simple to implement that you shouldn't have any problem writing such a program on your own.
On the other hand, I don't think the idea of computing features on the fly can increase speed, because even if some feature was used to make a previous split, it still must be considered on the rest of them, so for many records it would be recalculated many times (it may save memory, though).
This could make sense in the case of a random forest, where only a random, limited subset of features is considered at each split -- still, RF is usable only as a classifier; it won't build you nice, human-interpretable decision trees.
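For reference, the entropy-based split selection such a hand-rolled tree would need can be sketched as follows. This is a minimal illustration, not tied to any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Reduction in entropy from splitting on a discrete feature."""
    total = entropy(labels)
    n = len(labels)
    gain = total
    for v in set(feature_values):
        subset = [l for l, fv in zip(labels, feature_values) if fv == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

labels  = ["yes", "yes", "no", "no"]
feature = [1, 1, 0, 0]                    # perfectly separates the classes
print(information_gain(labels, feature))  # -> 1.0 (perfect split)
```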

小兔几 2024-09-16 10:51:39


Usually such packages (J48 trees in Weka in particular) allow you to specify a missing value in place of a feature value, which will be dealt with the same way the C4.5 algorithm would do it:

when we reach the node splitting by that attribute with missing value, we send the instance down each possible branch weighted proportional to the number of training instances going down those branches, eventually accumulating the result at the leaf nodes.
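The quoted missing-value behavior can be sketched roughly like this. It is a simplified illustration with a hypothetical node structure, not Weka's actual code:

```python
# C4.5-style handling of a missing value: the instance is sent down every
# branch, weighted by the fraction of training instances that went that
# way, and the weights are accumulated per class label at the leaves.

def classify_with_missing(node, instance, weight=1.0, scores=None):
    if scores is None:
        scores = {}
    if node.get("leaf"):
        scores[node["label"]] = scores.get(node["label"], 0.0) + weight
        return scores
    value = instance.get(node["feature"])     # None means "missing"
    if value is None:
        total = sum(node["counts"].values())  # training counts per branch
        for branch, count in node["counts"].items():
            classify_with_missing(node[branch], instance,
                                  weight * count / total, scores)
    else:
        branch = "left" if value <= node["threshold"] else "right"
        classify_with_missing(node[branch], instance, weight, scores)
    return scores

tree = {"feature": "fever", "threshold": 0.5,
        "counts": {"left": 30, "right": 70},   # from training data
        "left":  {"leaf": True, "label": "healthy"},
        "right": {"leaf": True, "label": "sick"}}

print(classify_with_missing(tree, {}))            # fever missing: weighted mix
print(classify_with_missing(tree, {"fever": 1}))  # fever known: single branch
```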

Of course you could take a more aggressive approach and change the way the tree chooses the attributes to split by during the training phase. A naive way would be to assign weights to attributes and to multiply the splitting criterion (entropy, information gain, ...) by that weight as a penalty coefficient, so that "expensive attributes" would be less likely to be chosen as the splitting node.
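One possible shape for that penalty idea, purely illustrative (the gain and cost numbers are made up, and the penalty scheme is just one simple choice):

```python
# Cost-penalized split selection: scale each attribute's raw information
# gain down by a cost weight, so expensive attributes are less likely
# to win the split even when their raw gain is slightly higher.

def penalized_gain(gain, cost, alpha=1.0):
    """Higher cost -> lower effective score (one simple penalty scheme)."""
    return gain / (cost ** alpha)

candidates = {
    "cheap_symptom":  {"gain": 0.30, "cost": 1.0},
    "expensive_test": {"gain": 0.35, "cost": 5.0},
}

best = max(candidates,
           key=lambda a: penalized_gain(candidates[a]["gain"],
                                        candidates[a]["cost"]))
print(best)  # the cheap attribute wins despite slightly lower raw gain
```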

情话墙 2024-09-16 10:51:39


One option might be gaenari, a C++ incremental decision tree algorithm.

It is designed to minimize the damage from concept drift.

It is based on C++, but can be extended to Python and Java.

时光与爱终年不遇 2024-09-16 10:51:39


Are you concerned about this during training time or at classification time? Since those periods are separate, you can play tricks to avoid evaluating features until very late if it's the latter. There are no tricks you can play during training time. You have to provide all the features at the time of training. However, since that can happen outside your program, you don't have to be as concerned about calculation time. Training the tree is the most intensive part.

So I'd suggest getting all your data together, training on it, taking the product from training, and then using lazy evaluation in your objects as you send them down the tree. Have your objects implement some interface for getting values, so that your code can evaluate things lazily. If an object never hits a node requiring an expensive value, then you don't have to evaluate it.
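The lazy-evaluation interface suggested here might be sketched like this in Python; the feature names and functions are hypothetical stand-ins:

```python
# Feature values are computed only when the tree actually asks for them,
# and are cached afterwards so repeated lookups are free.

class LazyInstance:
    def __init__(self, feature_funcs):
        self._funcs = feature_funcs   # name -> zero-argument callable
        self._cache = {}
        self.evaluated = []           # tracks what was actually computed

    def __getitem__(self, name):
        if name not in self._cache:
            self.evaluated.append(name)
            self._cache[name] = self._funcs[name]()
        return self._cache[name]

instance = LazyInstance({
    "cheap":     lambda: 1,
    "expensive": lambda: very_slow_computation(),  # never called below
})

# A tree walk that only ever touches "cheap" never pays for "expensive":
result = "A" if instance["cheap"] > 0 else "B"
print(result, instance.evaluated)  # -> A ['cheap']
```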

Your expensive calculations might not be chosen as splits anyway, in which case you eliminate the need to evaluate them at classification time. You'll probably only have 3-5 features that are statistically relevant once you train and prune your tree. Then you can optimize only those features with caching and such, so classification is fast.

If you want incremental training, that's a whole other ball of wax, and there are algorithms for that. But they aren't explained very well, and you'll have to dig into research papers to get them.

旧人九事 2024-09-16 10:51:39


So here is what I'd do. Given the answer to my previous question, I think you have something like the following. It sounds like you want to implement some sort of twenty-questions-like approach.

With twenty questions you have yes/no answers, so a binary tree works best. However, you could layer in multiple-choice options, where the user picks one choice. So this algorithm assumes you've trained your tree ahead of time and that it's been built from the dataset you wish to use.

Say for example we're trying to do a medical diagnosis so our data might look like the following:

Disease Name  Head Ache   Fever  Back Pain  Leg Pain  Blurry Vision  Hearing Loss
Common Cold   Yes         Yes    No         No        No             No
Migraine      Yes         No     No         No        Yes            No
Herpes        No          Yes    No         No        No             No

In this example, Head Ache, Fever, Back Pain, Leg Pain, etc are the influencers, and Disease Name is the target. Each row would be an actual diagnosis of a single patient so a disease could be repeated in the data more than once.

  1. Modify a walk algorithm to start at the root.
  2. If you've reached a leaf tell the user the potential answers.
  3. Take the influencer used to split this node, present it to the user, and ask the Yes/No question ("Do you have a Head Ache?").
  4. Go left if the user answers Yes.
  5. Go Right if the user answers No.
  6. Goto Step 2
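The steps above can be sketched as follows. The tree structure is hypothetical, and the answers are scripted so the example runs non-interactively (in a real program, `answer` would prompt the user):

```python
# Interactive walk down a trained yes/no tree: ask one question per
# internal node, branch on the answer, and report the leaf's rows.

def walk(node, answer):
    """answer(question) returns True for Yes, False for No."""
    while "leaf" not in node:
        question = f"Do you have a {node['influencer']}?"
        node = node["yes"] if answer(question) else node["no"]
    return node["leaf"]          # the rows that made it to this leaf

tree = {"influencer": "Head Ache",
        "yes": {"influencer": "Blurry Vision",
                "yes": {"leaf": ["Migraine"]},
                "no":  {"leaf": ["Common Cold"]}},
        "no":  {"leaf": ["Herpes"]}}

answers = iter([True, True])     # Head Ache: Yes, Blurry Vision: Yes
print(walk(tree, lambda q: next(answers)))  # -> ['Migraine']
```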

In the leaf nodes you'll have the actual rows that made it to that location, so you can display them to the user, saying you might have one of these:

Head ache
Migraine
Severed Head

Prescription is: blah blah blah.

With 1 million influencers it will take a while to build the tree. If you wanted to lower that, it might be possible to use multi-valued influencers instead of yes/no, although it's really hard to think of 1 million unique yes/no questions even covering every medical condition. Once you build the tree, it can offer as many diagnoses as you want.

独自←快乐 2024-09-16 10:51:39


The Python decision tree package at
https://pypi.python.org/pypi/DecisionTree
has an interactive mode for using a decision tree. It is not incremental in the sense you need. However, it may be possible to change the code in its interactive-operation function so that you can see the results incrementally.
