Multi-target decision tree
Here is my problem. I need to implement a multi-target decision tree algorithm. Multi-target learning is an extension of multi-label learning in which the labels are not restricted to binary values but can be continuous, categorical and so on. For example, a label vector for a multi-label classification problem could look like {1,0,1,0,0,0,1}, while one for a multi-target problem could look like {2,35,3,-2,24}.
My question is this: if I have a label that takes 3 discrete values, how do I represent it in a vector?
Let's say I have a label called job that takes 3 values: mechanic, teacher and athlete. How can I encode this label in order to use it in a vector?
At each node of the decision tree, in order to find my split, I need to compute the mean vector of all the label vectors in that node (I am using the variance method equation to find my split). If I had only binary labels this would be easy, because averaging 0s and 1s doesn't pose any problem. If I code these 3 jobs as 0, 1, 2, there is a problem: a label vector with the job athlete (coded 2) contributes more to the sum than one with the job mechanic (coded 0), so the mean vector is inaccurate.
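To make the computation described above concrete, here is a minimal sketch. The exact "variance method equation" is not given in the question, so this assumes it means summing the per-target variances around the node's mean vector and weighting the two children by size. The names `node_variance` and `split_score` are made up for illustration.

```python
import numpy as np

def node_variance(Y):
    """Sum over targets of the variance around the node's mean vector.

    Y has shape (n_samples, n_targets): one label vector per row.
    """
    Y = np.asarray(Y, dtype=float)
    mean_vector = Y.mean(axis=0)
    # Per-target mean squared deviation, then summed across targets.
    return float(((Y - mean_vector) ** 2).mean(axis=0).sum())

def split_score(Y_left, Y_right):
    """Size-weighted variance of the two child nodes (assumed criterion); lower is better."""
    n_left, n_right = len(Y_left), len(Y_right)
    total = n_left + n_right
    return (n_left * node_variance(Y_left) + n_right * node_variance(Y_right)) / total
```

With this kind of criterion every label column is treated as a number, which is why the numeric coding of a categorical label matters.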
Let's take an example. I have these 3 labels:
job: {mechanic,teacher,athlete}
married:{yes,no}
age: continuous value
It is easy to say that the married label can be coded as {0,1} and the age label as a continuous number. But how can I code the job label? Coding it as {0,1,2} causes the following problem. Imagine 2 label vectors in a node: {0,0,45}, which corresponds to mechanic, married, 45 years old, and {2,1,48}, which corresponds to athlete, not married, 48 years old. The mean vector is {1,0.5,46.5}. With this vector I can predict that the age of an instance that falls into that node is 46.5, I can say that the instance is not married (with a rule that says greater than or equal to 0.5 is 1), and I can say that its job is teacher. The teacher prediction is totally wrong, since nobody in the node is a teacher, while the others are OK. You see now the problem of coding categorical labels. Any help or advice? Thanks :D
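The arithmetic of this example can be reproduced in a few lines. This is only a sketch of the problem, using the integer coding 0, 1, 2 for the jobs and the ages 45 and 48 that give the mean of 46.5. The `JOBS` list and the rounding rule for reading the job off the mean are assumptions for illustration.

```python
import numpy as np

JOBS = ["mechanic", "teacher", "athlete"]  # integer codes 0, 1, 2

# Label vectors in the node: (job code, married code, age)
node = np.array([
    [0, 0, 45],  # mechanic, married, 45
    [2, 1, 48],  # athlete, not married, 48
], dtype=float)

mean_vector = node.mean(axis=0)

# Rounding the averaged job code back to a category gives a job
# that does not occur in the node at all.
predicted_job = JOBS[int(round(mean_vector[0]))]

print(mean_vector.tolist())  # [1.0, 0.5, 46.5]
print(predicted_job)         # teacher
```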
Comments (1)
How about taking all the discrete values of a label and turning each value into its own 0/1 entry whenever the label has more than 2 values? For example, the three labels from your question (job, married, age)
will result in a 5-dimensional vector:
(mechanic 0/1, teacher 0/1, athlete 0/1, married 0/1, age 0-inf)
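A minimal sketch of this encoding follows. The helper names `encode` and `decode_mean` are hypothetical, and the third person in the node is invented to make the averages easy to read. Picking the job with the largest mean entry is one possible prediction rule, not something stated in the question or the comment. The married entry keeps the 0/1 coding and the 0.5 threshold from the question.

```python
import numpy as np

JOBS = ["mechanic", "teacher", "athlete"]

def encode(job, married_code, age):
    """Build (mechanic, teacher, athlete, married, age) with the job one-hot encoded."""
    one_hot = [1.0 if job == name else 0.0 for name in JOBS]
    return np.array(one_hot + [float(married_code), float(age)])

def decode_mean(mean_vector):
    """Read a prediction off a node's mean vector."""
    job_shares = mean_vector[:3]               # fraction of each job in the node
    job = JOBS[int(np.argmax(job_shares))]     # ties go to the first maximum
    married_code = int(mean_vector[3] >= 0.5)  # threshold rule from the question
    return job, married_code, float(mean_vector[4])

node = np.stack([
    encode("mechanic", 0, 45),
    encode("athlete", 1, 48),
    encode("athlete", 1, 51),
])
mean_vector = node.mean(axis=0)
prediction = decode_mean(mean_vector)
print(prediction)  # ('athlete', 1, 48.0)
```

The three job entries of the mean vector are the shares of each job in the node, so the teacher entry stays at 0 when no teacher is present, and adding an athlete no longer counts for more than adding a mechanic.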