Missing values in ARFF (Weka)
How do classifiers in Weka (such as decision trees) interpret '?' (which stands for a missing value in ARFF files) during the learning stage?
Will Weka simply replace it with some predefined value (e.g. '0' or 'false'), or will it affect the training process in some other way?
Apart from treating a missing value as an attribute value in its own right, the J48 classifier handles any split on an attribute with missing values by weighting the affected instances in proportion to the frequencies of the observed non-missing values. This is documented in Witten and Frank's textbook, Data Mining: Practical Machine Learning Tools and Techniques (2nd ed., 2005, p. 63 and p. 191).
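To make the proportional weighting concrete, here is a minimal sketch in plain Python. It is not Weka's actual code: the function name `split_with_fractional_weights` and the tiny dataset are made up for illustration, and it only shows the idea of sending an instance with a missing value down every branch with a fractional weight.

```python
from collections import Counter

MISSING = "?"  # ARFF marker for a missing value


def split_with_fractional_weights(instances, attribute):
    """Partition weighted instances on `attribute` (hypothetical helper).

    `instances` is a list of (row_dict, weight) pairs. Instances whose value
    is missing are copied into every branch, with their weight scaled by the
    share of observed (non-missing) weight that falls into that branch.
    """
    # Total weight of the observed values, per attribute value.
    observed = Counter()
    for row, weight in instances:
        if row[attribute] != MISSING:
            observed[row[attribute]] += weight

    total = sum(observed.values())
    if total == 0:
        raise ValueError("no observed values to split on")

    branches = {value: [] for value in observed}
    for row, weight in instances:
        value = row[attribute]
        if value != MISSING:
            branches[value].append((row, weight))
        else:
            # Spread the instance over all branches, proportionally.
            for branch_value, branch_weight in observed.items():
                fraction = branch_weight / total
                branches[branch_value].append((row, weight * fraction))
    return branches


# Three "sunny" rows, one "rainy" row, and one row with a missing value.
data = [({"outlook": "sunny"}, 1.0) for _ in range(3)]
data.append(({"outlook": "rainy"}, 1.0))
data.append(({"outlook": MISSING}, 1.0))

branches = split_with_fractional_weights(data, "outlook")
sunny_weight = sum(w for _, w in branches["sunny"])  # 3 + 3/4 = 3.75
rainy_weight = sum(w for _, w in branches["rainy"])  # 1 + 1/4 = 1.25
```

Note that the total weight (5.0) is preserved: the row with the missing value is neither dropped nor replaced by a default such as '0' or 'false'.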
More information about handling missing values in decision trees, such as the surrogate splits used in CART (but not in C4.5 or J48, its implementation in Weka), can be found in the wiki section for Classification Trees; the use of imputation is also discussed in several articles, e.g. "Handling missing data in trees: surrogate splits or statistical imputation".
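For comparison, the simplest form of the imputation mentioned above fills each missing value with the mean (numeric attributes) or the most frequent value (nominal attributes) of the observed data. The sketch below is plain Python with a hypothetical helper `impute_column`; it illustrates the idea only and is not taken from Weka or from the cited article.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # ARFF marker for a missing value


def impute_column(values, numeric):
    """Replace '?' with the mean (numeric) or mode (nominal) of the column."""
    observed = [v for v in values if v != MISSING]
    if not observed:
        # Nothing to learn a fill value from; leave the column unchanged.
        return list(values)
    if numeric:
        fill = mean(float(v) for v in observed)
        return [fill if v == MISSING else float(v) for v in values]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in values]


temperatures = impute_column(["64", "?", "72"], numeric=True)
outlooks = impute_column(["sunny", "?", "sunny", "rainy"], numeric=False)
```

Unlike the fractional weighting used by J48, this changes the data before training, so the classifier no longer knows which values were originally missing.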