WEKA: attribute scaling issue
I have a training data set and multiple test sets (I'm classifying instances in a clustering framework, so the test-set instances are computed on the fly).
The instance attributes have different scales (the first one ranges from 0 to 1, the second from 0 to 100).
How do my classifiers (logistic regression and SMO) cope with not having the entire test set at once?
In other words, how do they handle attributes on different scales if they don't know the maximum value in the test set?
Thanks
Comments (1)
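According to the Weka Javadocs, SMO "normalizes all attributes by default. (Note that the coefficients in the output are based on the normalized/standardized data, not the original data.)" The scaling parameters (each attribute's minimum and maximum) are computed from the training set and reused for every instance classified afterwards, so the classifier never needs to see the whole test set. The catch is that if your training set doesn't cover the full range of each attribute, test values outside that range will be normalized incorrectly (they land outside [0, 1]). How bad that is depends on your data.
I suggest you try training both with and without normalization and see what works best. As far as I can tell, in current Weka releases attribute normalization is switched off with setFilterType(new SelectedTag(SMO.FILTER_NONE, SMO.TAGS_FILTER)); setFeatureSpaceNormalization(false), found in older releases, controls kernel feature-space normalization rather than attribute scaling.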
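
To make the point about training-set ranges concrete, here is a minimal plain-Java sketch of min-max scaling where the ranges are fixed at training time. It is only an illustration of the idea, not Weka's actual code: the class name TrainRangeScaler and its methods are hypothetical, and mapping a constant attribute to 0 is an assumption made here for simplicity.

```java
// Hypothetical helper (not part of Weka): min-max scaling whose ranges
// are learned once from the training data and reused for later instances.
final class TrainRangeScaler {
    private final double[] min;
    private final double[] max;

    // Learn each attribute's minimum and maximum from the training rows only.
    TrainRangeScaler(double[][] trainingRows) {
        int n = trainingRows[0].length;
        min = new double[n];
        max = new double[n];
        for (int j = 0; j < n; j++) {
            min[j] = Double.POSITIVE_INFINITY;
            max[j] = Double.NEGATIVE_INFINITY;
        }
        for (double[] row : trainingRows) {
            for (int j = 0; j < n; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
    }

    // Scale a single instance with the stored training ranges.
    // No statistics from the test set are needed, so instances can
    // arrive one at a time.
    double[] scale(double[] instance) {
        double[] out = new double[instance.length];
        for (int j = 0; j < instance.length; j++) {
            double range = max[j] - min[j];
            // Assumption: a constant training attribute is mapped to 0.
            out[j] = range == 0.0 ? 0.0 : (instance[j] - min[j]) / range;
        }
        return out;
    }
}
```

With training rows whose second attribute spans 10 to 60, a test instance with a value of 35 scales to 0.5, while a value of 110 scales to 2.0. That second case is the "erroneous normalization" mentioned above: nothing fails, but the classifier sees a value outside the range it was trained on.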