svmlib scaling vs. pyml normalization, scaling, and transformation
What is the proper way to normalize feature vectors for use in a linear-kernel SVM?
Looking at LIBSVM, it appears this is done simply by rescaling each feature to a single standard upper/lower range (svm-scale defaults to [-1, 1]). However, PyML doesn't seem to provide a way to scale the data this way. Instead, there are options to normalize the vectors by their length, shift each feature value by its mean while rescaling by the standard deviation, and so on.
I am dealing with a case when most features are binary, except a few that are numeric.
I am not an expert in this, but I believe centering and scaling each feature (subtracting its mean and then dividing by its standard deviation) is a typical way to normalize feature vectors for use with SVMs. In R, this can be done with the scale function.
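Since the question is about PyML, here is a minimal NumPy sketch of the same centering-and-scaling idea; the feature matrix is made up for illustration:

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Center each feature (column) to zero mean, then rescale to unit
# standard deviation, analogous to what R's scale() does by default.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)  # sample standard deviation, as in R
X_std = (X - mean) / std

print(X_std)
# Each column now has mean 0 and (sample) standard deviation 1.
```

Note that the statistics (mean, std) should be computed on the training set only and then reused to transform the test set.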
Another way is to transform each feature to the [0, 1] range (min-max scaling).
Maybe some features could benefit from a log transformation if their distribution is very skewed, but that would also change the shape of the distribution rather than just shift it.
I am not sure what you gain in an SVM setting by normalizing the vectors by their L1 or L2 norm, as PyML does with its normalize method. I would guess binary features (0 or 1) don't need to be normalized.
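For contrast, that kind of per-sample normalization rescales each sample vector to unit length rather than operating on features. A sketch with NumPy on made-up vectors:

```python
import numpy as np

# Hypothetical sample vectors (one per row).
X = np.array([[3.0, 4.0],
              [0.0, 5.0]])

# Divide each row by its L2 norm so every sample has unit length.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

print(X_unit)
```

After this, only the direction of each sample matters, not its magnitude, which is a different effect from per-feature scaling.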