How can I use both binary and continuous features in the k-nearest-neighbor algorithm?
My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:
For example, with symmetric vs. asymmetric represented as 0 and 1, and a less important ratio ranging from 0 to 100, flipping from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.
I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
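A quick numeric illustration of the imbalance described above (a minimal sketch assuming NumPy; the feature layout `[symmetry, ratio]` is just for this example):

```python
import numpy as np

# Two points that differ only in the binary symmetry flag (0 vs 1)
a = np.array([0.0, 50.0])   # [symmetry, ratio]
b = np.array([1.0, 50.0])

# Two points that differ only in the ratio, by 25
c = np.array([0.0, 50.0])
d = np.array([0.0, 75.0])

flip_dist = np.linalg.norm(a - b)    # flipping symmetry moves the point by 1.0
ratio_dist = np.linalg.norm(c - d)   # a ratio change of 25 moves it by 25.0
```

Under plain Euclidean distance, the "less important" ratio feature dominates the binary one by a factor of 25.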
3 Answers
You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.

It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
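This suggestion can be sketched in a few lines of Python (a minimal illustration assuming NumPy; the helper name `std_scaled_distance` and the sample data are my own):

```python
import numpy as np

def std_scaled_distance(u, v, stds):
    """Euclidean distance after dividing each feature by its standard deviation."""
    return np.linalg.norm((u - v) / stds)

# Toy data: column 0 is the binary symmetry flag, column 1 the 0-100 ratio
X = np.array([[0.0, 10.0],
              [1.0, 60.0],
              [0.0, 35.0],
              [1.0, 90.0]])

stds = X.std(axis=0)                 # per-feature standard deviation
d = std_scaled_distance(X[0], X[1], stds)
```

After scaling, a flip of the binary feature and a typical change in the ratio contribute distances of comparable magnitude.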
If I correctly understand your question, normalizing (aka 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighted dimensions.
In R, for instance, you can write this function:
which works like this:
You can also try the Mahalanobis distance instead of Euclidean; it rescales by the covariance of the data, so it handles both differing scales and correlated features.
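A minimal sketch of the Mahalanobis distance in Python (assuming NumPy; the helper name `mahalanobis` and the sample data are my own):

```python
import numpy as np

def mahalanobis(u, v, VI):
    """Mahalanobis distance between u and v, given the inverse covariance VI."""
    delta = u - v
    return float(np.sqrt(delta @ VI @ delta))

X = np.array([[0.0, 10.0],
              [1.0, 60.0],
              [0.0, 35.0],
              [1.0, 90.0]])

# Inverse of the feature covariance matrix, estimated from the data
VI = np.linalg.inv(np.cov(X, rowvar=False))
d = mahalanobis(X[0], X[1], VI)
```

With the identity matrix in place of `VI`, this reduces to the ordinary Euclidean distance.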