How can I use both binary and continuous features in the k-nearest-neighbor algorithm?

Posted 2024-10-05 08:43:33


My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:

With symmetric vs. asymmetric represented as 0 and 1, and some less important ratio ranging from 0 to 100, flipping from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.

I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
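The imbalance described above is easy to demonstrate. The sketch below uses hypothetical feature vectors of the form (symmetry flag, ratio), mirroring the 0/1 flag and 0–100 ratio from the question:

```python
import math

# Hypothetical vectors: (symmetry in {0, 1}, ratio in [0, 100]).
a = (0, 50.0)   # symmetric, ratio 50
b = (1, 50.0)   # asymmetric, same ratio
c = (0, 75.0)   # symmetric, ratio changed by 25

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

print(euclidean(a, b))  # flipping symmetry moves the point by only 1.0
print(euclidean(a, c))  # changing the ratio by 25 moves it by 25.0
```

Flipping the binary feature contributes 25 times less to the distance than a modest change in the ratio, which is why the binary feature is effectively drowned out.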

3 Answers

地狱即天堂 2024-10-12 08:43:33


You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.

It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
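A minimal sketch of that scaling, in Python rather than R, with made-up feature columns. It computes a z-score (subtracting the mean as well as dividing by the standard deviation); the mean shift cancels out when distances are taken, so only the division by the standard deviation matters here:

```python
import statistics

def standardize(column):
    """Scale a feature column by its standard deviation (z-score)."""
    mu = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mu) / sd for x in column]

ratio    = [10.0, 35.0, 60.0, 85.0]   # wide-ranging continuous feature
symmetry = [0.0, 1.0, 0.0, 1.0]       # binary feature

ratio_z    = standardize(ratio)
symmetry_z = standardize(symmetry)
# After standardization both columns have unit standard deviation,
# so neither dominates the Euclidean distance.
```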

紫瑟鸿黎 2024-10-12 08:43:33


If I correctly understand your question, normalizing (aka 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighted dimensions, e.g.,

ev_scaled = (ev_raw - ev_min) / (ev_max - ev_min)

In R, for instance, you can write this function:

ev_scaled = function(x) {
    (x - min(x)) / (max(x) - min(x))
}  

which works like this:

# generate some data: 
# v1, v2 are two expectation variables in the same dataset 
# but have very different 'scale':
> v1 = seq(100, 550, 50)
> v1
  [1] 100 150 200 250 300 350 400 450 500 550
> v2 = sort(sample(seq(.1, 20, .1), 10))
> v2
  [1]  0.2  3.5  5.1  5.6  8.0  8.3  9.9 11.3 15.5 19.4
> mean(v1)
  [1] 325
> mean(v2)
  [1] 8.68

# now normalize v1 & v2 using the function above:
> v1_scaled = ev_scaled(v1)
> v1_scaled
  [1] 0.000 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
> v2_scaled = ev_scaled(v2)
> v2_scaled
  [1] 0.000 0.172 0.255 0.281 0.406 0.422 0.505 0.578 0.797 1.000
> mean(v1_scaled)
  [1] 0.5
> mean(v2_scaled)
  [1] 0.442
> range(v1_scaled)
  [1] 0 1
> range(v2_scaled)
  [1] 0 1

拍不死你 2024-10-12 08:43:33


You can also try Mahalanobis distance instead of Euclidean.
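The Mahalanobis distance rescales (and de-correlates) features automatically via the inverse covariance matrix of the data, so mixed binary and continuous columns contribute on a comparable footing. A sketch with NumPy; the dataset below is made up, with column 0 a binary symmetry flag and column 1 a 0–100 ratio:

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between x and y, using the covariance of `data`."""
    cov = np.cov(data, rowvar=False)          # feature covariance matrix
    inv_cov = np.linalg.inv(cov)              # assumes cov is invertible
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ inv_cov @ diff))

# Hypothetical dataset: binary flag and wide-ranging ratio.
data = np.array([[0, 10], [1, 35], [0, 60],
                 [1, 85], [0, 20], [1, 70]], dtype=float)

d = mahalanobis([0, 50], [1, 50], data)  # flipping the flag now registers
```

Note that the covariance matrix must be invertible, which fails if any feature is constant or a perfect linear combination of others; in that case a pseudo-inverse or regularization is needed.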
