Transforming features to increase similarity
I have a large dataset (~20,000 samples x 2,000 features, each sample with a corresponding y-value) for which I'm building a regression ML model.
The input vectors are bit vectors, with a 1 or a 0 at each position.
Interestingly, I have noticed that when I 'randomly' select N samples whose y-values all lie between two arbitrary values A and B (where B-A is much smaller than the total range of y), the resulting model is much better at predicting other values within the A-->B range that were not used to train it.
However, the input X vectors for these samples are, overall, no more similar to one another than any random selection of X vectors from the whole dataset.
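One way such a comparison might be run is sketched below. It is only an illustration on synthetic data, assuming cosine similarity as the measure; the function names `mean_pairwise_cosine` and `compare_window_to_random`, the window bounds, and the sample sizes are all hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Mean cosine similarity over all distinct pairs of rows in X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero rows as zero vectors
    U = X / norms                    # unit-length rows
    S = U @ U.T                      # cosine similarity matrix
    iu = np.triu_indices(len(X), k=1)  # strictly upper triangle: distinct pairs
    return S[iu].mean()

def compare_window_to_random(X, y, a, b, n, seed=0):
    """Return (similarity inside the y-window [a, b], similarity of a random draw)."""
    rng = np.random.default_rng(seed)
    in_window = np.flatnonzero((y >= a) & (y <= b))
    window_idx = rng.choice(in_window, size=min(n, in_window.size), replace=False)
    random_idx = rng.choice(len(y), size=window_idx.size, replace=False)
    return mean_pairwise_cosine(X[window_idx]), mean_pairwise_cosine(X[random_idx])

# Synthetic stand-in data (assumption): random bit vectors and unrelated y-values.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(500, 200))
y_demo = rng.uniform(0.0, 100.0, size=500)

window_sim, random_sim = compare_window_to_random(X_demo, y_demo, a=40.0, b=50.0, n=30)
print(f"window: {window_sim:.3f}  random: {random_sim:.3f}")
```

If the two numbers are about equal on the real data, the y-window samples are no closer together in X-space than an arbitrary subset, which is the situation described above.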
Is there an available method to transform the input X vectors so that those with more similar y-values end up "closer" together (I'm not particular about the methodology, but the measure could be something like cosine similarity), while those with dissimilar y-values are pushed apart?
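To make the kind of transform being asked about concrete, here is a deliberately simple sketch. It is a hypothetical illustration, not a recommended or established answer: each bit is rescaled by the absolute Pearson correlation between that bit and y, so bits that track y dominate any later distance or similarity computation. The names `correlation_weights` and `apply_weights` are assumptions.

```python
import numpy as np

def correlation_weights(X, y):
    """Absolute Pearson correlation between each column of X and y.

    Constant columns (zero variance) get weight 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # centre each feature
    yc = y - y.mean()                # centre the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    w = np.zeros(X.shape[1])
    ok = denom > 0
    w[ok] = np.abs(Xc[:, ok].T @ yc) / denom[ok]
    return w

def apply_weights(X, w):
    """Rescale every feature by its weight (a diagonal linear transform)."""
    return np.asarray(X, dtype=float) * w

# Tiny example: bit 0 determines y, bit 1 is constant, bit 2 is unrelated to y.
X_small = np.array([[1, 0, 1],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 0, 0]])
y_small = np.array([2.0, 0.0, 2.0, 0.0])

w_small = correlation_weights(X_small, y_small)
X_small_t = apply_weights(X_small, w_small)
```

In this toy case only bit 0 survives the weighting, so rows with equal y become identical after the transform. The weights would have to be fitted on training data only, since they are computed from y.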
Comments (1)
After more thought, I believe this question can be reframed as a supervised clustering problem. Something that accomplishes this might be as simple as: