Simplest way to map data from doubles to integers without losing consistency
I'm trying to find the best way to normalize consistently.
Basically, I have a certain number of instances, each with a certain number of attributes that hold floating-point values:
For example:
At1 At2 At3
0.1 0.3 3.0
0.1 4.5 2.1
...
I want to map each attribute to integer values while staying consistent with the data.
For example, I tried a simple approach: for each attribute, take the difference between its maximum and minimum values, divide that range into an arbitrary number of equal intervals (say 10), and map every double value of the attribute to the index of its corresponding interval. That normalizes my attributes to integer values between 1 and 10...
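A minimal sketch of that equal-width binning, for reference. The function name `equal_width_bins`, the 1-based indexing, and the clamping of the maximum into the last interval are illustrative assumptions, not details given in the question.

```python
def equal_width_bins(values, n_bins=10):
    """Map each float to a 1-based interval index in [1, n_bins]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls in the first interval.
        return [1 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would otherwise land in interval n_bins + 1, so clamp it.
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]


at2 = [0.3, 4.5]               # the two At2 values shown in the table
print(equal_width_bins(at2))   # [1, 10]
```

Each attribute (column) would be passed through the function separately, since the minimum and maximum are computed per attribute.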
But I would like an approach that uses the smallest possible number of intervals for each attribute without losing consistency. For example, if I have an attribute with three possible values, 1.2, 3.5 and 223.3, then with my approach and, say, 10 intervals, I would end up with a lot of unnecessary intervals for that attribute and a LOT of wasted space...
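To make the waste concrete, here is a small sketch that counts which of the 10 equal-width intervals are actually used for 1.2, 3.5 and 223.3. The helper name `occupied_bins` and the 1-based, clamped indexing are assumptions made for illustration.

```python
def occupied_bins(values, n_bins=10):
    """Return the sorted 1-based indices of the intervals that receive a value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1]
    width = (hi - lo) / n_bins
    # Clamp so the maximum falls into the last interval instead of n_bins + 1.
    return sorted({min(int((v - lo) / width) + 1, n_bins) for v in values})


values = [1.2, 3.5, 223.3]
used = occupied_bins(values)
print(used)              # [1, 10]
print(10 - len(used))    # 8 intervals are never used
```

Besides the eight empty intervals, 1.2 and 3.5 end up in the same interval, so two distinct values become indistinguishable after the mapping.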
Any suggestions?
Comments (1)
I think you're asking about encoding for compression, or more specifically, how to find a 1-1 map of reals to integers.
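One simple reading of a 1-1 map, for a finite set of observed values, is to number the distinct values in sorted order. The sketch below is an assumption about what such a map could look like, not something spelled out in this answer; `dense_codes` is a hypothetical name, and it relies on exact floating-point equality to decide which values count as "the same".

```python
def dense_codes(values):
    """Return (codes, table), where table maps each distinct value to 1..k in sorted order."""
    table = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [table[v] for v in values], table


codes, table = dense_codes([1.2, 223.3, 3.5, 1.2])
print(codes)   # [1, 3, 2, 1]
print(table)   # {1.2: 1, 3.5: 2, 223.3: 3}
```

Keeping `table` around makes the map reversible, and the integer order matches the order of the original values.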
Huffman encoding is probably the most famous, and for a given set of symbol frequencies it can be proven to minimize the expected code length among prefix codes (it has the fewest wasted intervals, in your terminology). Range encoding is also popular.
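A minimal sketch of Huffman coding over the distinct values of one attribute, assuming each distinct float is treated as a symbol and its frequency is its count in the column. The function name `huffman_codes` and the sample data are illustrative. Note that the result is a variable-length bit string per value rather than a small integer.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(values):
    """Build a prefix code (value -> bit string) from value frequencies."""
    freq = Counter(values)
    if len(freq) == 1:
        # A single symbol still needs one bit to be representable.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {v: ""}) for v, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + c for v, c in c1.items()}
        merged.update({v: "1" + c for v, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


column = [1.2, 1.2, 1.2, 3.5, 3.5, 223.3]
codes = huffman_codes(column)
for value, bits in sorted(codes.items()):
    print(value, bits)
```

With these counts the most frequent value, 1.2, gets a 1-bit code and the other two values get 2-bit codes. Range encoding is not shown here.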