Data is highly skewed and the value range is too large
I'm trying to rescale and normalize my dataset.
My data is highly skewed, and the range of values is too large, which is affecting my models' performance.
I've tried RobustScaler() and PowerTransformer(), with no improvement.
Below you can see the boxplot and KDE plot, and also the skew() test of my data:
df_test.agg(['skew', 'kurtosis']).transpose()
The data is financial, so it can take a large range of values (they are not really outliers).
Comments (1)
Depending on your data, there are several ways to handle this. There is, however, a function that helps with skewed data by applying a preliminary transformation before you normalize.
Go to this repo (https://github.com/datamadness/Automatic-skewness-transformation-for-Pandas-DataFrame) and download the files skew_autotransform.py and TEST_skew_autotransform.py. Put them in the same folder as your code. Use the function in the same way as in this example:
It will return several graphs and measures of the skewness of each variable, but most importantly a transformed dataframe in which the skewed data has been handled:
Original data:
and the transformed data:
After having done this, normalize the data if you need to.
Update
Given the ranges of some of your data, you probably need to do this case by case and by trial and error. There are several normalizers you can use to test different approaches. I'll show a few of them on an example column.
Sigmoid
Define the following function
and do
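A minimal sketch of what a sigmoid-based rescaling could look like. The column here is synthetic (a lognormal stand-in named `example_column`, not the real data), and standardizing before the sigmoid is an assumption made so that large raw values do not saturate at 1; the skewness it prints will not match the figures quoted in this answer.

```python
import numpy as np
import pandas as pd


def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Synthetic stand-in for a heavily right-skewed financial column (hypothetical data).
rng = np.random.default_rng(0)
col = pd.Series(rng.lognormal(mean=10, sigma=2, size=1000), name="example_column")

# Assumption: standardize first so the sigmoid is not fed raw values in the millions.
z = (col - col.mean()) / col.std()
col_sigmoid = sigmoid(z)

print(f"skew before: {col.skew():.2f}")
print(f"skew after sigmoid: {col_sigmoid.skew():.2f}")
```

The output is bounded, which addresses the range problem, but how much the skewness changes depends on how the input is scaled before the sigmoid is applied.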
'LiabilitiesNoncurrent_total' had a positive skewness of 8.85.
The transformed column has a skewness of -2.81.
Log+1 Normalization
Another approach is to use a logarithmic function and then to normalize.
The skewness is reduced by approximately the same amount.
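As a rough sketch of the log+1 idea, the snippet below applies `np.log1p` and then min-max normalization to a synthetic lognormal column. The data and the name `example_column` are placeholders, and it assumes the values are non-negative, since `log1p` is only defined for inputs greater than -1.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a heavily right-skewed, non-negative column (hypothetical data).
rng = np.random.default_rng(0)
col = pd.Series(rng.lognormal(mean=10, sigma=2, size=1000), name="example_column")

# log(1 + x) compresses large values while keeping 0 mapped to 0.
col_log = np.log1p(col)

# Min-max normalization rescales the logged values into [0, 1].
col_norm = (col_log - col_log.min()) / (col_log.max() - col_log.min())

print(f"skew before: {col.skew():.2f}")
print(f"skew after log1p + min-max: {col_norm.skew():.2f}")
```

Min-max scaling is a linear map, so it does not change the skewness; the reduction comes entirely from the logarithm.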
I would opt for this last option rather than a sigmoidal approach. I also suspect that you can apply this solution to all your features.
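If you want to try this on every feature at once, something along these lines may work. Financial columns can be negative, so this sketch swaps in a signed variant, sign(x) * log(1 + |x|), which is not what the answer above used. The frame `df_example` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd


def signed_log1p(df):
    """Element-wise sign(x) * log(1 + |x|); unlike log1p, defined for negative values."""
    return np.sign(df) * np.log1p(np.abs(df))


# Hypothetical frame: one strictly positive column and one with mixed signs.
rng = np.random.default_rng(1)
df_example = pd.DataFrame({
    "assets": rng.lognormal(10, 2, 500),
    "net_income": rng.lognormal(8, 2, 500) * rng.choice([-1, 1], 500),
})

# Transform every numeric column, then compare skewness before and after.
numeric_cols = df_example.select_dtypes(include="number").columns
df_log = signed_log1p(df_example[numeric_cols])

comparison = pd.DataFrame({
    "skew_before": df_example[numeric_cols].skew(),
    "skew_after": df_log.skew(),
})
print(comparison)
```

Check the comparison table column by column: a column whose values sit far from zero on both sides can end up bimodal after the signed log, so it is worth plotting the result before feeding it to a model.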