Highly skewed data with a very large value range

Posted on 2025-01-31 05:10:28


I'm trying to rescale and normalize my dataset. My data is highly skewed and the value range is also too large, which is affecting my models' performance. I've tried using RobustScaler() and PowerTransformer(), and yet there is no improvement.

Below you can see the boxplot and KDE plot, and also the skew() test of my data:


df_test.agg(['skew', 'kurtosis']).transpose()


The data is financial data, so it can take a large range of values (they are not really outliers).
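For reference, this is how the two scalers mentioned above are typically applied (a sketch on synthetic data, since the actual dataset is not shown). RobustScaler is an affine recenter-and-rescale, so it cannot change skewness at all; PowerTransformer actually reshapes the distribution, which explains why only the latter can help with skew:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, PowerTransformer

# synthetic heavy-tailed column standing in for the financial data
# (assumption: the real dataset is not shown, so these numbers are illustrative)
rng = np.random.default_rng(1)
df = pd.DataFrame({'value': rng.lognormal(mean=10, sigma=3, size=500)})

robust = pd.DataFrame(RobustScaler().fit_transform(df), columns=df.columns)
power = pd.DataFrame(PowerTransformer().fit_transform(df), columns=df.columns)

# RobustScaler leaves skewness unchanged; PowerTransformer (Yeo-Johnson) reduces it
print('raw skew:    %.2f' % df['value'].skew())
print('robust skew: %.2f' % robust['value'].skew())
print('power skew:  %.2f' % power['value'].skew())
```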


Comments (1)

两人的回忆 2025-02-07 05:10:28


Depending on your data, there are several ways to handle this. There is however a function that will help you handle skew data by doing a preliminary transformation to your normalization effort.

Go to this repo (https://github.com/datamadness/Automatic-skewness-transformation-for-Pandas-DataFrame) and download the files skew_autotransform.py and TEST_skew_autotransform.py. Put them in the same folder as your code, and use them in the same way as in this example:

import pandas as pd
import numpy as np
# note: load_boston was removed in scikit-learn 1.2; run this with an older
# version, or load the Boston housing data from another source
from sklearn.datasets import load_boston

from skew_autotransform import skew_autotransform

exampleDF = pd.DataFrame(load_boston()['data'],
                         columns=load_boston()['feature_names'].tolist())

transformedDF = skew_autotransform(exampleDF.copy(deep=True), plot=True,
                                   exp=False, threshold=0.5)

print('Original average skewness value was %2.2f' % np.mean(abs(exampleDF.skew())))
print('Average skewness after transformation is %2.2f' % np.mean(abs(transformedDF.skew())))

It will return several graphs and measures of the skewness of each variable, but most importantly a transformed DataFrame with the skewed data handled:

Original data:

 CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0    0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1    0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2    0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3    0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4    0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   
..       ...   ...    ...   ...    ...    ...   ...     ...  ...    ...   
501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0   
502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0   
503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0   
504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0   
505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0   

     PTRATIO       B  LSTAT  
0       15.3  396.90   4.98  
1       17.8  396.90   9.14  
2       17.8  392.83   4.03  
3       18.7  394.63   2.94  
4       18.7  396.90   5.33  
..       ...     ...    ...  
501     21.0  391.99   9.67  
502     21.0  396.90   9.08  
503     21.0  396.90   5.64  
504     21.0  393.45   6.48  
505     21.0  396.90   7.88  

[506 rows x 13 columns]

and the transformed data:

      CRIM         ZN  INDUS           CHAS       NOX     RM         AGE  \
0   -6.843991   1.708418   2.31 -587728.314092 -0.834416  6.575  201.623543   
1   -4.447833 -13.373080   7.07 -587728.314092 -1.092408  6.421  260.624267   
2   -4.448936 -13.373080   7.07 -587728.314092 -1.092408  7.185  184.738608   
3   -4.194470 -13.373080   2.18 -587728.314092 -1.140400  6.998  125.260171   
4   -3.122838 -13.373080   2.18 -587728.314092 -1.140400  7.147  157.195622   
..        ...        ...    ...            ...       ...    ...         ...   
501 -3.255759 -13.373080  11.93 -587728.314092 -0.726384  6.593  218.025321   
502 -3.708638 -13.373080  11.93 -587728.314092 -0.726384  6.120  250.894792   
503 -3.297348 -13.373080  11.93 -587728.314092 -0.726384  6.976  315.757117   
504 -2.513274 -13.373080  11.93 -587728.314092 -0.726384  6.794  307.850962   
505 -3.643173 -13.373080  11.93 -587728.314092 -0.726384  6.030  269.101967   

          DIS       RAD       TAX        PTRATIO             B     LSTAT  
0    1.264870  0.000000  1.807258   32745.311816  9.053163e+08  1.938257  
1    1.418585  0.660260  1.796577   63253.425063  9.053163e+08  2.876983  
2    1.418585  0.660260  1.796577   63253.425063  8.717663e+08  1.640387  
3    1.571460  1.017528  1.791645   78392.216639  8.864906e+08  1.222396  
4    1.571460  1.017528  1.791645   78392.216639  9.053163e+08  2.036925  
..        ...       ...       ...            ...           ...       ...  
501  0.846506  0.000000  1.803104  129845.602554  8.649562e+08  2.970889  
502  0.776403  0.000000  1.803104  129845.602554  9.053163e+08  2.866089  
503  0.728829  0.000000  1.803104  129845.602554  9.053163e+08  2.120221  
504  0.814408  0.000000  1.803104  129845.602554  8.768178e+08  2.329393  
505  0.855697  0.000000  1.803104  129845.602554  9.053163e+08  2.635552  

[506 rows x 13 columns]

After having done this, normalize the data if you need to.
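If downloading the repo is not an option, the core idea (apply a power transform only to columns whose absolute skewness exceeds a threshold) can be sketched with scikit-learn alone. The threshold value and the column-wise loop below are my assumptions about the approach, not the repo's exact algorithm:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

def autotransform_skewed(df, threshold=0.5):
    """Yeo-Johnson-transform every column whose |skew| exceeds the threshold."""
    out = df.copy()
    for col in out.columns:
        if abs(out[col].skew()) > threshold:
            pt = PowerTransformer(method='yeo-johnson')  # handles negatives too
            out[col] = pt.fit_transform(out[[col]]).ravel()
    return out

# toy frame: one heavily right-skewed column, one roughly symmetric column
rng = np.random.default_rng(42)
df = pd.DataFrame({'skewed': rng.lognormal(0, 2, 500),
                   'symmetric': rng.normal(0, 1, 500)})
transformed = autotransform_skewed(df)
print('%.2f -> %.2f' % (df['skewed'].skew(), transformed['skewed'].skew()))
```

Yeo-Johnson is used rather than Box-Cox because it also accepts zero and negative values, which financial columns often contain.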

Update

Given the ranges of some of your data, you will probably need to do this case by case and by trial and error. There are several normalizers you can use to test different approaches. I'll give you a few of them on an example column:

exampleDF = pd.read_csv("test.csv", sep=",")
exampleDF = pd.DataFrame(exampleDF['LiabilitiesNoncurrent_total'])

     LiabilitiesNoncurrent_total
count                 6.000000e+02
mean                  8.865754e+08
std                   3.501445e+09
min                  -6.307000e+08
25%                   6.179232e+05
50%                   1.542650e+07
75%                   3.036085e+08
max                   5.231900e+10

Sigmoid

Define the following function

def sigmoid(x):
    # squashes any real value into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

and do

df = sigmoid(exampleDF.LiabilitiesNoncurrent_total)
df = pd.DataFrame(df)

'LiabilitiesNoncurrent_total' had a 'positive' skewness of 8.85.

The transformed one has a skewness of -2.81.

Log+1 Normalization

Another approach is to use a logarithmic function and then to normalize.

def normalize(column):
    # min-max scale the column into the [0, 1] range
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y

# note: np.log(x + 1) yields NaN for any entry below -1, and this
# column's minimum is negative
df = np.log(exampleDF['LiabilitiesNoncurrent_total'] + 1)
df_normalized = normalize(df)

The skewness is reduced by approximately the same amount.

I would opt for this last option rather than a sigmoidal approach. I also suspect that you can apply this solution to all your features.
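One caveat with the log(x + 1) approach: np.log returns NaN for any argument below zero, and the example column's minimum is about -6.3e8, so those rows are silently lost. A signed log1p (my suggestion, not something the original answer uses) keeps the compression while preserving negative values:

```python
import numpy as np
import pandas as pd

def signed_log1p(x):
    # compress magnitudes like log1p, but keep the sign so negatives survive
    return np.sign(x) * np.log1p(np.abs(x))

# the five-number summary of the example column, including the negative minimum
col = pd.Series([-6.307e8, 6.179232e5, 1.542650e7, 3.036085e8, 5.2319e10])
print(signed_log1p(col))
```

Like log(x + 1), this map is monotonic, so it preserves the ordering of the values and can be followed by the same min-max normalization.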
