Numerical vs. categorical variables: why 100% correlation with high-cardinality categorical variables?
I am new to data science and trying to get a grip on exploratory data analysis. My goal is a correlation matrix between all the variables. For numerical variables I use Pearson's r; for categorical variables I use the corrected Cramér's V. The remaining issue is getting a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The problem is that categorical variables with high cardinality show a high correlation no matter what:

[image: correlation matrix, categorical vs. numerical]

This seems nonsensical, since it would effectively show the cardinality of the categorical variable rather than its correlation with the numerical variable. The question is: how do I deal with this issue to get a meaningful correlation?

The Python code below shows how I implemented the correlation ratio:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']

def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    cu = cats.unique()
    # one slot per category level (not per row)
    elements_avg, elements_count = np.zeros(cu.size), np.zeros(cu.size)
    for i, cn in enumerate(cu):
        filt = cats == cn
        elements_count[i] = filt.sum()        # group size
        elements_avg[i] = nums[filt].mean()   # group mean
    # between-group variation
    numerator = np.sum(elements_count * (elements_avg - avgtotal) ** 2)
    # total variation
    denominator = np.sum((nums - avgtotal) ** 2)
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)

rows = []
for cat in cat_cols:
    row = []
    for num in num_cols:
        row.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(row)

df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()
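The saturation is easy to see in the example data: cat12 has one observation per level, so every group mean equals its single observation, the between-group sum of squares equals the total, and η = 1 regardless of any real relationship. One possible remedy, by analogy with adjusted R², is to penalise the number of levels k. A minimal sketch under that assumption (the penalty form is a common choice, not something from the original post):

```python
import numpy as np
import pandas as pd

def adjusted_corr_ratio(cats, nums):
    """Correlation ratio (eta) with a cardinality penalty analogous
    to adjusted R-squared. Assumption: this adjusted-eta^2 form is
    one common correction, not the only one in the literature."""
    nums = np.asarray(nums, dtype=float)
    n = nums.size
    grand_mean = nums.mean()
    ss_total = np.sum((nums - grand_mean) ** 2)
    groups = [nums[np.asarray(cats) == c] for c in pd.unique(cats)]
    k = len(groups)
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    eta2 = 0.0 if ss_total == 0 else ss_between / ss_total
    if n <= k:
        # one observation per level: eta is trivially 1 and the
        # correction is undefined, so refuse to report a value
        return float('nan')
    eta2_adj = 1 - (1 - eta2) * (n - 1) / (n - k)
    return np.sqrt(max(eta2_adj, 0.0))

train = pd.DataFrame({
    'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
print(adjusted_corr_ratio(train['cat3'], train['num3']))   # ≈ 0.083
print(adjusted_corr_ratio(train['cat12'], train['num3']))  # nan: k == n
```

With this correction, cat3 vs. num3 drops from η ≈ 0.43 to ≈ 0.08, and cat12 becomes undefined instead of a spurious 1.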
Comments (2)
It could be that what you are visualising in your seaborn plot is something closer to chi-squared than to Cramér's V. Cramér's V is derived from chi-squared but is not equivalent to it, so a specific cell can show a high chi-squared value while the corresponding Cramér's V is more moderate. I am also not sure it makes sense to compare raw values across modalities, since they can be on completely different orders of magnitude.

[image: chi-squared formula]
[image: Cramér's V formula]
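To make the chi-squared → Cramér's V relationship concrete, here is a sketch of the bias-corrected Cramér's V (the Bergsma–Wicher correction, which is what "corrected Cramér's V" in the question usually refers to); the function name and code shape are my own:

```python
import numpy as np
import pandas as pd

def corrected_cramers_v(x, y):
    """Bias-corrected Cramér's V for two categorical series.
    Sketch of the Bergsma-Wicher correction, which shrinks the
    inflation caused by small samples and many categories."""
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    # expected counts under independence: outer product of margins / n
    expected = table.sum(axis=1, keepdims=True) @ table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()  # plain chi-squared
    phi2 = chi2 / n
    r, k = table.shape
    # bias-corrected phi^2 and corrected table dimensions
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return 0.0 if denom <= 0 else float(np.sqrt(phi2_corr / denom))
```

Unlike raw chi-squared, the result is normalised to [0, 1], so values are comparable across variable pairs.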
If I am not mistaken, there is another method called Theil's U. How about trying it out to see whether the same problem occurs?

You can select the column lists like this:
num_cols:
your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols:
your_df.select_dtypes(include=['object']).columns.to_list()
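Theil's U can be sketched directly from its entropy definition (this implementation is my own sketch, not taken from a particular library). It is asymmetric: U(x|y) is the fraction of x's entropy explained by knowing y. Note the same cardinality caveat applies here: if y has a unique value per row, then H(x|y) = 0 and U(x|y) = 1.

```python
import numpy as np
import pandas as pd

def theils_u(x, y):
    """Theil's U (uncertainty coefficient) U(x|y) in [0, 1]:
    the fraction of the entropy of x explained by knowing y."""
    x, y = pd.Series(x), pd.Series(y)
    # H(x): marginal entropy of x
    p_x = x.value_counts(normalize=True).to_numpy()
    h_x = -np.sum(p_x * np.log(p_x))
    if h_x == 0:
        return 1.0  # x is constant; define U as 1 here (convention varies)
    # H(x|y): conditional entropy, weighted by P(y)
    h_xy = 0.0
    for val, p_y in y.value_counts(normalize=True).items():
        p_cond = x[y == val].value_counts(normalize=True).to_numpy()
        h_xy += p_y * -np.sum(p_cond * np.log(p_cond))
    return (h_x - h_xy) / h_x

print(theils_u([0, 1, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1]))  # 1.0: y determines x
print(theils_u([0, 0, 1, 1], [0, 1, 0, 1]))              # 0.0: independent
```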